Class: Chinese::Vocab
- Inherits:
- Object
- Object > Chinese::Vocab
- Includes:
- HelperMethods, WithValidations
- Defined in:
- lib/chinese/vocab.rb
Constant Summary
- OPTIONS =
Mandatory constant for the Options module. Each key-value pair is of the following type: option_key => [default_value, validation]
{
  :compact      => [false, lambda {|value| is_boolean?(value) }],
  :with_pinyin  => [true,  lambda {|value| is_boolean?(value) }],
  :thread_count => [8,     lambda {|value| value.kind_of?(Integer) }]
}
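A minimal sketch of how a table of `option_key => [default_value, validation]` pairs could be consumed. `SKETCH_OPTIONS`, `BOOLEAN` and `sketch_validate` are hypothetical names for illustration only; the actual validation lives in the WithValidations module, which is not shown on this page.

```ruby
# Hypothetical stand-in for the OPTIONS table: key => [default, validation].
BOOLEAN = lambda { |value| value == true || value == false }

SKETCH_OPTIONS = {
  :compact      => [false, BOOLEAN],
  :with_pinyin  => [true,  BOOLEAN],
  :thread_count => [8,     lambda { |value| value.kind_of?(Integer) }]
}

# Returns the user-supplied value for `key`, or the default if none was given.
# Raises ArgumentError when the value fails its validation lambda.
def sketch_validate(options, key)
  default, validation = SKETCH_OPTIONS.fetch(key)
  value = options.fetch(key, default)
  unless validation.call(value)
    raise ArgumentError, "Invalid value for #{key.inspect}: #{value.inspect}"
  end
  value
end

p sketch_validate({}, :thread_count)               # falls back to the default
p sketch_validate({ :compact => true }, :compact)  # uses the supplied value
```

Keeping the default and the validation side by side means a new option only needs one new entry in the table.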
Instance Attribute Summary
-
- (Boolean) compact
readonly
The value of the :compact options key.
-
- (Array<String>) not_found
readonly
The words for which no sentence could be found in any of the supported online dictionaries during a call to either #sentences or #min_sentences.
-
- (Array<Hash>) stored_sentences
readonly
Holds the return value of either #sentences or #min_sentences, whichever was called last.
-
- (Boolean) with_pinyin
readonly
The value of the :with_pinyin option key.
-
- (Object) words
readonly
Returns the value of attribute words.
Class Method Summary
-
+ (Array<String>) parse_words(path_to_csv, word_col, options = {})
Extracts the vocabulary column from a CSV file as an array of strings.
-
+ (Boolean) within_range?(column, row)
Checks whether column (the word column number, counting from 1) lies within row (an array of processed CSV data that contains the word column).
Instance Method Summary
- - (Object) add_key(hash_array, key, &block)
- - (Object) add_target_words(hash_array)
- - (Object) alternate_source(sources, selection)
- - (Boolean) contains_all_target_words?(selected_rows, sentence_key)
- - (Object) convert(text)
-
- (Object) edit_vocab(word_array)
Remove all non-word characters.
-
- (Vocab) initialize(word_array, options = {})
constructor
Initializes an object.
- - (Boolean) is_boolean?(value)
- - (Object) make_hash(*data)
- - (Array<Hash>, []) min_sentences(options)
- - (Object) remove_er_character_from_end(word)
- - (Object) remove_keys(hash_array, *keys)
-
- (Object) remove_parens(word)
Removes ASCII and Chinese parentheses, and everything between them, from a word.
-
- (Object) remove_redundant_single_char_words(words)
Input: ["看", "书", "看书"] Output: ["看书"].
- - (Object) remove_slash(word)
- - (Object) select_minimum_necessary_sentences(sentences)
-
- (Object) select_sentence(word, options)
Uses options passed from #sentences.
-
- (Hash) sentences(options)
For every Chinese word in #words fetches a Chinese sentence and its English translation from an online dictionary.
-
- (Array<String>) sentences_unique_chars(sentences)
Finds the unique Chinese characters from either the data in #stored_sentences or an array of Chinese sentences passed as an argument.
- - (Object) sort_by_target_word_count(with_target_words)
- - (Object) target_words_per_sentence(sentence, words)
-
- to_csv(path_to_file, options = {})
Saves the data stored in #stored_sentences to disk.
- - (Object) try_alternate_download_sources(alternate_sources, word, options)
- - (Object) uwc_tag(string)
Methods included from HelperMethods
#distinct_words, #include_every_char?, #is_unicode?
Constructor Details
- (Vocab) initialize(word_array, options) - (Vocab) initialize(word_array)
Initializes an object.
# File 'lib/chinese/vocab.rb', line 52
def initialize(word_array, options={})
  @compact = validate { :compact }
  @words = edit_vocab(word_array)
  @words = remove_redundant_single_char_words(@words) if @compact
  @chinese = is_unicode?(@words[0])
  @not_found = []
  @stored_sentences = []
end
Instance Attribute Details
- (Boolean) compact (readonly)
The value of the :compact options key.
# File 'lib/chinese/vocab.rb', line 21
def compact
  @compact
end
- (Array<String>) not_found (readonly)
The words for which no sentence could be found in any of the supported online dictionaries during a call to either #sentences or #min_sentences.
Defaults to [].
# File 'lib/chinese/vocab.rb', line 25
def not_found
  @not_found
end
- (Array<Hash>) stored_sentences (readonly)
Holds the return value of either #sentences or #min_sentences,
whichever was called last. Defaults to [].
# File 'lib/chinese/vocab.rb', line 30
def stored_sentences
  @stored_sentences
end
- (Boolean) with_pinyin (readonly)
The value of the :with_pinyin option key.
# File 'lib/chinese/vocab.rb', line 27
def with_pinyin
  @with_pinyin
end
- (Object) words (readonly)
Returns the value of attribute words
# File 'lib/chinese/vocab.rb', line 19
def words
  @words
end
Class Method Details
+ (Array<String>) parse_words(path_to_csv, word_col, options) + (Array<String>) parse_words(path_to_csv, word_col)
Extracts the vocabulary column from a CSV file as an array of strings. The array is normally provided as an argument to #initialize
# File 'lib/chinese/vocab.rb', line 74
def self.parse_words(path_to_csv, word_col, options={})
  # Enforced options:
  #   encoding: utf-8 (necessary for parsing Chinese characters)
  #   skip_blanks: true
  options.merge!({:encoding => 'utf-8', :skip_blanks => true})
  csv = CSV.read(path_to_csv, options)

  raise ArgumentError, "Column number (#{word_col}) out of range." unless within_range?(word_col, csv[0])
  # 'word_col' counting starts at 1, but CSV.read returns an array,
  # where counting starts at 0.
  col = word_col-1
  csv.reduce([]) {|words, row|
    word = row[col]
    # If word_col contains no data, CSV::read returns nil.
    # We also want to skip empty strings or strings that only contain whitespace.
    words << word unless word.nil? || word.strip.empty?
    words
  }
end
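The column extraction described above can be tried without a file on disk. The following is a sketch under stated assumptions, not the gem's code: `sketch_column` is a hypothetical helper that parses CSV text from a string with the standard `CSV.parse`, applies the same 1-based range check, and skips nil or blank cells.

```ruby
require 'csv'

# Hypothetical helper: returns the non-blank cells of a 1-based column.
def sketch_column(csv_text, word_col)
  rows = CSV.parse(csv_text, skip_blanks: true)
  unless word_col >= 1 && word_col <= rows[0].size
    raise ArgumentError, "Column number (#{word_col}) out of range."
  end
  col = word_col - 1 # rows are zero-indexed arrays
  rows.each_with_object([]) do |row, words|
    word = row[col]
    # Empty cells are parsed as nil; whitespace-only cells are skipped as well.
    words << word unless word.nil? || word.strip.empty?
  end
end

data = "1,看书,to read\n2,,empty\n3,女儿,daughter\n"
p sketch_column(data, 2) # => ["看书", "女儿"]
```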
+ (Boolean) within_range?(column, row)
Checks whether column (the word column number, counting from 1) lies within row (an array of processed CSV data that contains the word column).
# File 'lib/chinese/vocab.rb', line 586
def self.within_range?(column, row)
  no_of_cols = row.size
  column >= 1 && column <= no_of_cols
end
Instance Method Details
- (Object) add_key(hash_array, key, &block)
# File 'lib/chinese/vocab.rb', line 530
def add_key(hash_array, key, &block)
  hash_array.map do |row|
    if block
      row.merge({key => block.call(row)})
    else
      row
    end
  end
end
- (Object) add_target_words(hash_array)
# File 'lib/chinese/vocab.rb', line 452
def add_target_words(hash_array)
  puts "Internal: Adding target words..."
  from_queue = Queue.new
  to_queue = Queue.new
  # semaphore = Mutex.new
  result = []
  words = @words
  puts "add_target_words, words.size = #{words.size}"
  hash_array.each {|hash| from_queue << hash}

  10.times.map {
    Thread.new(words) do

      while(row = from_queue.pop!)
        sentence = row[:chinese]
        target_words = target_words_per_sentence(sentence, words)

        to_queue << row.merge(:target_words => target_words)

      end
    end
  }.map {|thread| thread.join}

  to_queue.to_a
end
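The listing above relies on `Queue#pop!` and `Queue#to_a`, which are not part of Ruby's standard `Queue` and are presumably extensions defined elsewhere in the gem. A rough sketch of the same worker pattern with only the standard library might look like this; `sketch_parallel_map` is a hypothetical name.

```ruby
# Drains `items` with a fixed number of worker threads and collects
# the block's results (in completion order, not input order).
def sketch_parallel_map(items, thread_count = 4, &block)
  from_queue = Queue.new
  to_queue   = Queue.new
  items.each { |item| from_queue << item }

  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        begin
          item = from_queue.pop(true) # non-blocking; raises ThreadError when empty
        rescue ThreadError
          break
        end
        to_queue << block.call(item)
      end
    end
  end
  workers.each(&:join)

  results = []
  results << to_queue.pop until to_queue.empty?
  results
end

p sketch_parallel_map([1, 2, 3, 4, 5]) { |n| n * n }.sort
```

Because workers finish in arbitrary order, callers that need the input order should carry an index or key along with each result.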
- (Object) alternate_source(sources, selection)
# File 'lib/chinese/vocab.rb', line 592
def alternate_source(sources, selection)
  sources = sources.dup
  sources.delete(selection)
  sources.pop
end
- (Boolean) contains_all_target_words?(selected_rows, sentence_key)
# File 'lib/chinese/vocab.rb', line 552
def contains_all_target_words?(selected_rows, sentence_key)

  matched_words = @words.reduce([]) do |acc, word|

    result = selected_rows.find do |row|
      sentence = row[sentence_key]
      include_every_char?(word, sentence)
    end

    if result
      acc << word
    end

    acc
  end

  if matched_words.size == @words.size
    true
  else
    puts "-----------------------------"
    puts "#contains_all_target_words?"
    puts "Words not found:"
    p @words - matched_words
    puts "-----------------------------"
    false
  end
  #matched_words.size == @words.size
end
- (Object) convert(text)
# File 'lib/chinese/vocab.rb', line 447
def convert(text)
  eval(text.chomp)
end
- (Object) edit_vocab(word_array)
Remove all non-word characters
# File 'lib/chinese/vocab.rb', line 347
def edit_vocab(word_array)
  puts "Editing vocabulary..."

  word_array.map {|word|
    edited = remove_parens(word)
    edited = remove_slash(edited)
    edited = remove_er_character_from_end(edited)
    distinct_words(edited).join(' ')
  }.uniq
end
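To make the clean-up steps concrete, here is a small sketch of the first three of them applied to single words. `sketch_clean` is a hypothetical helper; it leaves out `distinct_words` (from HelperMethods, not shown here) and assumes that "Chinese parens" means the full-width characters （ and ）. Unlike `remove_slash`, it uses `max_by`, so ties between equally long variants may resolve differently.

```ruby
# Hypothetical helper combining three clean-up steps for a single word.
def sketch_clean(word)
  # 1) Drop parenthesized text (ASCII and full-width parentheses).
  cleaned = word.gsub(/\(.*?\)/, '').gsub(/（.*?）/, '')
  # 2) Of several slash-separated variants keep the longest one.
  longest = cleaned.split('/').max_by(&:size)
  cleaned = longest if longest
  # 3) Strip a trailing "儿" unless the word has only one or two characters.
  cleaned.size > 2 ? cleaned.sub(/儿\z/, '') : cleaned
end

p sketch_clean("看书(v)")   # => "看书"
p sketch_clean("一点儿")     # => "一点"
p sketch_clean("女儿")       # => "女儿"
```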
- (Boolean) is_boolean?(value)
# File 'lib/chinese/vocab.rb', line 340
def is_boolean?(value)
  # Only true for either 'false' or 'true'
  !!value == value
end
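The double negation turns any value into `true` or `false`, so the comparison only succeeds when the value already was one of the two. A short sketch with a hypothetical copy of the method:

```ruby
# Hypothetical copy of the check, for experimentation outside the class.
def sketch_boolean?(value)
  !!value == value
end

[true, false, nil, 0, "true"].each do |value|
  puts "#{value.inspect}: #{sketch_boolean?(value)}"
end
```

Note that truthy or falsy values such as `nil`, `0` or `"true"` are rejected, which is what makes the check usable as an option validation.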
- (Object) make_hash(*data)
# File 'lib/chinese/vocab.rb', line 377
def make_hash(*data)
  require 'digest'
  data = data.reduce("") { |acc, item| acc << item.to_s }
  Digest::SHA2.hexdigest(data)[0..6]
end
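As the #sentences listing shows, this short digest doubles as the name of the file that holds partially downloaded data, so the same words and options need to map to the same id. A minimal sketch of that property, using a hypothetical `sketch_id` helper:

```ruby
require 'digest'

# Concatenates the string forms of all arguments and returns the first
# seven hex characters of their SHA-2 digest.
def sketch_id(*data)
  joined = data.reduce(+"") { |acc, item| acc << item.to_s }
  Digest::SHA2.hexdigest(joined)[0..6]
end

words   = ["看书", "女儿"]
options = { :with_pinyin => true, :thread_count => 8 }

id = sketch_id(words, options.to_a.sort)
puts id                                          # seven hex characters
puts id == sketch_id(words, options.to_a.sort)   # same input, same id
```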
- (Array<Hash>, []) min_sentences(options)
In case of a network error while downloading the sentences, the data fetched so far is automatically copied to a file after several retries. This data is read and processed on the next run to reduce the time spent downloading the sentences (which is by far the most time-consuming part).
For every Chinese word in #words fetches a Chinese sentence and its English translation from an online dictionary, then calculates the minimum number of sentences necessary to cover every word in #words at least once. The calculation is based on the fact that many words occur in more than one sentence.
The return value is also stored in #stored_sentences.
# File 'lib/chinese/vocab.rb', line 254
def min_sentences(options = {})
  @with_pinyin = validate { :with_pinyin }
  # Always run this method.
  thread_count = validate { :thread_count }
  sentences = sentences(options)

  puts "Calculating the minimum necessary sentences..."
  minimum_sentences = select_minimum_necessary_sentences(sentences)
  # :uwc = 'unique words count'
  with_uwc_tag = add_key(minimum_sentences, :uwc) {|row| uwc_tag(row[:target_words]) }
  # :uws = 'unique words string'
   = add_key(with_uwc_tag, :uws) do |row|
    words = row[:target_words].sort.join(', ')
    "[" + words + "]"
  end
  # Remove those keys we don't need anymore
  result = remove_keys(, :target_words, :word)
  @stored_sentences = result
  @stored_sentences
end
- (Object) remove_er_character_from_end(word)
# File 'lib/chinese/vocab.rb', line 359
def remove_er_character_from_end(word)
  if word.size > 2
    word.gsub(/儿$/, '')
  else # Don't remove "儿" from words like 女儿
    word
  end
end
- (Object) remove_keys(hash_array, *keys)
# File 'lib/chinese/vocab.rb', line 525
def remove_keys(hash_array, *keys)
  hash_array.map { |row| row.delete_keys(*keys) }
end
- (Object) remove_parens(word)
Removes ASCII and Chinese parentheses, and everything between them, from a word.
# File 'lib/chinese/vocab.rb', line 333
def remove_parens(word)
  # 1) Remove all ASCII parens and all data in between.
  # 2) Remove all Chinese parens and all data in between.
  word.gsub(/\(.*?\)/, '').gsub(/（.*?）/, '')
end
- (Object) remove_redundant_single_char_words(words)
Input: ["看", "书", "看书"] Output: ["看书"]
# File 'lib/chinese/vocab.rb', line 386
def remove_redundant_single_char_words(words)
  puts "Removing redundant single character words from the vocabulary..."

  single_char_words, multi_char_words = words.partition {|word| word.length == 1 }
  return single_char_words if multi_char_words.empty?

  non_redundant_single_char_words = single_char_words.reduce([]) do |acc, single_c|

    already_found = multi_char_words.find do |multi_c|
      multi_c.include?(single_c)
    end
    # Add single char word to array if it is not part of any of the multi char words.
    acc << single_c unless already_found
    acc
  end

  non_redundant_single_char_words + multi_char_words
end
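The same filtering can be written more compactly with `reject` and `any?`. This is a sketch that mirrors the listing above rather than a replacement for it, and `sketch_compact` is a hypothetical name.

```ruby
# Drops every single-character word that already occurs inside
# one of the multi-character words.
def sketch_compact(words)
  singles, multis = words.partition { |word| word.length == 1 }
  kept = singles.reject { |single| multis.any? { |multi| multi.include?(single) } }
  kept + multis
end

p sketch_compact(["看", "书", "看书", "好"]) # => ["好", "看书"]
```

As in the original, the surviving single-character words come first in the result, followed by the multi-character words.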
- (Object) remove_slash(word)
# File 'lib/chinese/vocab.rb', line 368
def remove_slash(word)
  if word.match(/\//)
    word.split(/\//).sort_by { |w| w.size }.last
  else
    word
  end
end
- (Object) select_minimum_necessary_sentences(sentences)
# File 'lib/chinese/vocab.rb', line 499
def select_minimum_necessary_sentences(sentences)
  with_target_words = add_target_words(sentences)
  rows = sort_by_target_word_count(with_target_words)

  selected_rows = []
  unmatched_words = @words.dup
  matched_words = []

  rows.each do |row|
    words = row[:target_words].dup
    # Delete all words from 'words' that have already been encountered
    # (and are included in 'matched_words').
    words = words - matched_words

    if words.size > 0 # Words that were not deleted above have to be part of 'unmatched_words'.
      selected_rows << row # Select this row.

      # When a row is selected, its 'words' are no longer unmatched but matched.
      unmatched_words = unmatched_words - words
      matched_words = matched_words + words
    end
  end
  selected_rows
end
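The selection above is a single greedy pass over the sorted rows: a row is kept only if it contributes at least one word that has not been matched yet. Such a pass yields a small covering set, though not necessarily the smallest possible one. The sketch below reproduces the idea on hand-made data; `sketch_select` and the sample rows are hypothetical.

```ruby
# Greedy pass: rows with the most target words first, shorter sentences
# first on ties; keep a row only if it adds a word not matched so far.
def sketch_select(rows)
  sorted = rows.sort_by { |row| [-row[:target_words].size, row[:chinese].size] }
  matched = []
  sorted.each_with_object([]) do |row, selected|
    new_words = row[:target_words] - matched
    next if new_words.empty?
    selected << row
    matched += new_words
  end
end

rows = [
  { chinese: "我今天看书", target_words: ["我", "看书"] },
  { chinese: "我很好",     target_words: ["我", "好"] },
  { chinese: "我",         target_words: ["我"] }
]

p sketch_select(rows).map { |row| row[:chinese] } # => ["我很好", "我今天看书"]
```

The third row is dropped because its only target word is already covered by the first selected sentence.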
- (Object) select_sentence(word, options)
Uses options passed from #sentences
# File 'lib/chinese/vocab.rb', line 407
def select_sentence(word, options)
  sentence_pair = Scraper.sentence(word, options)

  sources = Scraper::Sources.keys
  sentence_pair = try_alternate_download_sources(sources, word, options) if sentence_pair.empty?

  if sentence_pair.empty?
    @not_found << word
    return nil
  else
    chinese, english = sentence_pair

    result = Hash.new
    result.merge!(word: word)
    result.merge!(chinese: chinese)
    result.merge!(pinyin: chinese.) if @with_pinyin
    result.merge!(english: english)
  end
end
- (Hash) sentences(options)
Normally you only call this method directly if you really need one sentence per Chinese word (even if these words might appear in more than one of the sentences).
In case of a network error while downloading the sentences, the data fetched so far is automatically copied to a file after several retries. This data is read and processed on the next run to reduce the time spent downloading the sentences (which is by far the most time-consuming part).
For every Chinese word in #words fetches a Chinese sentence and its English translation from an online dictionary. The return value is also stored in #stored_sentences.
# File 'lib/chinese/vocab.rb', line 132
def sentences(options={})
  puts "Fetching sentences..."
  # Always run this method.

  # We assign all options to a variable here (also those that are passed on)
  # as we need them in order to calculate the id.
  @with_pinyin = validate { :with_pinyin }
  thread_count = validate { :thread_count }
  id = make_hash(@words, options.to_a.sort)
  words = @words

  from_queue = Queue.new
  to_queue = Queue.new
  file_name = id

  if File.exist?(file_name)
    puts "examining file"
    words, sentences, not_found = File.open(file_name) { |f| f.readlines }
    words = convert(words)
    convert(sentences).each { |s| to_queue << s }
    @not_found = convert(not_found)
    puts "Size(@not_found) = #{@not_found.size}"
    size_a = words.size
    size_b = to_queue.size
    puts "Size(words) = #{size_a}"
    puts "Size(to_queue) = #{size_b}"
    puts "Size(words+queue) = #{size_a+size_b}"

    # Remove file
    File.unlink(file_name)
  end

  words.each {|word| from_queue << word }
  result = []

  Thread.abort_on_exception = false

  1.upto(thread_count).map {
    Thread.new do

      while(word = from_queue.pop!) do

        begin
          local_result = select_sentence(word, options)
          puts "word: #{word}"
          # rescue SocketError, Timeout::Error, Errno::ETIMEDOUT,
          # Errno::ECONNREFUSED, Errno::ECONNRESET, EOFError => e
        rescue Exception => e
          puts " #{e.}."
          puts "Please DO NOT abort the program but wait for all threads to terminate."
          puts "Number of running threads: #{Thread.list.size - 1}."
          puts "On termination of all threads, the data will be saved to disk for fast retrieval on the next run of the program."
          raise

        ensure
          from_queue << word if $!
          puts "Wrote '#{word}' to 'from_queue'" if $!
        end

        to_queue << local_result unless local_result.nil?

      end
    end
  }.each {|thread| thread.join }

  @stored_sentences = to_queue.to_a
  @stored_sentences

ensure
  if $!
    while(Thread.list.size > 1) do # Wait for all child threads to terminate.
      sleep 5
    end

    File.open(file_name, 'w') do |f|
      p "============================="
      p "Writing data to file..."
      f.write from_queue.to_a
      f.puts
      f.write to_queue.to_a
      f.puts
      f.write @not_found
      puts "Finished writing data."
      puts "Please run the program again after solving the (connection) problem."
    end
  end
end
- (Array<String>) sentences_unique_chars(sentences)
If no argument is passed, the data from #stored_sentences is used as input.
Finds the unique Chinese characters from either the data in #stored_sentences or an array of Chinese sentences passed as an argument.
# File 'lib/chinese/vocab.rb', line 302
def sentences_unique_chars(sentences = stored_sentences)
  # If the argument is an array of hashes, then it must be the data from @stored_sentences
  sentences = sentences.map { |hash| hash[:chinese] } if sentences[0].kind_of?(Hash)

  sentences.reduce([]) do |acc, row|
    acc = acc | row.scan(/\p{Word}/) # only return characters, skip punctuation marks
    acc
  end
end
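A short sketch of the character collection on plain strings: `scan(/\p{Word}/)` returns the word characters of each sentence, and the array union `|` keeps only first occurrences. `sketch_unique_chars` is a hypothetical stand-alone version that accepts an array of strings only.

```ruby
# Collects each word character once, in order of first appearance;
# punctuation such as "。" or "！" is not matched by \p{Word}.
def sketch_unique_chars(sentences)
  sentences.reduce([]) { |acc, sentence| acc | sentence.scan(/\p{Word}/) }
end

p sketch_unique_chars(["我看书。", "我很好！"]) # => ["我", "看", "书", "很", "好"]
```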
- (Object) sort_by_target_word_count(with_target_words)
# File 'lib/chinese/vocab.rb', line 485
def sort_by_target_word_count(with_target_words)

  # First sort by size of unique word array (from large to short)
  # If the unique word count is equal, sort by the length of the sentence (from small to large)
  with_target_words.sort_by {|row|
    [-row[:target_words].size, row[:chinese].size] }

  # The above is the same as:
  # with_target_words.sort {|a,b|
  #   first = -(a[:target_words].size <=> b[:target_words].size)
  #   first.nonzero? || (a[:chinese].size <=> b[:chinese].size) }
end
- (Object) target_words_per_sentence(sentence, words)
# File 'lib/chinese/vocab.rb', line 480
def target_words_per_sentence(sentence, words)
  words.select {|w| include_every_char?(w, sentence) }
end
- to_csv(path_to_file, options) - to_csv(path_to_file)
This method returns an undefined value.
Saves the data stored in #stored_sentences to disk.
# File 'lib/chinese/vocab.rb', line 321
def to_csv(path_to_file, options = {})

  CSV.open(path_to_file, "w", options) do |csv|
    @stored_sentences.each do |row|
      csv << row.values
    end
  end
end
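Each stored hash is written as one CSV row made of its values, in insertion order. The sketch below shows the resulting text with `CSV.generate` instead of a file; the row keys and the tab separator are assumptions chosen for illustration. On current Ruby versions CSV options are passed as keyword arguments, as done here.

```ruby
require 'csv'

# Illustrative rows shaped like the hashes in #stored_sentences (keys assumed).
rows = [
  { chinese: "我看书。", english: "I read books." },
  { chinese: "我很好。", english: "I am fine." }
]

csv_text = CSV.generate(col_sep: "\t") do |csv|
  rows.each { |row| csv << row.values }
end

puts csv_text
```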
- (Object) try_alternate_download_sources(alternate_sources, word, options)
# File 'lib/chinese/vocab.rb', line 428
def try_alternate_download_sources(alternate_sources, word, options)
  sources = alternate_sources.dup
  sources.delete(options[:source])

  result = sources.find do |s|
    options = options.merge(:source => s)
    sentence = Scraper.sentence(word, options)
    sentence.empty? ? nil : sentence
  end

  if result
    options = options.merge(:source => result)
    Scraper.sentence(word, options)
  else
    []
  end
end
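The fallback logic can be illustrated without the Scraper class. In this sketch, `sketch_fallback`, the source names and the `fetch` lambda are all hypothetical; the lambda stands in for a dictionary lookup that returns an empty array when nothing is found. Unlike the listing above, it returns the first non-empty result directly instead of querying the successful source a second time.

```ruby
# Tries the preferred source first, then every other source in order,
# and returns the first non-empty result (or [] if all sources fail).
def sketch_fallback(sources, preferred, fetch)
  ([preferred] + (sources - [preferred])).each do |source|
    result = fetch.call(source)
    return result unless result.empty?
  end
  []
end

# Stand-in for a lookup: only the second source knows the word.
fake_fetch = lambda do |source|
  source == :second ? ["我看书。", "I read books."] : []
end

p sketch_fallback([:first, :second], :first, fake_fetch)
p sketch_fallback([:first], :first, fake_fetch)
```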
- (Object) uwc_tag(string)
# File 'lib/chinese/vocab.rb', line 541
def uwc_tag(string)
  size = string.length
  case size
  when 1
    "1_word"
  else
    "#{size}_words"
  end
end