Module: Ebooks::NLP

Defined in:
lib/moo_ebooks/nlp.rb

Constant Summary

PUNCTUATION =

We deliberately limit our punctuation handling to stuff we can do consistently. It'll just be a part of another token if we don't split it out, and that's fine

'.?!,'

Class Method Summary

Class Method Details

.htmlentities ⇒ HTMLEntities

Lazily load HTML entity decoder

Returns:

  • (HTMLEntities)

# File 'lib/moo_ebooks/nlp.rb', line 29

def self.htmlentities
  @htmlentities ||= HTMLEntities.new
end
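The `||=` idiom builds the HTMLEntities instance on the first call and reuses it on every call after that. A stand-alone illustration of the same memoization pattern (`Object.new` stands in for `HTMLEntities.new` here so the sketch needs no gems):

```ruby
$builds = 0

def decoder
  @decoder ||= begin
    $builds += 1  # construction happens only on the first call
    Object.new    # stands in for HTMLEntities.new
  end
end

decoder
decoder
p $builds              # => 1
p decoder.equal?(decoder)  # => true — same object every time
```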

.keywords(text) ⇒ Highscore::Keywords

Use highscore gem to find interesting keywords in a corpus

Parameters:

  • text (String)

Returns:

  • (Highscore::Keywords)

# File 'lib/moo_ebooks/nlp.rb', line 67

def self.keywords(text)
  # Preprocess to remove stopwords (highscore's blacklist is v. slow)
  text = NLP.tokenize(text).reject { |t| stopword?(t) }.join(' ')

  text = Highscore::Content.new(text)

  text.configure do
    # set :multiplier, 2
    # set :upper_case, 3
    # set :long_words, 2
    # set :long_words_threshold, 15
    # set :vowels, 1                     # => default: 0 = not considered
    # set :consonants, 5                 # => default: 0 = not considered
    # set :ignore_case, true             # => default: false
    set :word_pattern, /(?<!@)(?<=\s)[\p{Word}']+/ # => default: /\w+/
    # set :stemming, true                # => default: false
  end

  text.keywords
end
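A full example would need the highscore gem, so here is a self-contained sketch of just the preprocessing step: dropping stopwords before handing the text to Highscore. The stopword list is a stand-in for the gem's `data/stopwords.txt`, and plain `split` stands in for `NLP.tokenize`:

```ruby
require 'set'

# Stand-in stopword list; the gem loads data/stopwords.txt instead.
STOPWORDS = %w[the a of to and].to_set

def stopword?(token)
  STOPWORDS.include?(token.downcase)
end

# The preprocessing line from .keywords: reject stopwords up front,
# since highscore's own blacklist mechanism is slow.
def strip_stopwords(text)
  text.split.reject { |t| stopword?(t) }.join(' ')
end

puts strip_stopwords('The quick brown fox and the lazy dog')
# => quick brown fox lazy dog
```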

.normalize(text) ⇒ String

Normalize some strange unicode punctuation variants

Parameters:

  • text (String)

Returns:

  • (String)

# File 'lib/moo_ebooks/nlp.rb', line 38

def self.normalize(text)
  htmlentities.decode(text.tr('“', '"').tr('”', '"').tr('’', "'")
    .gsub('…', '...'))
end
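A minimal stand-alone sketch of the same transformation, using the standard library's `CGI.unescapeHTML` in place of the htmlentities gem (an assumption for portability; the real decoder handles a much larger set of entities):

```ruby
require 'cgi'

def normalize(text)
  # Straighten curly quotes, collapse the ellipsis character,
  # then decode HTML entities (CGI stands in for HTMLEntities here).
  CGI.unescapeHTML(text.tr('“', '"').tr('”', '"').tr('’', "'")
    .gsub('…', '...'))
end

puts normalize('“Fancy” &amp; it’s…')
# => "Fancy" & it's...
```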

.punctuation?(token) ⇒ Boolean

Is this token composed entirely of punctuation?

Parameters:

  • token (String)

Returns:

  • (Boolean)

# File 'lib/moo_ebooks/nlp.rb', line 122

def self.punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end
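The set difference is empty exactly when every character of the token appears in `PUNCTUATION`. A self-contained example:

```ruby
require 'set'

PUNCTUATION = '.?!,'

def punctuation?(token)
  # A token counts as punctuation when every character is in PUNCTUATION.
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end

p punctuation?('?!')    # => true
p punctuation?('foo.')  # => false
```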

.reconstruct(tikis, tokens) ⇒ String

Builds a proper sentence from a list of tikis

Parameters:

  • tikis (Array<Integer>)
  • tokens (Array<String>)

Returns:

  • (String)

# File 'lib/moo_ebooks/nlp.rb', line 92

def self.reconstruct(tikis, tokens)
  text = ''
  last_token = nil
  tikis.each do |tiki|
    next if tiki == INTERIM
    token = tokens[tiki]
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end
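A self-contained sketch of reconstruction. `INTERIM` is a sentinel the generator mixes into tiki lists; it is defined elsewhere in the gem and stubbed here, and `space_between?` is reduced to its equivalent one-liner (it returns true exactly when the second token is not punctuation):

```ruby
require 'set'

PUNCTUATION = '.?!,'
INTERIM = :interim  # stub; the gem defines the real sentinel elsewhere

def punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end

def space_between?(token1, token2)
  !punctuation?(token2)  # same truth table as the full method
end

def reconstruct(tikis, tokens)
  text = ''
  last_token = nil
  tikis.each do |tiki|
    next if tiki == INTERIM  # skip sentinel entries
    token = tokens[tiki]
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end

tokens = ['hello', 'there', '!', 'friend']
p reconstruct([0, 1, 2, INTERIM, 3], tokens)
# => "hello there! friend"
```

Note that punctuation tokens are glued to the preceding word, while ordinary tokens get a leading space.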

.sentences(text) ⇒ Array<String>

Split text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation

Parameters:

  • text (String)

Returns:

  • (Array<String>)

# File 'lib/moo_ebooks/nlp.rb', line 49

def self.sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end
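The regex splits on runs of newlines, or on whitespace that immediately follows sentence-ending punctuation. For example:

```ruby
def sentences(text)
  # Split on newline runs, or on whitespace preceded by . ? or !
  text.split(/\n+|(?<=[.?!])\s+/)
end

p sentences("One sentence. Another! A third?\nFourth via newline")
# => ["One sentence.", "Another!", "A third?", "Fourth via newline"]
```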

.space_between?(token1, token2) ⇒ Boolean

Determine if we need to insert a space between two tokens

Parameters:

  • token1 (String)
  • token2 (String)

Returns:

  • (Boolean)

# File 'lib/moo_ebooks/nlp.rb', line 109

def self.space_between?(token1, token2)
  p1 = punctuation?(token1)
  p2 = punctuation?(token2)
  if (p1 && p2) || (!p1 && p2) # "foo?!" || "foo."
    false
  else # "foo rah" || "foo. rah"
    true
  end
end
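A self-contained example of the truth table. As the inline comments hint, punctuation gets glued to the preceding token, and everything else gets a space:

```ruby
require 'set'

PUNCTUATION = '.?!,'

def punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end

def space_between?(token1, token2)
  p1 = punctuation?(token1)
  p2 = punctuation?(token2)
  if (p1 && p2) || (!p1 && p2) # "foo?!" || "foo."
    false
  else # "foo rah" || "foo. rah"
    true
  end
end

p space_between?('foo', 'rah')  # => true
p space_between?('foo', '.')    # => false
p space_between?('.', 'rah')    # => true
```

Observe that the condition reduces to `p2`: a space is inserted exactly when the second token is not punctuation.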

.stopword?(token) ⇒ Boolean

Is this token a stopword?

Parameters:

  • token (String)

Returns:

  • (Boolean)

# File 'lib/moo_ebooks/nlp.rb', line 129

def self.stopword?(token)
  @stopword_set ||= stopwords.map(&:downcase).to_set
  @stopword_set.include?(token.downcase)
end

.stopwords ⇒ Array<String>

Lazily loads an array of stopwords. Stopwords are common words that should often be ignored

Returns:

  • (Array<String>)

# File 'lib/moo_ebooks/nlp.rb', line 23

def self.stopwords
  @stopwords ||= File.read(File.join(DATA_PATH, 'stopwords.txt')).split
end

.subseq?(ary1, ary2) ⇒ Boolean

Determine if ary2 is a subsequence of ary1

Parameters:

  • ary1 (Array)
  • ary2 (Array)

Returns:

  • (Boolean)

# File 'lib/moo_ebooks/nlp.rb', line 164

def self.subseq?(ary1, ary2)
  !ary1.each_index.find do |i|
    ary1[i...i + ary2.length] == ary2
  end.nil?
end
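The method slides a window of `ary2.length` over `ary1` and reports whether any window equals `ary2` (a contiguous subsequence, i.e. a subarray). For example:

```ruby
def subseq?(ary1, ary2)
  !ary1.each_index.find do |i|
    ary1[i...i + ary2.length] == ary2
  end.nil?
end

p subseq?([1, 2, 3, 4], [2, 3])  # => true
p subseq?([1, 2, 3, 4], [3, 2])  # => false (elements present, order wrong)
```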

.tokenize(sentence) ⇒ Array<String>

Split a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with respect to things like emoticons and timestamps

Parameters:

  • sentence (String)

Returns:

  • (Array<String>)

# File 'lib/moo_ebooks/nlp.rb', line 58

def self.tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|
    (?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/x
  sentence.split(regex)
end
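The pattern splits on whitespace, plus two zero-width boundaries: between a word and trailing punctuation that precedes a space, and between punctuation-plus-space and the next word. This peels sentence-internal punctuation into its own token while leaving it attached at the end of the string:

```ruby
PUNCTUATION = '.?!,'

def tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|
    (?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/x
  sentence.split(regex)
end

p tokenize('hello there! how are you?')
# => ["hello", "there", "!", "how", "are", "you?"]
```

Note the final "you?" stays fused, since its punctuation is not followed by whitespace.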

.unmatched_enclosers?(text) ⇒ Boolean

Determine if a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the generator; we can just tell it to retry

Parameters:

  • text (String)

Returns:

  • (Boolean)

# File 'lib/moo_ebooks/nlp.rb', line 139

def self.unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0

    tokenize(text).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)

      return true if opened.negative? # Too many ends!
    end

    return true if opened != 0 # Mismatch somewhere.
  end

  false
end
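For each encloser pair, the method counts tokens that open it (`starter`: the opener at a word boundary, followed by a non-space) against tokens that close it (`ender`: a non-space followed by the closer at a word boundary), and flags any imbalance. A self-contained example, with `tokenize` copied in so it runs on its own:

```ruby
PUNCTUATION = '.?!,'

def tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|
    (?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/x
  sentence.split(regex)
end

def unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0

    tokenize(text).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)

      return true if opened.negative? # Too many ends!
    end

    return true if opened != 0 # Mismatch somewhere.
  end

  false
end

p unmatched_enclosers?('a "quoted" phrase')   # => false
p unmatched_enclosers?('an (unclosed paren')  # => true
```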