Class: Myaso::Tagger::TnT

Inherits:
Model
  • Object
Defined in:
lib/myaso/tagger/tnt.rb

Overview

A Tagger model that can work with TnT data files.

Constant Summary

START = 'SENT'

A start tag for a sentence.

STOP = 'SENT'

A stop tag for a sentence.

MISSING = '-'

Unknown tag for token.

CARD = '@CARD'

Tokens consisting of a sequence of decimal digits.

CARDPUNCT = '@CARDPUNCT'

Decimal digits followed by punctuation.

CARDSUFFIX = '@CARDSUFFIX'

Decimal digits followed by any suffix.

CARDSEPS = '@CARDSEPS'

Decimal digits separated by dots, dashes, etc.

UNKNOWN = '@UNKNOWN'

Tag frequencies to handle unknown words.

Instance Attribute Summary

Attributes inherited from Model

#interpolations, #lexicon, #ngrams

Instance Method Summary

Methods inherited from Model

#conditional, #e, #q, #rare?

Constructor Details

#initialize(ngrams_path, lexicon_path, interpolations = nil) ⇒ TnT

The tagging model is initialized from two data files. The first is an n-grams file that stores counts for unigrams, bigrams, and trigrams. The second is a lexicon file that stores words and their frequencies in the source corpus.

Please note that the learning stage is not well optimized, so initialization may take about 120 seconds.


# File 'lib/myaso/tagger/tnt.rb', line 48

def initialize(ngrams_path, lexicon_path, interpolations = nil)
  @ngrams_path = File.expand_path(ngrams_path)
  @lexicon_path = File.expand_path(lexicon_path)
  super(interpolations)
end

Instance Attribute Details

#lexicon_path ⇒ Object (readonly)

Returns the value of attribute lexicon_path.


# File 'lib/myaso/tagger/tnt.rb', line 38

def lexicon_path
  @lexicon_path
end

#ngrams_path ⇒ Object (readonly)

Returns the value of attribute ngrams_path.


# File 'lib/myaso/tagger/tnt.rb', line 38

def ngrams_path
  @ngrams_path
end

Instance Method Details

#classify(word) ⇒ Object

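If the word is rare, it is mapped to one of the following categories: digits only (CARD), digits followed by punctuation (CARDPUNCT), digits followed by a non-digit suffix (CARDSUFFIX), digits separated by dots, dashes, etc. (CARDSEPS), or unknown (UNKNOWN). Otherwise, the word is its own category.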


# File 'lib/myaso/tagger/tnt.rb', line 58

def classify(word)
  return word unless rare? word
  case word
  when /^\d+$/ then CARD
  when /^\d+[.,;:]+$/ then CARDPUNCT
  when /^\d+\D+$/ then CARDSUFFIX
  when /^\d+[.,;:\-]+(\d+[.,;:\-]+)*\d+$/ then CARDSEPS
  else UNKNOWN
  end
end
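
The rules above can be tried in isolation. The following is a minimal standalone sketch, not the library's class: the constants are copied from this page, while `classify_sketch` and its `rare` flag are hypothetical stand-ins for `#classify` and the inherited `#rare?` predicate.

```ruby
# Constants copied from the Constant Summary on this page.
CARD       = '@CARD'
CARDPUNCT  = '@CARDPUNCT'
CARDSUFFIX = '@CARDSUFFIX'
CARDSEPS   = '@CARDSEPS'
UNKNOWN    = '@UNKNOWN'

# Hypothetical helper: `rare` replaces the model's rare? lookup so the
# sketch runs without a loaded lexicon.
def classify_sketch(word, rare)
  return word unless rare
  case word
  when /^\d+$/ then CARD
  when /^\d+[.,;:]+$/ then CARDPUNCT
  when /^\d+\D+$/ then CARDSUFFIX
  when /^\d+[.,;:\-]+(\d+[.,;:\-]+)*\d+$/ then CARDSEPS
  else UNKNOWN
  end
end

classify_sketch('2013', true)       # => "@CARD"
classify_sketch('42.', true)        # => "@CARDPUNCT"
classify_sketch('5th', true)        # => "@CARDSUFFIX"
classify_sketch('12-05-2013', true) # => "@CARDSEPS"
classify_sketch('xyzzy', true)      # => "@UNKNOWN"
classify_sketch('the', false)       # => "the"
```

Because the branches are tested in order, a token such as '10--' falls into CARDSUFFIX: the dash is not in the CARDPUNCT character set, and the CARDSEPS pattern requires trailing digits.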

#compute_interpolations! ⇒ Object

Compute the linear interpolation coefficients used to evaluate q(first, second, third).


# File 'lib/myaso/tagger/tnt.rb', line 132

def compute_interpolations!
  lambdas = [0.0, 0.0, 0.0]

  unigram, bigram = nil, nil

  read(ngrams_path) do |first, second, third, count|
    first = unigram unless first
    second = bigram unless second

    unless third && count
      unigram, bigram = first, second
      next
    end

    count = count.to_i

    f = Array.new

    f << conditional(ngrams[third] - 1, ngrams.unigrams_count - 1)
    f << conditional(ngrams[second, third] - 1, ngrams[second] - 1)
    f << conditional(count - 1, ngrams[first, second] - 1)

    index = f.index(f.max)

    lambdas[index] += count if index
  end

  total = lambdas.inject(&:+)
  @interpolations = lambdas.map! { |l| l / total }
end
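
To make the weighting step concrete, here is a minimal sketch of the same idea on toy data. Everything in it is an assumption made for illustration: the counts are invented, plain hashes replace the `ngrams` storage, and `ratio` is a hypothetical stand-in for the inherited `#conditional`, assumed here to return 0.0 when the denominator is not positive.

```ruby
# Hypothetical stand-in for Model#conditional (behavior assumed).
def ratio(numerator, denominator)
  denominator > 0 ? numerator.to_f / denominator : 0.0
end

# Toy counts, invented for illustration only.
unigrams = { 'D' => 4, 'N' => 5, 'V' => 3 }
bigrams  = { %w[D N] => 4, %w[N V] => 3, %w[V D] => 1 }
trigrams = { %w[D N V] => 3, %w[N V D] => 1 }
total    = unigrams.values.sum

# lambdas[0] weights unigrams, lambdas[1] bigrams, lambdas[2] trigrams.
lambdas = [0.0, 0.0, 0.0]

trigrams.each do |(first, second, third), count|
  f = [
    ratio(unigrams[third] - 1, total - 1),
    ratio(bigrams.fetch([second, third], 0) - 1, unigrams[second] - 1),
    ratio(count - 1, bigrams.fetch([first, second], 0) - 1)
  ]
  # The order with the best "leave one out" estimate gets the trigram's count.
  lambdas[f.index(f.max)] += count
end

sum = lambdas.sum
interpolations = lambdas.map { |l| l / sum }
p interpolations # => [0.25, 0.0, 0.75]
```

In this toy run the trigram D N V is best predicted by its trigram estimate, so its count of 3 goes to the third coefficient, while N V D occurs only once and falls back to the unigram estimate.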

#inspect ⇒ Object


# File 'lib/myaso/tagger/tnt.rb', line 165

def inspect
  sprintf('#<%s @ngrams_path=%s @lexicon_path=%s @interpolations=%s>',
    self.class.name, ngrams_path.inspect, lexicon_path.inspect,
    interpolations.inspect)
end

#learn! ⇒ Object

Parse n-grams and lexicon files, and compute statistics over them.


# File 'lib/myaso/tagger/tnt.rb', line 83

def learn!
  parse_ngrams!
  parse_lexicon!
  compute_interpolations! if interpolations.nil?
end

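#parse_lexicon! ⇒ Object

Parse the lexicon file. Each entry holds a word, its total count, and a sequence of tag and count pairs; words that occur only once are replaced by their category from #classify.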


# File 'lib/myaso/tagger/tnt.rb', line 114

def parse_lexicon!
  read(lexicon_path) do |values|
    values.compact!

    word, word_count, rare = values.shift, values.shift.to_i, false
    word = classify(word) if rare = (word_count == 1)

    lexicon[word] += word_count

    values.each_slice(2) do |tag, count|
      lexicon[word, tag] += count.to_i
    end
  end
end
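
The per-line bookkeeping can be sketched without the library. The line layout below (whitespace-separated word, total count, then tag and count pairs) is an assumption about what the private `read` helper yields, the sample line is invented, and two plain hashes stand in for the `lexicon` storage. The rare-word replacement step is left out for brevity.

```ruby
# Hypothetical lexicon line: word, total count, then tag/count pairs.
line = "bank\t3\tNN\t2\tVB\t1"

values = line.split(/\s+/)
word = values.shift
word_count = values.shift.to_i

# Plain hashes standing in for the model's lexicon storage.
word_counts = Hash.new(0)
tag_counts  = Hash.new(0)

word_counts[word] += word_count

# The remaining fields come in pairs: a tag followed by its count.
values.each_slice(2) do |tag, count|
  tag_counts[[word, tag]] += count.to_i
end

p word_counts # => {"bank"=>3}
```

Accumulating with `+=` rather than assigning matters once rare words are collapsed, since several distinct words then share one category key such as '@CARD'.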

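#parse_ngrams! ⇒ Object

Parse the n-grams file. A line whose first or second field is empty inherits that field from the previous line, so bigrams and trigrams are listed under the unigram they start with.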


# File 'lib/myaso/tagger/tnt.rb', line 91

def parse_ngrams!
  unigram, bigram = nil, nil

  read(ngrams_path) do |values|
    values[0] = unigram unless values[0]
    values[1] = bigram unless values[1]

    if values[0] && values[1] && values[2] && values[3] # a trigram
      ngrams[*values[0..2]] = values[3].to_i
    elsif values[0] && values[1] && values[2] && !values[3] # a bigram
      ngrams[*values[0..1]] = values[2].to_i
    elsif values[0] && values[1] && !values[2] && !values[3] # an unigram
      ngrams[values[0]] = values[1].to_i
    else
      raise 'dafuq i just read: %s' % values.inspect
    end

    unigram, bigram = values[0], values[1]
  end
end
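
The inheritance of leading fields is easier to follow on a small input. This sketch assumes a tab-separated layout in which empty leading fields mean "same as the line above"; the sample data is invented, and a plain hash keyed by arrays replaces the `ngrams` storage.

```ruby
# Hypothetical excerpt of an n-grams file (layout assumed).
data = "DT\t10\n\tNN\t7\n\t\tVB\t3\n"

counts = {}
unigram, bigram = nil, nil

data.each_line do |line|
  # Empty fields become nil so they can be filled from the previous line.
  values = line.chomp.split("\t").map { |v| v.empty? ? nil : v }
  first, second, third, count = values
  first  ||= unigram
  second ||= bigram

  if third && count           # a trigram
    counts[[first, second, third]] = count.to_i
  elsif third                 # a bigram: the third field is the count
    counts[[first, second]] = third.to_i
  else                        # a unigram: the second field is the count
    counts[[first]] = second.to_i
  end

  unigram, bigram = first, second
end

p counts # => {["DT"]=>10, ["DT", "NN"]=>7, ["DT", "NN", "VB"]=>3}
```

After a unigram line the remembered second field is actually a count, which is harmless in this layout because the following bigram line always supplies its own second field.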

#start_symbol ⇒ Object

The tagger requires the sentence start symbol to be defined.


# File 'lib/myaso/tagger/tnt.rb', line 71

def start_symbol
  START
end

#stop_symbol ⇒ Object

The tagger requires the sentence stop symbol to be defined.


# File 'lib/myaso/tagger/tnt.rb', line 77

def stop_symbol
  STOP
end