Class: Rly::Lex

Inherits:
Object
Defined in:
lib/rly/lex.rb,
lib/rly/helpers.rb

Overview

Base class for your lexer.

Generally, you define a new lexer by subclassing Rly::Lex. Use the class methods Lex.token, Lex.ignore, Lex.literals, and Lex.on_error to configure the lexer (see each method's documentation for details).

Once your lexer is configured, create an instance by passing it the String to tokenize, then call #next to get tokens one at a time. If more input arrives later, you can append it to the input buffer at any time with #input.

Direct Known Subclasses

FileLex

Constructor Details

#initialize(input = "") ⇒ Lex

Creates a new lexer instance for the given input.

Examples:

class MyLexer < Rly::Lex
  ignore " "
  token :LOWERS, /[a-z]+/
  token :UPPERS, /[A-Z]+/
end

lex = MyLexer.new("hello WORLD")
tok = lex.next
puts "#{tok.type} -> #{tok.value}" #=> "LOWERS -> hello"
tok = lex.next
puts "#{tok.type} -> #{tok.value}" #=> "UPPERS -> WORLD"
tok = lex.next # => nil

# File 'lib/rly/lex.rb', line 63

def initialize(input="")
  @input = input
  @pos = 0
  @lineno = 0
end

Instance Attribute Details

#lineno ⇒ Fixnum

Tracks the current line number for generated tokens

lineno's value should be increased manually. Check the example for a demo rule.

Examples:

token /\n+/ do |t| t.lexer.lineno += t.value.count("\n"); t end
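
A run of newlines matched by a rule such as /\n+/ can span several lines, so the counter has to grow by the number of newline characters in each match. The following standalone sketch shows only that arithmetic with plain String methods; it does not use Rly, and the sample string is made up for illustration.

```ruby
# Count lines the way a /\n+/ rule would: add the size of each newline run.
lineno = 0
"a\n\nb\n".scan(/\n+/) { |run| lineno += run.count("\n") }
lineno  # => 3
```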

# File 'lib/rly/lex.rb', line 30

def lineno
  @lineno
end

#pos ⇒ Fixnum

Tracks the current position in the input string

Generally, it should only be used to skip a few characters in the error handler.

Examples:

on_error do |t|
  t.lexer.pos += 1
  nil # skip the bad character
end

# File 'lib/rly/lex.rb', line 44

def pos
  @pos
end

Class Method Details

.callables ⇒ Object


# File 'lib/rly/lex.rb', line 191

def callables
  @callables ||= {}
end

.ignore(ign) ⇒ Object

Specifies a list of one-character symbols to be ignored in the input

This method lets the lexer skip over formatting characters (such as tabs and spaces) quickly.

Examples:

class MyLexer < Rly::Lex
  literals "+-"
  token :INT, /\d+/
  ignore " \t"
end

lex = MyLexer.new("2 + 2")
lex.each do |tok|
  puts "#{tok.type} -> #{tok.value}" #=> "INT -> 2"
                                     #=> "+ -> +"
                                     #=> "INT -> 2"
end

# File 'lib/rly/lex.rb', line 346

def ignore(ign)
  @ignores = ign
  nil
end

.ignore_spaces_and_tabs ⇒ Object


# File 'lib/rly/helpers.rb', line 4

def self.ignore_spaces_and_tabs
  ignore " \t"
end

.lex_double_quoted_string_tokens ⇒ Object


# File 'lib/rly/helpers.rb', line 15

def self.lex_double_quoted_string_tokens
  token :STRING, /"[^"]*"/ do |t|
    t.value = t.value[1...-1]
    t
  end
end

.lex_number_tokens ⇒ Object


# File 'lib/rly/helpers.rb', line 8

def self.lex_number_tokens
  token :NUMBER, /\d+/ do |t|
    t.value = t.value.to_i
    t
  end
end

.literals(lit) ⇒ Object

Specifies a list of one-character literals

Use literals when you have several one-character tokens and do not want to define them one by one with the token method.

Examples:

class MyLexer < Rly::Lex
  literals "+-/*"
end

lex = MyLexer.new("+-")
lex.each do |tok|
  puts "#{tok.type} -> #{tok.value}" #=> "+ -> +"
                                     #=> "- -> -"
end

# File 'lib/rly/lex.rb', line 320

def literals(lit)
  @literals = lit
  nil
end

.metatokens(*args) ⇒ Object


# File 'lib/rly/lex.rb', line 218

def metatokens(*args)
  @metatokens_list = args
end

.metatokens_list ⇒ Object


# File 'lib/rly/lex.rb', line 214

def metatokens_list
  @metatokens_list ||= []
end

.on_error(&block) ⇒ Object

Specifies a block that should be called on error

When a lexing error occurs, the lexer first gives the developer a chance to handle it by passing the failing character to this block. If no block is provided, a lexing error always raises Rly::LexError.

The block must advance the lexer's #pos. It may return a new Rly::LexToken, or nil to skip the input.

Examples:

class MyLexer < Rly::Lex
  token :INT, /\d+/
  on_error do |tok|
    tok.lexer.pos += 1 # just skip the offending character
    nil                # return nil so that no token is produced
  end
end

lex = MyLexer.new("123qwe")
lex.each do |tok|
  puts "#{tok.type} -> #{tok.value}" #=> "INT -> 123"
end
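
The requirement that the handler advance #pos can be pictured as a before/after comparison: if the position is unchanged once the handler returns, the error is raised anyway. The sketch below is a standalone illustration of that check, not Rly code; run_handler is a hypothetical helper, and it raises ArgumentError where the library raises Rly::LexError.

```ruby
# Hypothetical stand-in for the check made after an error handler returns.
# The block receives the current position and character and returns the new position.
def run_handler(pos, char)
  new_pos = yield(pos, char)
  if new_pos == pos
    raise ArgumentError, "Illegal character '#{char}' at index #{pos}"
  end
  new_pos
end

# A handler that skips one character is accepted.
advanced = run_handler(3, "q") { |pos, _char| pos + 1 }   # => 4

# A handler that leaves the position unchanged still ends in an error.
stuck = begin
  run_handler(3, "q") { |pos, _char| pos }
rescue ArgumentError => e
  e.message
end
# stuck => "Illegal character 'q' at index 3"
```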

# File 'lib/rly/lex.rb', line 375

def on_error(&block)
  @error_block = block
  nil
end

.terminals ⇒ Object


# File 'lib/rly/lex.rb', line 187

def terminals
  self.tokens.map { |t,r,b| t }.compact + self.literals_list.chars.to_a + self.metatokens_list
end

.token(*args) {|tok| ... } ⇒ Object

Adds a token definition to a class

This method adds a token definition that is later used to tokenize input. It can register normal tokens as well as functional tokens (the latter are processed as usual but are not returned).

Examples:

class MyLexer < Rly::Lex
  token :LOWERS, /[a-z]+/   # this would match LOWERS on 1+ lowercase letters

  token :INT, /\d+/ do |t|  # this would match on integers
    t.value = t.value.to_i  # additionally the value is converted to Fixnum
    t                       # the updated token is returned
  end

  token /\n/ do |t|        # this would match on newlines
    t.lexer.lineno += 1    # the block will be executed on match, but
    t                      # no token will be returned (as name is not specified)
  end

end

Yield Parameters:

  • tok (LexToken)

    a new token instance for processed input

Yield Returns:

  • (LexToken)

    the same or modified token instance. Return nil to ignore the input

# File 'lib/rly/lex.rb', line 290

def token(*args, &block)
  if args.length == 2
    self.tokens << [args[0], args[1], block]
  elsif args.length == 1
    self.tokens << [nil, args[0], block]
  else
    raise ArgumentError
  end
  nil
end

.token_regexps ⇒ Object


# File 'lib/rly/lex.rb', line 195

def token_regexps
  return @token_regexps if @token_regexps

  collector = []
  self.tokens.each do |name, rx, block|
    name = "__anonymous_#{block.hash}".to_sym unless name

    self.callables[name] = block
    
    rxs = rx.to_s
    named_rxs = "\\A(?<#{name}>#{rxs})"

    collector << named_rxs
  end

  rxss = collector.join('|')
  @token_regexps = Regexp.new(rxss)
end
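
The method above joins every rule into one alternation of named groups, so the name of the group that captured text identifies the rule that matched. Here is a small illustration with plain Regexp, assuming a hypothetical two-rule table; rules and first_match are names made up for this sketch and are not part of Rly.

```ruby
# Hypothetical rule table in the same [name, regexp] order used above.
rules = [[:INT, /\d+/], [:LOWERS, /[a-z]+/]]

# Wrap each rule in a named group anchored at the start of the string.
combined = Regexp.new(rules.map { |name, rx| "\\A(?<#{name}>#{rx})" }.join('|'))

# Return [rule_name, matched_text] for the rule that matched, or nil.
def first_match(combined, text)
  m = combined.match(text)
  return nil unless m
  name = m.names.find { |n| m[n] }   # the only group that captured something
  [name.to_sym, m[name]]
end

first_match(combined, "42abc")  # => [:INT, "42"]
first_match(combined, "abc42")  # => [:LOWERS, "abc"]
first_match(combined, "?")      # => nil
```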

Instance Method Details

#build_token(type, value) ⇒ Object


# File 'lib/rly/lex.rb', line 178

def build_token(type, value)
  LexToken.new(type, value, self, @pos, @lineno)
end

#ignore_symbol ⇒ Object


# File 'lib/rly/lex.rb', line 182

def ignore_symbol
  @pos += 1
end

#input(input) ⇒ Object

Appends a string to the input buffer

The given string is appended to the input buffer; further #next calls will tokenize it as usual.

Examples:

lex = MyLexer.new("hello")

tok = lex.next
puts "#{tok.type} -> #{tok.value}" #=> "LOWERS -> hello"
tok = lex.next # => nil
lex.input("WORLD")
tok = lex.next
puts "#{tok.type} -> #{tok.value}" #=> "UPPERS -> WORLD"
tok = lex.next # => nil

# File 'lib/rly/lex.rb', line 90

def input(input)
  @input << input
  nil
end
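
Judging by the listings on this page, the constructor stores the String it is given and #input appends to it with <<, so the caller's own String object appears to serve as the buffer. The sketch below uses a hypothetical stand-in class (BufferSketch, not part of Rly) that copies only those two methods to show the effect, and how passing a dup avoids it.

```ruby
# Stand-in that mirrors only the two buffer-related methods shown on this page.
class BufferSketch
  def initialize(input = "")
    @input = input          # keeps a reference, not a copy
  end

  def input(more)
    @input << more          # mutates the referenced String in place
    nil
  end
end

original = +"hello"         # unary plus yields an unfrozen String
BufferSketch.new(original).input("WORLD")
original                    # => "helloWORLD"

untouched = +"hello"
BufferSketch.new(untouched.dup).input("WORLD")
untouched                   # => "hello"
```

Under the same assumption, a frozen String passed to the constructor would raise FrozenError on the first append, so handing over a dup is the safer choice when the caller still needs the original.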

#inspect ⇒ Object


# File 'lib/rly/lex.rb', line 69

def inspect
  "#<#{self.class} pos=#{@pos} len=#{@input.length} lineno=#{@lineno}>"
end

#next ⇒ LexToken?

Processes the next token in input

This is the main interface to the lexer. It returns the next available token, or nil if there are no more tokens in the input string.

Raises Rly::LexError if the input cannot be processed, which happens when neither the 'token' rules nor the 'literals' rule match. If no on_error handler is set, the exception is raised immediately. If a handler is set, the exception is raised only if #pos is still unchanged after the handler returns.

Examples:

lex = MyLexer.new("hello WORLD")

tok = lex.next
puts "#{tok.type} -> #{tok.value}" #=> "LOWERS -> hello"
tok = lex.next
puts "#{tok.type} -> #{tok.value}" #=> "UPPERS -> WORLD"
tok = lex.next # => nil

Raises:

  • (LexError)

    if the input cannot be processed


# File 'lib/rly/lex.rb', line 119

def next
  while @pos < @input.length
    if self.class.ignores_list[@input[@pos]]
      ignore_symbol
      next
    end

    m = self.class.token_regexps.match(@input[@pos..-1])

    if m && ! m[0].empty?
      val = nil
      type = nil
      resolved_type = nil
      m.names.each do |n|
        if m[n]
          type = n.to_sym
          resolved_type = (n.start_with?('__anonymous_') ? nil : type)
          val = m[n]
          break
        end
      end

      if type
        tok = build_token(resolved_type, val)
        @pos += m.end(0)
        tok = self.class.callables[type].call(tok) if self.class.callables[type]

        if tok && tok.type
          return tok
        else
          next
        end
      end
    end
    
    if self.class.literals_list[@input[@pos]]
      tok = build_token(@input[@pos], @input[@pos])
      matched = true
      @pos += 1
      return tok
    end

    if self.class.error_hander
      pos = @pos
      tok = build_token(:error, @input[@pos])
      tok = self.class.error_hander.call(tok)
      if pos == @pos
        raise LexError.new("Illegal character '#{@input[@pos]}' at index #{@pos}")
      else
        return tok if tok && tok.type
      end
    else
      raise LexError.new("Illegal character '#{@input[@pos]}' at index #{@pos}")
    end

  end
  return nil
end
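
Read top to bottom, the listing above tries four things at each position: skip an ignored character, match the token rules, match a one-character literal, and finally fall back to the error path. The following is a simplified, self-contained sketch of that ordering. It is not the Rly implementation: sketch_tokens and its keyword arguments are hypothetical, it returns plain [type, value] pairs instead of LexToken objects, it runs no token blocks, and it raises ArgumentError where the real method consults the error handler.

```ruby
# Simplified stand-in for the dispatch order; returns [type, value] pairs.
def sketch_tokens(input, ignores:, rules:, literals:)
  combined = Regexp.new(rules.map { |name, rx| "\\A(?<#{name}>#{rx})" }.join('|'))
  pos = 0
  out = []
  while pos < input.length
    ch = input[pos]
    if ignores[ch]                          # 1. ignored character: skip it
      pos += 1
    elsif (m = combined.match(input[pos..-1])) && !m[0].empty?
      name = m.names.find { |n| m[n] }      # 2. the token rule that matched
      out << [name.to_sym, m[name]]
      pos += m.end(0)
    elsif literals[ch]                      # 3. one-character literal
      out << [ch, ch]
      pos += 1
    else                                    # 4. nothing matched
      raise ArgumentError, "Illegal character '#{ch}' at index #{pos}"
    end
  end
  out
end

sketch_tokens("2 + 2", ignores: " \t", rules: [[:INT, /\d+/]], literals: "+-")
# => [[:INT, "2"], ["+", "+"], [:INT, "2"]]
```

The membership tests rely on String#[] with a String argument, which returns the substring when it is present and nil otherwise, so a plain String such as " \t" works as a small character set.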