
Pythonic Way To Implement A Tokenizer

I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice. I've implemented a tokenizer before in C and in Java, so I'm fine with the theory; I'd just like to make sure I'm following Pythonic style and best practices.

Solution 1:

There's an undocumented class in the re module called re.Scanner. It's very straightforward to use for a tokenizer:

import re

scanner = re.Scanner([
    (r"[0-9]+",  lambda scanner, token: ("INTEGER", token)),
    (r"[a-z_]+", lambda scanner, token: ("IDENTIFIER", token)),
    (r"[,.]+",   lambda scanner, token: ("PUNCTUATION", token)),
    (r"\s+",     None),  # None means: skip the token.
])

results, remainder = scanner.scan("45 pigeons, 23 cows, 11 spiders.")
print(results)

will result in

[('INTEGER', '45'),
 ('IDENTIFIER', 'pigeons'),
 ('PUNCTUATION', ','),
 ('INTEGER', '23'),
 ('IDENTIFIER', 'cows'),
 ('PUNCTUATION', ','),
 ('INTEGER', '11'),
 ('IDENTIFIER', 'spiders'),
 ('PUNCTUATION', '.')]

I used re.Scanner to write a pretty nifty configuration/structured data format parser in only a couple hundred lines.
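
As an aside (my own example, not part of the original answer), scan() also returns whatever part of the input it could not match, which is handy for error reporting:

results, remainder = scanner.scan("45 pigeons @ large")
print(results)    # [('INTEGER', '45'), ('IDENTIFIER', 'pigeons')]
print(remainder)  # '@ large' -- scanning stops at the first unmatched character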

Solution 2:

Python takes a "we're all consenting adults" approach to information hiding. It's OK to use variables as though they were constants, and trust that users of your code won't do something stupid.
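
To make that concrete (a generic sketch of my own, not from the answer): constants and "private" members are marked purely by naming convention, and nothing in the language stops a caller from touching them.

# ALL_CAPS signals "treat this as a constant"; a leading underscore signals
# "internal detail" -- neither is enforced by the language.
MAX_TOKEN_LENGTH = 256

class Lexer(object):
    def __init__(self, text):
        self.text = text
        self._position = 0  # internal by convention only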

Solution 3:

In many situations, especially when parsing long input streams, you may find it more useful to implement your tokenizer as a generator function. That way you can iterate over all the tokens without needing a lot of memory to build the full list of tokens first.

For generators, see the original proposal or other online docs.
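
Here's a minimal sketch of that idea (my own code, using re.finditer with named groups rather than re.Scanner; the token names mirror Solution 1):

import re

TOKEN_PATTERN = re.compile(r"""
      (?P<INTEGER>[0-9]+)
    | (?P<IDENTIFIER>[a-z_]+)
    | (?P<PUNCTUATION>[,.])
    | (?P<WHITESPACE>\s+)
""", re.VERBOSE)

def tokenize(text):
    # Yield (type, value) pairs lazily instead of building a full list.
    for match in TOKEN_PATTERN.finditer(text):
        if match.lastgroup != "WHITESPACE":  # drop whitespace, like the None rule above
            yield (match.lastgroup, match.group())

for token in tokenize("45 pigeons, 23 cows"):
    print(token)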

Solution 4:

Thanks for your help; I've started to bring these ideas together and have come up with the following. Is there anything terribly wrong with this implementation (in particular, I'm concerned about passing a file object to the tokenizer)?

class Tokenizer(object):

    def __init__(self, file):
        self.file = file

    def __get_next_character(self):
        return self.file.read(1)

    def __peek_next_character(self):
        # Remember where we are, read one character, then rewind
        # so the character can be read again later.
        position = self.file.tell()
        character = self.file.read(1)
        self.file.seek(position, 0)
        return character

    def __read_number(self):
        value = ""
        while self.__peek_next_character().isdigit():
            value += self.__get_next_character()
        return value

    def next_token(self):
        character = self.__peek_next_character()

        if character.isdigit():
            return self.__read_number()
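
For what it's worth, a quick way to try this out (my own usage example) is to wrap a string in io.StringIO, which gives a seekable file-like object the class can read from:

import io

tokenizer = Tokenizer(io.StringIO("123 456"))
print(tokenizer.next_token())  # '123'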

Solution 5:

"Is there a better alternative to just simply returning a list of tuples?"

Nope. It works really well.
