Pythonic Way To Implement A Tokenizer
Solution 1:
There's an undocumented class in the re module called re.Scanner. It's very straightforward to use for a tokenizer:
import re

scanner = re.Scanner([
    (r"[0-9]+", lambda scanner, token: ("INTEGER", token)),
    (r"[a-z_]+", lambda scanner, token: ("IDENTIFIER", token)),
    (r"[,.]+", lambda scanner, token: ("PUNCTUATION", token)),
    (r"\s+", None),  # None == skip the token.
])

results, remainder = scanner.scan("45 pigeons, 23 cows, 11 spiders.")
print(results)
will result in
[('INTEGER', '45'),
('IDENTIFIER', 'pigeons'),
('PUNCTUATION', ','),
('INTEGER', '23'),
('IDENTIFIER', 'cows'),
('PUNCTUATION', ','),
('INTEGER', '11'),
('IDENTIFIER', 'spiders'),
('PUNCTUATION', '.')]
I used re.Scanner to write a pretty nifty configuration/structured data format parser in only a couple hundred lines.
Solution 2:
Python takes a "we're all consenting adults" approach to information hiding. It's OK to use variables as though they were constants, and trust that users of your code won't do something stupid.
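To make that concrete, here's a minimal illustrative sketch (the names are invented for the example). Python marks constants and internals by naming convention alone; nothing actually enforces it:

MAX_TOKENS = 1024  # ALL_CAPS signals "treat me as a constant" -- nothing stops reassignment

class Tokenizer(object):
    def __init__(self, text):
        self.text = text      # plain name: part of the public interface
        self._position = 0    # leading underscore: "internal, don't rely on this"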
Solution 3:
In many situations, especially when parsing long input streams, you may find it more useful to implement your tokenizer as a generator function. That way you can iterate over all the tokens without needing a lot of memory to build the full list of tokens first.
For generators, see the original proposal (PEP 255) or other online docs.
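As a rough sketch of that idea (the token table and the tokenize function here are my own invention for the example, built on re.finditer with named groups):

import re

TOKEN_SPEC = [
    ("INTEGER", r"[0-9]+"),
    ("IDENTIFIER", r"[a-z_]+"),
    ("PUNCTUATION", r"[,.]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    # Yield (type, value) pairs one at a time instead of building a list.
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup != "SKIP":  # drop whitespace tokens
            yield (match.lastgroup, match.group())

for token in tokenize("45 pigeons, 23 cows, 11 spiders."):
    print(token)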
Solution 4:
Thanks for your help. I've started to bring these ideas together, and I've come up with the following. Is there anything terribly wrong with this implementation (in particular, I'm concerned about passing a file object to the tokenizer)?
class Tokenizer(object):
    def __init__(self, file):
        self.file = file

    def __get_next_character(self):
        return self.file.read(1)

    def __peek_next_character(self):
        character = self.file.read(1)
        if character:  # at EOF, read(1) returns "" and there is nothing to un-read
            self.file.seek(self.file.tell() - 1, 0)
        return character

    def __read_number(self):
        value = ""
        while self.__peek_next_character().isdigit():
            value += self.__get_next_character()
        return value

    def next_token(self):
        character = self.__peek_next_character()
        if character.isdigit():
            return self.__read_number()
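For what it's worth, here's a quick way to exercise the class; io.StringIO stands in for a real file here and is my own addition, not from the original post:

import io

tokenizer = Tokenizer(io.StringIO("123 456"))
print(tokenizer.next_token())  # prints '123'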
Solution 5:
"Is there a better alternative to just simply returning a list of tuples?"
Nope. It works really well.