Question:

Algorithm Tokenize(string s):
    list(string) tokens := [ ]
    while not isEmpty(s):
        If s begins with a token, remove the longest possible token from the beginning of s and push that token onto the back of tokens
        If head(s) is a whitespace character, pop(s)
    return tokens

For example, below is a trace of the call tokenize("hello, world 3+1")

# it | tokens                            | s
-----+-----------------------------------+--------------------
  0  | []                                | "hello, world 3+1"
  1  | ["hello"]                         | ", world 3+1"
  2  | ["hello",","]                     | " world 3+1"
  3  | ["hello",","]                     | "world 3+1"
  4  | ["hello",",","world"]             | " 3+1"
  5  | ["hello",",","world"]             | "3+1"
  6  | ["hello",",","world","3"]         | "+1"
  7  | ["hello",",","world","3","+"]     | "1"
  8  | ["hello",",","world","3","+","1"] | ""

Homework:

Implement Tokenize in Python. That is, write a Python function called tokenize such that if s is a string for which Tokenize(s) is defined, then Tokenize(s) == tokenize(s).

Test cases:

Note the following:

tokenize("hello world")      => ["hello", "world"]
tokenize("hello'world'foo")  => ["hello", "'world'", "foo"]
                                (equivalently printed as ["hello", "\'world\'", "foo"]
                                 or ['hello', '\'world\'', 'foo'])
tokenize("'hello\\'world'")  => ["'hello\\'world'"]
tokenize("'hello world'")    => ["'hello world'"]
tokenize("3.33(33..")        -- not in test suite
tokenize("\\")               => ["\\"]
tokenize("\t")               => [ ]
tokenize("'a'e'c'")          => ["'a'", "e", "'c'"]
tokenize("3.3+1")            => ["3.3","+","1"]
tokenize("''")               => ["''"]

For example, you could enter into repl.it or IDLE,

>>> tokenize("'hello\\'world'") == ["'hello\\'world'"]
True
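
If you want to run all of the listed cases at once rather than typing them into the REPL, a small driver like the sketch below works; the module name tokenize_hw is only a placeholder for whatever .py file you submit, and the expected lists are copied from the test cases above (the "3.33(33.." case is omitted because it is not in the test suite).

# A minimal sketch of a test driver for the cases listed above.
# tokenize_hw is a placeholder module name; use the name of the file you submit.
from tokenize_hw import tokenize

CASES = [
    ("hello world",       ["hello", "world"]),
    ("hello'world'foo",   ["hello", "'world'", "foo"]),
    ("'hello\\'world'",   ["'hello\\'world'"]),
    ("'hello world'",     ["'hello world'"]),
    ("\\",                ["\\"]),
    ("\t",                []),
    ("'a'e'c'",           ["'a'", "e", "'c'"]),
    ("3.3+1",             ["3.3", "+", "1"]),
    ("''",                ["''"]),
]

for s, expected in CASES:
    got = tokenize(s)
    status = "ok" if got == expected else "FAIL"
    print(f"{status}: tokenize({s!r}) = {got!r}, expected {expected!r}")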

Send your tokenizer as an attached .py file

EXAMPLE CODE

The Python program below implements Tokenize. In this exercise, we assume the input is correct (that is, that it consists of tokens, whitespace, and comments). This is an oversimplification of the problem of tokenizing real-world text, but it is still not easy.

# whiteChar(c) iff c is a whitespace character.
def whiteChar(c): return (c in " \t\v\n\r")

# tokenize(s) = Tokenize(s), whenever Tokenize(s) is defined
def tokenize(s):
    i = 0
    tokens = [ ]
    # invariants: 0 <= i <= len(s), tokens = Tokenize(s[:i])
    while i < len(s):
        if whiteChar(s[i]):
            i = i+1
        elif i < len(s)-1 and s[i:i+2] == "//":
            # skip the comment until the next return or newline character
            i = i+2
            while i < len(s) and s[i] not in "\n\r":
                i = i+1
        else:
            # process the longest possible token
            tok = munch(s,i)
            tokens.append(tok)
            i = i + len(tok)
    return tokens

# If 0 <= i < len(s) and s[i:] begins with a token, then munch(s,i)
# is the longest token that is a prefix of s[i:]
def munch(s,i):
    A,j = 'em',i
    # invariants: i <= j <= len(s), A is the state that describes s[i:j]
    while True:
        if j == len(s): break  # end of string
        A = newState(A,s[j])
        # A is now the state that *would* result if we process
        # one more character.
        if A == 'err': break
        # A is not 'err', so good with one more character
        j = j+1
    return s[i:j]
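
As a quick illustration of maximal munch (the expected results here are worked out from the newState table defined just below, not taken from the assignment):

# Illustration only; run after newState below has been defined.
print(munch("hello, world", 0))   # prints hello   (',' drives 'id' to 'err')
print(munch("hello, world", 7))   # prints world
print(munch("3abc", 0))           # prints 3       ('num' does not accept letters)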

'''
A *state* is a string. States *describe* strings as given below:

1. 'em' describes str iff str is empty
2. 'id' describes str iff str is an identifier
3. 'num' describes str iff str is a numeral
4. 'err' describes str iff str is not a prefix of any token
'''

# If state A describes a string str and c is a character, then newState(A,c)
# describes the string str+c.
def newState(A,c):
    if A=='em':
        if c.isalpha(): return 'id'
        elif c.isdigit(): return 'num'
    elif A=='id':
        if (c.isalpha() or c.isdigit()): return 'id'
    elif A == 'num':
        if c.isdigit(): return 'num'
    return 'err'

# END PROGRAM

Below is a transcript of a test session in repl.it with the program above:

Python 3.5.2 (default, Dec 2015, 13:05:11) [GCC 4.8.2] on linux

tokenize("hello world") => ['hello', 'world']

tokenize("hello //comment world//") => ['hello', 'world']

tokenize("hello you little 8// com") => ['hello', 'you', 'little', '8']

tokenize("duh 8 // comment hi3") => ['duh', '8', 'hi3']
