Question: def build _ corpus ( texts , lang = en ) : Corpus builder which returns a Corpus object processed

def build_corpus(texts, lang="en"):
"""Corpus builder which returns a Corpus object processed on texts as language
specified by lang (defaults to "en"):
Should perform all of the following pre-processing functions:
- Lemmatize the tokens
- Convert tokens to lowercase
- Remove punctuation
- Remove numbers
- Remove tokens shorter than 2 characters
"""
# Here, we just use the index of the text as the label for the corpus item
corpus = Corpus({ i:r for i, r in enumerate(texts)}, language=lang)
# TODO: Complete the implementation of this function and submit the
# .py download of this notebook as your assignment submission.
example_docs =[ # Feel free to edit this corpus for further testing
# to be sure that your functions meet specifications.
"The 3 cats sat on the mats!",
"1 fish 2 fish Red fish Blue fish",
"She sells $ea$shells"
]
example_corpus = build_corpus(example_docs)
corpus_tokens_flattened(example_corpus)

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!