Python:
Your Task: Define get_top_tokens as follows:
Given a cluster, identify its most frequent tokens.
Inputs:
cid: The ID of a cluster to analyse
labels: cluster assignments, where document i was assigned to cluster labels[i].
corpusdf: A corpus dataframe having the columns 'id', 'title' and 'pesudodoc'.
k: the number of tokens to return
Return: top_tokens, a Python set of the k most frequent tokens
Steps:
*) Use labels to identify the documents assigned to cluster cid
*) Select their corresponding pseudo-documents from corpusdf
*) For each unique token, count the number of pseudo-documents in which it appeared
*) Return the k most frequently occurring tokens as a Python set. In the case of ties, consider tokens in ascending order
Other notes:
*) to match a document ID i to its pseudo-document, note that i is the index value of corpusdf (and, in particular, not the 'id' column!)
*) if there are fewer than k unique tokens, return as many as are available
*) if there are ties, break them using the token itself: sort the tied tokens by name
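The counting and tie-breaking rules above can be illustrated on toy data. This is only a sketch: the `docs` list is made up, and it assumes each pseudo-document is a set of tokens, which the problem statement does not confirm.

```python
from collections import Counter

# Hypothetical pseudo-documents, each a set of tokens (made-up data).
docs = [{"book", "page"}, {"book", "zebra"}, {"apple", "book", "page"}]

# Count the number of documents containing each token,
# not the total number of occurrences.
doc_freq = Counter(tok for doc in docs for tok in set(doc))

# Sort by descending count, then by token name ascending to break ties.
ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
print(ranked)
# [('book', 3), ('page', 2), ('apple', 1), ('zebra', 1)]

# Slicing returns fewer than k items when fewer are available.
top_3 = {tok for tok, _ in ranked[:3]}
```

Here 'apple' and 'zebra' tie at one document each, so 'apple' is taken first and `top_3` is {'book', 'page', 'apple'}.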
Example: for the demo code, a correct implementation should return:
{'page', 'reviews', 'academic', 'book'}
Solution:
import dill
import numpy as np
import pandas as pd

def get_top_tokens(cid: int, labels: np.ndarray, corpusdf: pd.DataFrame, k=10) -> set:
    ## code goes here

# Demo: load the provided arguments and print the result
with open('resource/asnlib/publicdata/demo_args_get_top_tokens.dill', 'rb') as fp:
    demo_cid, demo_labels, demo_corpusdf, demo_k = dill.load(fp)
print(get_top_tokens(demo_cid, demo_labels, demo_corpusdf, demo_k))
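One possible way to fill in the function body is sketched below. It is not a verified solution, since the demo file is not available here. Two assumptions are made: the pseudo-document column may be spelled 'pesudodoc' (as in the statement) or 'pseudodoc', so the sketch accepts either; and each pseudo-document is either an iterable of tokens or a whitespace-separated string. The toy dataframe at the end is made up for illustration.

```python
from collections import Counter

import numpy as np
import pandas as pd


def get_top_tokens(cid: int, labels: np.ndarray, corpusdf: pd.DataFrame, k=10) -> set:
    # Assumption: the statement spells the column 'pesudodoc', which may be a
    # typo for 'pseudodoc'; use whichever is present.
    col = 'pseudodoc' if 'pseudodoc' in corpusdf.columns else 'pesudodoc'

    # Document i is position i in `labels`, and it matches the *index* value
    # of corpusdf (not the 'id' column), so select rows with .loc.
    doc_ids = np.flatnonzero(np.asarray(labels) == cid)
    pseudodocs = corpusdf.loc[doc_ids, col]

    # Document frequency: each token counts at most once per pseudo-document.
    counts = Counter()
    for doc in pseudodocs:
        # Assumption: a pseudo-document is an iterable of tokens, or a
        # whitespace-separated string.
        tokens = doc.split() if isinstance(doc, str) else doc
        counts.update(set(tokens))

    # Highest count first; ties broken by token name in ascending order.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    # Slicing returns fewer than k tokens if fewer are available.
    return set(ranked[:k])


# Made-up toy corpus: the 'id' column deliberately differs from the index.
toy_df = pd.DataFrame({
    'id': [10, 11, 12, 13],
    'title': ['a', 'b', 'c', 'd'],
    'pesudodoc': [{'book', 'page'}, {'book', 'zebra'},
                  {'other'}, {'apple', 'book', 'page'}],
})
toy_labels = np.array([0, 0, 1, 0])

print(sorted(get_top_tokens(0, toy_labels, toy_df, k=3)))
# ['apple', 'book', 'page']
```

Sorting on the key `(-count, token)` handles both ranking rules in one pass, which is why `Counter.most_common` is not used here: it does not order tied tokens by name.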
