Question: Python: define get_top_tokens to identify the most frequent tokens of a cluster
Python:
Your Task: Define get_top_tokens as follows:
Given a cluster, identify its most frequent tokens.
Inputs:
cid: the ID of the cluster to analyse
labels: cluster assignments, where document i was assigned to cluster labels[i]
corpus_df: a corpus DataFrame with the columns 'id', 'title', and 'pseudodoc'
k: the number of tokens to return
Return: top_tokens, a Python set of the k most frequent tokens
Steps:
Use labels to identify the documents assigned to cluster cid.
Select their corresponding pseudodocuments from corpus_df.
For each unique token, count the number of pseudodocuments in which it appears.
Return the k most frequently occurring tokens as a Python set. In the case of ties, consider tokens in ascending order.
Other notes:
To match a document ID i to its pseudodocument, note that i is the index value of corpus_df and, in particular, not the 'id' column!
If there are fewer than k unique tokens, return as many as are available.
If there are ties, break them with the token itself: sort the tied tokens by name.
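The note about index values versus the 'id' column is easy to get wrong, so here is a small illustration. The toy DataFrame and labels below are invented for this example (they are not the demo data), and the 'id' values are deliberately chosen to differ from the index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy corpus: the 'id' column deliberately differs from the index.
toy_df = pd.DataFrame(
    {
        "id": [101, 102, 103],
        "title": ["A", "B", "C"],
        "pseudodoc": ["book page", "book reviews", "academic"],
    }
)
toy_labels = np.array([1, 0, 1])

# Documents assigned to cluster 1 sit at positions 0 and 2 of toy_labels ...
doc_ids = np.flatnonzero(toy_labels == 1)

# ... and those positions are looked up as index values, not in the 'id' column.
selected = toy_df.loc[doc_ids, "pseudodoc"].tolist()
print(selected)
```

Filtering on the 'id' column instead (for example, `toy_df[toy_df["id"].isin(doc_ids)]`) would match nothing here, because no 'id' value equals 0 or 2.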
Example: for the demo code, a correct implementation should return:
{'page', 'reviews', 'academic', 'book'}
Solution:
def get_top_tokens(cid: int, labels: np.ndarray, corpus_df: pd.DataFrame, k: int) -> set:
    ## code
with open('resource/asnlib/publicdata/demo_args_get_top_tokens.dill', 'rb') as fp:
    demo_cid, demo_labels, demo_corpus_df, demo_k = dill.load(fp)
print(get_top_tokens(demo_cid, demo_labels, demo_corpus_df, demo_k))
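The steps listed in the question can be sketched as follows. This is one possible implementation, not a verified reference answer. The question does not say how a pseudodocument is stored, so the sketch assumes each 'pseudodoc' entry is either a whitespace-separated string or an iterable of tokens.

```python
from collections import Counter

import numpy as np
import pandas as pd


def get_top_tokens(cid: int, labels: np.ndarray, corpus_df: pd.DataFrame, k: int) -> set:
    # Positions i with labels[i] == cid are also index values of corpus_df.
    doc_ids = np.flatnonzero(np.asarray(labels) == cid)
    pseudodocs = corpus_df.loc[doc_ids, "pseudodoc"]

    # For each token, count the number of pseudodocuments that contain it.
    doc_counts = Counter()
    for doc in pseudodocs:
        # Assumption: a pseudodoc is a whitespace-separated string or an iterable of tokens.
        tokens = doc.split() if isinstance(doc, str) else doc
        doc_counts.update(set(tokens))  # set() so each document counts a token once

    # Highest document count first; ties are broken by token in ascending order.
    ranked = sorted(doc_counts.items(), key=lambda item: (-item[1], item[0]))

    # Slicing returns fewer than k tokens when fewer are available.
    return {token for token, _ in ranked[:k]}
```

Sorting on the key `(-count, token)` handles both requirements in one pass: counts are ordered descending and tied tokens are ordered alphabetically before the top k are taken.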
