Question:
Zipf’s law of word distribution states the following: Take a large corpus of text, count the frequency of every word in the corpus, and then rank these frequencies in decreasing order. Let f_I be the I-th largest frequency in this list; that is, f_1 is the frequency of the most common word (usually “the”), f_2 is the frequency of the second most common word, and so on. Zipf’s law states that f_I is approximately equal to α/I for some constant α. The law tends to be highly accurate except for very small and very large values of I.
Choose a corpus of at least 20,000 words of online text, and verify Zipf’s law experimentally. Define an error measure and find the value of α where Zipf’s law best matches your experimental data. Create a log–log graph plotting f_I vs. I and α/I vs. I. (On a log–log graph, the function α/I is a straight line.) In carrying out the experiment, be sure to eliminate any formatting tokens (e.g., HTML tags) and normalize upper and lower case.
Step by Step Answer:
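One possible way to carry out the experiment is sketched below in Python. It is an illustration under stated assumptions, not the textbook's official solution. The error measure chosen here is the mean squared difference between log f_I and log(α/I); that choice is an assumption, and other measures (for example, squared error on the raw frequencies) are equally defensible. With this measure the best α has a closed form: log α is the mean of log f_I + log I over all ranks. The function names (tokenize, ranked_frequencies, fit_alpha, log_error, plot_zipf) and the output file name zipf.png are hypothetical. The tag-stripping regex is a simplification that does not remove the contents of script or style elements.

```python
import html
import math
import re
from collections import Counter


def tokenize(raw_text):
    """Strip HTML tags and entities, lowercase, and split into words."""
    no_tags = re.sub(r"<[^>]+>", " ", raw_text)   # drop formatting tokens
    plain = html.unescape(no_tags).lower()        # decode entities, normalize case
    return re.findall(r"[a-z]+(?:'[a-z]+)?", plain)


def ranked_frequencies(words):
    """Return the word counts sorted in decreasing order: f_1, f_2, ..."""
    return sorted(Counter(words).values(), reverse=True)


def fit_alpha(freqs):
    """Least-squares fit of log f_I = log(alpha) - log(I).

    Minimizing sum_I (log f_I - log alpha + log I)^2 over alpha gives
    log alpha = mean(log f_I + log I).
    """
    logs = [math.log(f) + math.log(rank)
            for rank, f in enumerate(freqs, start=1)]
    return math.exp(sum(logs) / len(logs))


def log_error(freqs, alpha):
    """Mean squared error between log f_I and log(alpha / I)."""
    squares = [(math.log(f) - math.log(alpha / rank)) ** 2
               for rank, f in enumerate(freqs, start=1)]
    return sum(squares) / len(squares)


def plot_zipf(freqs, alpha, path="zipf.png"):
    """Log-log plot of f_I and alpha/I against rank I (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    ranks = list(range(1, len(freqs) + 1))
    plt.loglog(ranks, freqs, ".", label="observed f_I")
    plt.loglog(ranks, [alpha / r for r in ranks], "-", label="alpha / I")
    plt.xlabel("rank I")
    plt.ylabel("frequency")
    plt.legend()
    plt.savefig(path)
```

To use the sketch, read the corpus into a string, pass it through tokenize and ranked_frequencies, then call fit_alpha and log_error on the resulting list, and plot_zipf to produce the graph. Because the law is least accurate at very small and very large I, and a log-space fit gives every rank equal weight, it may be worth fitting only a middle slice of ranks and comparing the resulting α and error with the full-range fit.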
Artificial Intelligence A Modern Approach
ISBN: 9780134610993
4th Edition
Authors: Stuart Russell, Peter Norvig