Zipf's law describes a relationship among ranks and frequencies of words in natural languages. Given a...
Question:
Zipf's law describes a relationship between the ranks and frequencies of words in natural languages. Given a sample text (e.g., a book) and a word within the text, the frequency of the word is defined as the number of occurrences of the word within the text. The rank of a word is defined as the position of the word in a ranking of words by frequency (in descending order). Thus, the most common word, i.e., the one with the highest frequency, has rank 1, the second most common word has rank 2, and so on. If we denote by r the rank of a word, and by f its frequency, then, according to Zipf's law, we have that:

$f(r) = c\,r^{-s}$,

where c and s are parameters that depend on the particular text and language. It turns out that, for most texts and languages, s is (almost) 1, i.e., f is (almost) inversely proportional to r. This means that the frequency of the word with rank 2 will be approximately half that of the most common word, the frequency of the word with rank 4 approximately a factor of 4 smaller, and so on. In words, Zipf's law states that only a few words are used very often, while many or most are used rarely.

Although Zipf's law was first observed empirically in the field of quantitative linguistics, the same relationship occurs in many other rankings of human-created systems, such as the ranks of mathematical expressions or the ranks of notes in music, and even in uncontrolled environments, such as the population ranks of cities in various countries, corporation sizes, and income rankings, among others.

Plotting rank-frequency curves

In order to evaluate in a plot whether a measured rank-frequency relationship actually follows Zipf's law, it helps to take the base-10 logarithm on both sides of the equation above, leading to:

$\log_{10}(f(r)) = \log_{10}(c\,r^{-s}) = \log_{10}(c) - s\,\log_{10}(r)$.

Thus, if we rename $y = \log_{10}(f(r))$ and $x = \log_{10}(r)$, and plot y versus x, then we should get a straight line with slope $-s$ and intercept $\log_{10}(c)$.
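The straight-line relationship in log-log space can also be checked numerically. The following is a minimal sketch, not part of the assignment statement: the helper name `fit_zipf` is hypothetical, and it uses an ordinary least-squares fit in pure Python to estimate c and s from a list of frequencies already sorted in descending order. The input here is synthetic data that follows f(r) = 1000/r exactly, so the fit should recover values very close to c = 1000 and s = 1.

```python
import math


def fit_zipf(frequencies):
    """Least-squares fit of log10(f) = log10(c) - s * log10(r).

    `frequencies` must hold positive values sorted in descending order
    (at least two of them); the rank r is the 1-based position in the list.
    Returns the estimates (c, s).
    """
    xs = [math.log10(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log10(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                   # estimate of -s
    intercept = mean_y - slope * mean_x  # estimate of log10(c)
    return 10 ** intercept, -slope


# Synthetic frequencies following f(r) = 1000 / r (c = 1000, s = 1).
ideal = [1000 / r for r in range(1, 51)]
c, s = fit_zipf(ideal)
print(f"c = {c:.1f}, s = {s:.3f}")
```

For real books, the fitted s is only expected to be close to 1, and the points will scatter around the line rather than lie on it.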
(Recall that the general equation of a straight line y = g(x) is y = n + mx, where n is the intercept and m is the slope.)

Texts subject to analysis. File format explained

We will analyze four different classic books written in English, namely:
• From the Earth to the Moon, by Jules Verne
• Time Machine, by Herbert George Wells
• The Picture of Dorian Gray, by Oscar Wilde
• The Adventures of Sherlock Holmes, by Arthur Conan Doyle

The source of these books will be Project Gutenberg, an electronic library with more than 60,000 Ebooks (at the date of writing). Project Gutenberg's library only includes "public domain" works that are out of copyright. You will find the plain text file (ASCII) corresponding to each of the four books in a folder available at the Moodle page of the unit. You have to download these files into the same folder of your system where you downloaded this notebook.

A frequency table, main program data structure

The main data structure of the code that you have to write is a frequency table. A frequency table is built out of a given text. It is a data structure which, given a word, provides its frequency in the text. The first big decision to be made in this assignment is: which is the most appropriate Python data structure to store the frequency table? An appropriate data structure should let you efficiently add new elements to the table in a dynamic way (this excludes NumPy arrays), and should also be able to provide the frequency associated with a word in time independent of the number of elements of the table (this excludes lists).

Task 1 (2/22 points). Write a function that, given an already opened file object, modifies the file object such that it is positioned at the first line of the body of the book. The function is not meant to be a fruitful function, i.e., it does not return a value. Hint: you may consider the startswith method of type str in order to solve this task. (Recall that you can ask for help by calling help(str.startswith) in a code cell.)

Task 2 (4/22 points). Write a function that, given a line in the body of the Ebook (as a string) and the frequency table, processes the line and updates the frequency table according to the contents of the line. The function is not meant to be a fruitful function, i.e., it does not return a value. Some hints and considerations:
• Before splitting the line into words, replace hyphens (i.e., "-") with blank spaces (i.e., " "). You may find the replace method of type str useful.
• After splitting the line into words, the resulting words may have leading and/or trailing punctuation signs or blank spaces. The string module provides the punctuation variable, which contains all English punctuation signs. Figure out how you can use the strip method of str and this variable in order to get rid of the leading and/or trailing punctuation signs in each word.
• We will ignore the case of letters in our analysis. For example, we will consider "The" and "the" to be the same word. Thus, before accessing the table, you must transform all letters of the word into lower case. Figure out how the lower method of str can be helpful for this purpose.

Task 3 (4/22 points). Write a function that, given the file name of a plain text Gutenberg Ebook, returns its associated frequency table. The function MUST use the functions written in Tasks 1 and 2.

Task 4 (4/22 points). Write a function that, given a frequency table, returns a list of (frequency, word) pairs (i.e., tuples of two elements) in descending order by frequency. Write in a text cell the answer to the following questions: Which are the top-3 words and associated frequencies in the four books under study? Describe in your own words how the frequency decays with rank for the words in the top 3. Hint: you may consider the sort method of type list in order to solve this task. (Recall that you can ask for help by calling help(list.sort) in a code cell.)
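One possible way to approach Tasks 1 to 4 is sketched below, under several assumptions that the text above does not confirm. It assumes the body of each file is delimited by marker lines starting with "*** START OF" and "*** END OF" (adjust `START_MARKER` and `END_MARKER` to whatever the provided files actually contain), and it stores the frequency table in a dict, which allows dynamic insertion and lookups that do not slow down as the table grows. All function names are hypothetical, and `build_table_from` is an extra helper introduced only so the sketch can be run on an in-memory sample instead of a real file.

```python
import io
import string

# Assumption: marker lines delimiting the body; not stated in the task text.
START_MARKER = "*** START OF"
END_MARKER = "*** END OF"


def skip_header(file_obj):
    """Task 1: advance file_obj to the first line of the body (no return value)."""
    for line in file_obj:
        if line.startswith(START_MARKER):
            break


def process_line(line, table):
    """Task 2: update the frequency table in place with the words in line."""
    for word in line.replace("-", " ").split():
        # Remove leading/trailing punctuation and blanks, then lower-case.
        word = word.strip(string.punctuation + string.whitespace).lower()
        if word:  # skip tokens that were pure punctuation
            table[word] = table.get(word, 0) + 1


def build_table_from(file_obj):
    """Hypothetical helper: build the table from an already opened file."""
    table = {}
    skip_header(file_obj)
    for line in file_obj:
        if line.startswith(END_MARKER):
            break
        process_line(line, table)
    return table


def frequency_table(filename):
    """Task 3: return the frequency table of the book stored in filename."""
    with open(filename, encoding="utf-8") as book:
        return build_table_from(book)


def ranked(table):
    """Task 4: return (frequency, word) pairs, highest frequency first."""
    pairs = [(freq, word) for word, freq in table.items()]
    pairs.sort(reverse=True)
    return pairs


# In-memory stand-in for a book file, used only to exercise the functions.
sample = io.StringIO(
    "Header line\n"
    "*** START OF THE SAMPLE EBOOK ***\n"
    "The cat saw the well-fed dog.\n"
    "The dog slept.\n"
    "*** END OF THE SAMPLE EBOOK ***\n"
    "Footer line\n"
)
top_two = ranked(build_table_from(sample))[:2]
print(top_two)  # prints [(3, 'the'), (2, 'dog')]
```

Note that sorting (frequency, word) tuples with `reverse=True` breaks ties between equally frequent words by reverse alphabetical order; this is a side effect of the sketch, not a requirement of the task.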