Zipf's law describes a relationship among ranks and frequencies of words in natural languages. Given a...
Question:
![](https://dsd5zvtm8ll6.cloudfront.net/si.experts.images/questions/2023/09/64f1897e9a61e_1693550972314.jpg)
![](https://dsd5zvtm8ll6.cloudfront.net/si.experts.images/questions/2023/09/64f1899b9493a_1693551001491.jpg)
Transcribed Image Text:
Zipf's law describes a relationship between the ranks and frequencies of words in natural languages. Given a sample text (e.g., a book) and a word within the text, the frequency of the word is defined as the number of occurrences of the word within the text. The rank of a word is defined as the position of the word in a ranking of words by frequency (in descending order). Thus, the most common word, i.e., the one with the highest frequency, has rank 1, the second most common word has rank 2, and so on. If we denote by $r$ the rank of a word and by $f$ its frequency, then, according to Zipf's law, we have that:

$$f(r) = c\,r^{-s},$$

where $c$ and $s$ are parameters which depend on the particular text and language. It turns out that, for most texts and languages, $s$ is (almost) 1, i.e., $f$ is (almost) inversely proportional to $r$. This means that the frequency of the word with rank 2 will be approximately half that of the most common word, that of the word with rank 4 approximately a factor of 4 smaller, and so on. In words, Zipf's law states that only a few words are used very often, while many or most are used rarely.

Although Zipf's law was first observed empirically in the field of quantitative linguistics, the same relationship occurs in many other rankings of human-created systems, such as the ranks of mathematical expressions or the ranks of notes in music, and even in uncontrolled environments, such as the population ranks of cities in various countries, corporation sizes, and income rankings, among others.

Plotting rank-frequency curves

In order to evaluate in a plot whether a measured rank-frequency relationship actually follows Zipf's law, it helps to take the base-10 logarithm on both sides of the equation above, leading to:

$$\log_{10}(f(r)) = \log_{10}(c\,r^{-s}) = \log_{10}(c) - s\,\log_{10}(r).$$

Thus, if we rename $y = \log_{10}(f(r))$ and $x = \log_{10}(r)$, and plot $y$ versus $x$, then we should get a straight line with slope $-s$ and intercept $\log_{10}(c)$.
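The linearization above can be checked numerically. The following is only a minimal sketch, not part of the assignment: it fits a least-squares line to the points (log10 r, log10 f) and reads the parameters back from the slope and intercept. The function name `fit_zipf` and the synthetic data are assumptions introduced for illustration.

```python
import math


def fit_zipf(frequencies):
    """Fit log10(f) = log10(c) - s * log10(r) by ordinary least squares.

    `frequencies` is assumed to be sorted in descending order, so the
    rank of each entry is its position plus one. Returns (c, s).
    """
    xs = [math.log10(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log10(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # estimates -s
    intercept = mean_y - slope * mean_x  # estimates log10(c)
    return 10 ** intercept, -slope


# Synthetic, exactly Zipfian frequencies (made up): c = 1000, s = 1.
ideal = [1000 / rank for rank in range(1, 51)]
c_hat, s_hat = fit_zipf(ideal)
```

For real word counts the points will only roughly follow a line, so the fitted values are estimates; the same `xs` and `ys` lists are what one would hand to a plotting library to draw the rank-frequency curve.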
(Recall that the general equation of a straight line $y = g(x)$ is $y = n + mx$, where $n$ is the intercept and $m$ is the slope.)

Texts subject to analysis. File format explained

We will analyze four different classic books written in English, namely:

• From the Earth to the Moon, by Jules Verne
• Time Machine, by Herbert George Wells
• The Picture of Dorian Gray, by Oscar Wilde
• The Adventures of Sherlock Holmes, by Arthur Conan Doyle

The source of these books will be Project Gutenberg, an electronic library with more than 60,000 Ebooks (at the date of writing). Project Gutenberg's library only includes "public domain" works that are out of copyright. You will find the plain text file (ASCII) corresponding to each of the four books in a folder available at the Moodle page of the unit. You have to download these files into the same folder on your system where you downloaded this notebook.

A frequency table, main program data structure

The main data structure of the code that you have to write is a frequency table. A frequency table is built out of a given text. It is a data structure which, given a word, provides its frequency in the text. The first big decision to be made in this assignment is: which is the most appropriate Python data structure to store the frequency table? An appropriate data structure should let you efficiently add new elements to the table in a dynamic way (this excludes NumPy arrays), and should also be able to provide the frequency associated with a word in time independent of the number of elements in the table (this excludes lists).

Task 1 (2/22 points). Write a function that, given an already opened file object, advances the file object so that it is positioned at the first line of the body of the book. The function is not meant to be a fruitful function, i.e., it does not return a value. Hint: you may consider the startswith method of type str in order to solve this task. (Recall that you can ask for help by calling help(str.startswith) in a code cell.)

Task 2 (4/22 points). Write a function that, given a line in the body of the Ebook (as a string) and the frequency table, processes the line and updates the frequency table in accordance with the contents of the line. The function is not meant to be a fruitful function, i.e., it does not return a value. Some hints and considerations:

• Before splitting the line into words, replace hyphens (i.e., "-") with blank spaces (i.e., " "). You may find it useful to use the replace method of type str.
• After splitting the line into words, the resulting words may have leading and/or trailing punctuation signs or blank spaces. The string module provides the punctuation variable, which contains all English punctuation signs. Figure out how you can use the strip method of str and this variable in order to get rid of the leading and/or trailing punctuation signs in each word.
• We will ignore the case of letters in our analysis. Thus, for example, we will consider "The" and "the" to be the same word. Therefore, before accessing the table, you must transform all letters of the word into lower case. Figure out how the lower method of str can be helpful for this purpose.

Task 3 (4/22 points). Write a function that, given the file name of a plain text Gutenberg Ebook, returns its associated frequency table. The function MUST use the functions written in Tasks 1 and 2.

Task 4 (4/22 points). Write a function that, given a frequency table, returns a list of (frequency, word) pairs (i.e., tuples of two elements) in descending order by frequency. Write in a text cell the answer to the following questions: Which are the top-3 words and associated frequencies in the four books under study? Describe in your own words how the frequency decays with rank for the words in the top-3. Hint: you may consider the sort method of type list in order to solve this task. (Recall that you can ask for help by calling help(list.sort) in a code cell.)
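One way the four tasks above could fit together is sketched below. This is a hedged illustration under stated assumptions, not the reference solution: a `dict` is used as the frequency table because it meets the two stated requirements (dynamic insertion, lookup time independent of table size); the marker string `*** START OF` is an assumption about how the header of a Project Gutenberg file ends and should be checked against the actual files; and all function names, including the helper `count_words`, are hypothetical.

```python
import io
import string

# Assumption: the body starts right after a line beginning with this marker.
START_MARKER = "*** START OF"


def skip_header(file_obj):
    """Task 1 sketch: advance file_obj past the header. Returns nothing."""
    for line in file_obj:
        if line.startswith(START_MARKER):
            break


def process_line(line, table):
    """Task 2 sketch: update the frequency table in place. Returns nothing."""
    for raw in line.replace("-", " ").split():
        word = raw.strip(string.punctuation).lower()
        if word:  # skip tokens that were pure punctuation
            table[word] = table.get(word, 0) + 1


def count_words(file_obj):
    """Hypothetical helper: build the table from an already opened file."""
    table = {}
    skip_header(file_obj)
    for line in file_obj:
        process_line(line, table)
    return table


def build_table(filename):
    """Task 3 sketch: open the book and return its frequency table."""
    with open(filename, encoding="utf-8") as book:
        return count_words(book)


def rank_words(table):
    """Task 4 sketch: (frequency, word) pairs, highest frequency first."""
    pairs = [(freq, word) for word, freq in table.items()]
    pairs.sort(reverse=True)  # ties end up in reverse alphabetical order
    return pairs


# Made-up stand-in for a book file, used only to exercise the functions.
sample = io.StringIO(
    "Title: Made-up sample\n"
    "*** START OF THIS SAMPLE ***\n"
    "The cat -- the well-fed cat -- sat.\n"
    '"The end," it said.\n'
)
table = count_words(sample)
ranking = rank_words(table)
```

With a real book one would call `rank_words(build_table(filename))` and inspect the first three entries. Sorting tuples compares the frequency first, which is why the pairs are stored as (frequency, word) rather than the other way round.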