What are the 50 most common words and their frequencies on the CDM website? Write python...
Question:
What are the 50 most common words and their frequencies on the CDM website? Write Python code to answer this question. Write the result to an output file.

Specifications:
1. Start crawling from 'http://www.cdm.depaul.edu/'.
2. Never visit the same page more than once.
3. Visit only pages WITHIN the CDM domain, that is, URLs whose absolute form begins with "http://www.cdm.depaul.edu/". Do not visit external sites.
4. When you process the 'data' (passed to the 'handle_data(data)' method defined in the Python HTMLParser class, which your 'Collector' class inherits, assuming you used the code shown in the lecture PPT), convert all data to lower case.
5. In the 50 most common words, DO NOT include stopwords (e.g. 'the', 'a'). Stopwords, for the purpose of our assignment, are defined in the file "M6_stopwords.txt" (newly) posted on D2L, under the Module 06 Assignments.

Some Hints:
• After creating an absolute URL (in Collector), if the final URL contains 'mailto', 'img', or 'course-evaluations', do NOT traverse the link. If you do, your code will error.
• In Python HTMLParser, when feed() is called, the order of the tag/data detection sequence is:
  1st handle_starttag()
  2nd handle_data()
  3rd handle_endtag()
• A big, annoying difficulty is that the data passed to handle_data(), which you will have to override in your Collector class, may come from irrelevant/unwanted sections, such as a section started by the tag <script>, <meta>, or <code>. You do NOT want to process data from those sections. To that end, first store the tag that was detected in handle_starttag(). Then, when handle_data() is invoked (automatically), check the tag you stored for the data section, and if the tag is one of the unwanted tags, ignore the data extracted from the (tagged) section. You can use this list of unwanted tags: ['script', 'noscript', 'input', 'meta', 'title', 'style', 'form'].
• Be sure to remove punctuation, such as ',', '.', '?', and '!', from the tokens in data. Note that any number of punctuation marks (not just one) may be attached to a word, as in "okay?!!" and "<-good".
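The text-processing hints in the question (remember the last start tag, skip data from unwanted tags, lower-case, strip punctuation, drop stopwords, count) could be sketched roughly as below. This is a minimal illustration, not the lecture's Collector code: the names TextCollector, clean_tokens, top_words, and write_results are hypothetical, the stopwords are passed in as a plain set because the format of "M6_stopwords.txt" is not given, and only leading and trailing punctuation is stripped, so a word like "don't" keeps its apostrophe.

```python
import string
from collections import Counter
from html.parser import HTMLParser

# Unwanted tags, copied from the hint in the question.
UNWANTED_TAGS = {'script', 'noscript', 'input', 'meta', 'title', 'style', 'form'}


def clean_tokens(data):
    """Lower-case the text, split on whitespace, strip punctuation from both ends."""
    words = []
    for token in data.lower().split():
        word = token.strip(string.punctuation)  # handles "okay?!!" and "<-good"
        if word:  # tokens made only of punctuation become empty and are dropped
            words.append(word)
    return words


class TextCollector(HTMLParser):
    """Hypothetical stand-in for the text-handling part of the Collector class."""

    def __init__(self):
        super().__init__()
        self.current_tag = None  # most recent start tag seen
        self.words = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_data(self, data):
        # Ignore data whose most recent start tag is in the unwanted list.
        if self.current_tag in UNWANTED_TAGS:
            return
        self.words.extend(clean_tokens(data))


def top_words(html, stopwords, n=50):
    """Return the n most common non-stopword words in one HTML document."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    counts = Counter(w for w in parser.words if w not in stopwords)
    return counts.most_common(n)


def write_results(pairs, path):
    """Write one 'word<TAB>count' line per (word, count) pair."""
    with open(path, 'w', encoding='utf-8') as out:
        for word, count in pairs:
            out.write(f'{word}\t{count}\n')
```

For a whole crawl, the same Counter would be updated once per visited page before calling most_common(50). Because only the last start tag is remembered, text that follows a closed unwanted section without a new start tag in between is also skipped; that is a limitation of the hinted approach rather than something this sketch corrects.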
Expert Answer:
Here is the Python code to find the 50 most common words and their frequencies on the CDM website:

import requests
from bs4 import BeautifulSoup
...
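The answer text is cut off after its imports. As a separate sketch, not a reconstruction of it, the crawling rules from the question (start at the CDM root, never revisit a page, stay inside the domain, skip links containing 'mailto', 'img', or 'course-evaluations') could be written with only the standard library roughly as follows. The names LinkCollector, should_visit, fetch_page, and crawl are hypothetical, as are the max_pages limit and the choice to drop URL fragments so that "page#top" and "page" count as the same page.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

ROOT = 'http://www.cdm.depaul.edu/'
SKIP_SUBSTRINGS = ('mailto', 'img', 'course-evaluations')  # from the hint


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Resolve relative links, then drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def should_visit(url, visited):
    """Apply specifications 2 and 3 plus the 'do NOT traverse' hint."""
    return (url.startswith(ROOT)
            and url not in visited
            and not any(s in url for s in SKIP_SUBSTRINGS))


def fetch_page(url):
    """Download one page as text (assumes UTF-8; undecodable bytes are replaced)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode('utf-8', errors='replace')


def crawl(start=ROOT, fetch=fetch_page, max_pages=None):
    """Return {url: html} for every page reached from start, each fetched once."""
    visited = set()
    pages = {}
    frontier = [start]
    while frontier:
        if max_pages is not None and len(visited) >= max_pages:
            break
        url = frontier.pop()
        if not should_visit(url, visited):
            continue
        visited.add(url)
        try:
            html = fetch(url)
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            continue
        pages[url] = html
        collector = LinkCollector(url)
        collector.feed(html)
        collector.close()
        frontier.extend(collector.links)
    return pages
```

Passing the fetch function as a parameter keeps the traversal logic testable without network access; a real run would call crawl() with the defaults and feed each returned page to the word-counting step.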