Using Python 3.7 in PyCharm

This assignment requires you to develop a topical (focused) crawler that crawls 500 pages from Wikipedia on a topic of your choice. You need to specify:

1) the topic,
2) at least 10 related terms (single words or phrases), and
3) at least 2 seed URLs.

During crawling, you need to determine whether each page is relevant to the topic by checking that it contains at least 2 different related terms from your list before saving it into the crawled collection. The page-relevance check should be case-insensitive.

For example, if the topic is Information Retrieval, the related terms might be: Information Retrieval, Crawler, Search Engine, tf-idf, Mean Average Precision, Precision, Recall, Relevance Feedback, Query Expansion, Retrieval Models, Boolean Model, Vector Space Model, and Language Model.

You can use any programming language that you are comfortable with, and you are free to reference and customize code found online.

Prepare a file folder that contains 2 sub-folders:

1. The first sub-folder has all the crawled pages.
2. The second sub-folder has the source code and a report. The report must have the following:
2a. The topic of your choice, at least 10 related terms, and at least 2 seed URLs.
2b. How the crawler is implemented, the number of pages crawled, and the URLs of all crawled pages.
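One possible shape for the crawler described above is sketched below, using only the Python 3.7 standard library. It is a minimal sketch under stated assumptions, not a reference solution: the names `PageParser`, `is_relevant`, `article_links`, `fetch`, and `crawl` are hypothetical, the term list is copied from the assignment's Information Retrieval example, and the choices to match terms as plain substrings of the visible text, to crawl breadth-first, to follow links only from relevant pages, and to skip any `/wiki/` path containing a colon are all assumptions that the assignment does not prescribe.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen

# Related terms copied from the assignment's Information Retrieval example.
RELATED_TERMS = [
    "information retrieval", "crawler", "search engine", "tf-idf",
    "mean average precision", "precision", "recall", "relevance feedback",
    "query expansion", "retrieval models", "boolean model",
    "vector space model", "language model",
]


class PageParser(HTMLParser):
    """Collects link targets and visible text, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


def is_relevant(text, terms, min_terms=2):
    """True if text contains at least min_terms different terms, ignoring case.

    Assumption: substring matching, so "mean average precision" also
    counts as a hit for "precision".
    """
    lowered = text.lower()
    found = {t.lower() for t in terms if t.lower() in lowered}
    return len(found) >= min_terms


def article_links(hrefs, base_url):
    """Resolve hrefs and keep only en.wikipedia.org article URLs."""
    for href in hrefs:
        url = urldefrag(urljoin(base_url, href)).url  # drop "#section"
        parts = urlparse(url)
        # Assumption: a colon in the path marks a namespace (File:, Help:, ...).
        if (parts.netloc == "en.wikipedia.org"
                and parts.path.startswith("/wiki/")
                and ":" not in parts.path):
            yield url


def fetch(url):
    """Download one page; the User-Agent value is a placeholder."""
    req = Request(url, headers={"User-Agent": "FocusedCrawlerCoursework/0.1"})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seeds, terms, fetch_page=fetch, max_pages=500, delay=0.5):
    """Breadth-first focused crawl; returns {url: html} for relevant pages."""
    frontier = deque(seeds)
    seen = set(seeds)
    saved = {}
    while frontier and len(saved) < max_pages:
        url = frontier.popleft()
        if delay:
            time.sleep(delay)  # be polite to the server
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        parser = PageParser()
        parser.feed(html)
        parser.close()
        if not is_relevant(" ".join(parser.text_parts), terms):
            continue  # irrelevant page: neither saved nor expanded
        saved[url] = html
        for link in article_links(parser.links, url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return saved
```

A run might look like `pages = crawl(seed_urls, RELATED_TERMS)`, where `seed_urls` is your own list of at least 2 Wikipedia URLs. The returned dictionary keeps the pages in the order they were accepted, so its keys can be written into the report as the list of crawled URLs and its values written to files in the first sub-folder. Passing a different `fetch_page` function makes it possible to exercise the crawl loop on canned HTML without touching the network.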
