Question:

This assignment requires you to develop a topical (focused) crawler that crawls 500 pages from Wikipedia on a topic of your choice. You need to specify: 1) the topic, 2) at least 10 related terms (single words or phrases), and 3) at least 2 seed URLs. During crawling, determine whether each page is relevant to the topic before saving it to the crawled collection: a page is relevant if it contains at least 2 different related terms from your list. The relevance check should be case-insensitive. For example, if the topic is Information Retrieval, the seed URLs can be:

http://en.wikipedia.org/wiki/Information_retrieval and http://en.wikipedia.org/wiki/Search_engine_(computing).

Example related terms for the topic information retrieval might be: Information Retrieval, Crawler, Search Engine, tf-idf, Mean Average Precision, Precision, Recall, Relevance Feedback, Query Expansion, Retrieval Models, Boolean Model, Vector Space Model, and Language Model.
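
The relevance rule described above (at least 2 different related terms, matched case-insensitively) can be sketched as follows. The function names `matching_terms` and `is_relevant` are illustrative, and plain substring matching is an assumption: the assignment only says a page must "contain" the terms.

```python
# Related terms taken from the example list in the assignment.
RELATED_TERMS = [
    "Information Retrieval", "Crawler", "Search Engine", "tf-idf",
    "Mean Average Precision", "Precision", "Recall", "Relevance Feedback",
    "Query Expansion", "Retrieval Models", "Boolean Model",
    "Vector Space Model", "Language Model",
]


def matching_terms(text, terms=RELATED_TERMS):
    """Return the set of distinct terms (lowercased) that occur in text."""
    lowered = text.lower()  # lowercase the page once for case-insensitivity
    # Lowercasing the terms as well also removes duplicates such as
    # "Recall" and "recall" if both appear in the list.
    return {term.lower() for term in terms if term.lower() in lowered}


def is_relevant(text, terms=RELATED_TERMS, min_terms=2):
    """A page is relevant if it contains at least min_terms different terms."""
    return len(matching_terms(text, terms)) >= min_terms
```

One consequence of substring matching: a page that mentions only "Mean Average Precision" also matches "Precision" and so counts as two terms. If that is not wanted, avoid terms that are contained in other terms, or switch to a stricter matching rule.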

Python please, with comments!
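
A minimal sketch of one possible crawler, using only the Python standard library. Several things here are assumptions rather than requirements of the assignment: the breadth-first order, following links only from relevant pages, the 1-second delay, the User-Agent string, the rule that skips namespaced URLs such as `File:` or `Special:`, and every function and class name.

```python
import re
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

TOPIC = "Information Retrieval"
RELATED_TERMS = [
    "information retrieval", "crawler", "search engine", "tf-idf",
    "mean average precision", "precision", "recall", "relevance feedback",
    "query expansion", "retrieval models", "boolean model",
    "vector space model", "language model",
]
SEED_URLS = [
    "http://en.wikipedia.org/wiki/Information_retrieval",
    "http://en.wikipedia.org/wiki/Search_engine_(computing)",
]


class PageParser(HTMLParser):
    """Collect link targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


def is_article_url(url):
    """Accept only English Wikipedia article URLs (assumed filter)."""
    parts = urlparse(url)
    return (parts.netloc == "en.wikipedia.org"
            and parts.path.startswith("/wiki/")
            and ":" not in parts.path)  # skips File:, Special:, Help: ...


def fetch(url):
    """Download a page and return its HTML, or '' if the request fails."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "focused-crawler-sketch/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl(seeds, terms, max_pages=500, fetch_page=fetch, delay=1.0):
    """Breadth-first focused crawl; returns {url: html} for relevant pages."""
    terms = [t.lower() for t in terms]
    frontier = deque(seeds)  # URLs waiting to be visited
    seen = set(seeds)        # URLs already queued, to avoid repeats
    saved = {}
    while frontier and len(saved) < max_pages:
        url = frontier.popleft()
        html = fetch_page(url)
        if not html:
            continue
        parser = PageParser()
        parser.feed(html)
        # Collapse whitespace so multi-word terms split across tags match.
        text = re.sub(r"\s+", " ", " ".join(parser.text_parts)).lower()
        matched = {t for t in terms if t in text}
        if len(matched) < 2:
            continue  # not relevant: neither saved nor expanded
        saved[url] = html
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #anchor
            if is_article_url(link) and link not in seen:
                seen.add(link)
                frontier.append(link)
        time.sleep(delay)  # pause between relevant pages to limit load
    return saved
```

Calling `crawl(SEED_URLS, RELATED_TERMS)` would run against live Wikipedia and return up to 500 relevant pages, which can then be written to files. The `fetch_page` parameter lets you pass a stand-in function so the logic can be tried on a few in-memory pages first.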
