Question:
Search engines play a central role in retrieving information from the web: more than 90% of online experiences begin with a search engine. Their architecture is complicated, but it has three major components: crawling, indexing, and ranking. Crawling explores the web to find new pages, and indexing stores information in a form that makes retrieval faster. In this assignment, your focus will be on the ranking component.

The estimated number of web pages indexed by Google is about 50 billion. Returning the pages most relevant to a query is one of the most important measures of a search engine's performance, and that depends on the ranking algorithm the search engine uses. Two important sources that ranking algorithms draw on are the content of web pages and the links between them.

In this part of the assignment, you will develop a content-based search engine called GoDuckDuck. Given a query, GoDuckDuck's task is to explore a set of text files and employ a ranking algorithm to return the file most relevant to the query, based on the files' content.

To implement GoDuckDuck, we first need to define the ranking algorithm. For a given query, the ranking algorithm computes a relevance score for each document; a greater relevance score ranks a document higher. The main input of the ranking algorithm is the frequency of query words in the different documents:

\[\mathrm{frequency}(\text{word}, \text{document}) = \text{number of occurrences of word in document}\]

The score of a document with respect to a query is computed as follows:

\[\mathrm{score}(\text{query}, \text{document}) = \sum_{\text{word} \in \text{query}} \mathrm{frequency}(\text{word}, \text{document})\]

That is, the score is the sum, over the words in the query, of each word's frequency in the document. The documents are ranked by their scores.

Imagine your document content is: "they ran into many problems in the last year. They are trying to solve them." and the user query is "they went".
First, we parse the query into its individual words, "they" and "went". The frequencies of these words are:

\[\mathrm{frequency}(\text{they}, \text{document}) = 2\]
\[\mathrm{frequency}(\text{went}, \text{document}) = 0\]
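The frequency and score definitions can be sketched in Python as follows. This is only one possible reading of the assignment, not a reference solution. The count of 2 for "they" suggests that matching ignores case and punctuation and counts whole words only (so "them" does not match); the code assumes exactly that. The helper names `tokenize`, `frequency`, and `score` are illustrative, and a word repeated in the query is counted once per repetition, which the assignment text does not specify.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation.

    Assumption: case-insensitive, whole-word matching, as the worked
    example (frequency of "they" = 2) appears to imply.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def frequency(word, document):
    """Number of occurrences of `word` among the document's words."""
    return Counter(tokenize(document))[word.lower()]


def score(query, document):
    """Sum of frequency(word, document) over the words in the query."""
    counts = Counter(tokenize(document))  # a missing word counts as 0
    return sum(counts[word] for word in tokenize(query))


document = ("they ran into many problems in the last year. "
            "They are trying to solve them.")
query = "they went"

print(frequency("they", document))  # 2
print(frequency("went", document))  # 0
print(score(query, document))       # 2
```

With these definitions, the score of the example document for the query "they went" is 2 + 0 = 2.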
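To show how the ranking step might work over a set of text files, here is a minimal Python sketch. It is hedged throughout: the function names (`words`, `score`, `rank_files`, `demo`), the `.txt` file pattern, the sample file names and contents (apart from the example document quoted in the question), and the alphabetical tie-break are all assumptions made for illustration. The assignment itself does not say how files are located or how ties are resolved.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def words(text):
    """Lowercased words of the text, punctuation removed (assumed tokenization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def score(query, text):
    """Total number of occurrences in `text` of the words in `query`."""
    counts = Counter(words(text))
    return sum(counts[w] for w in words(query))


def rank_files(query, folder):
    """Return (score, file name) pairs for every .txt file in `folder`, best first."""
    scored = [
        (score(query, path.read_text(encoding="utf-8")), path.name)
        for path in Path(folder).glob("*.txt")
    ]
    # Highest score first; ties broken by file name (an arbitrary choice).
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


def demo():
    """Build three hypothetical files in a temporary folder and rank them."""
    samples = {
        "a.txt": "they ran into many problems in the last year. "
                 "They are trying to solve them.",
        "b.txt": "We went home early.",
        "c.txt": "Nothing relevant here.",
    }
    with tempfile.TemporaryDirectory() as folder:
        for name, content in samples.items():
            Path(folder, name).write_text(content, encoding="utf-8")
        return rank_files("they went", folder)


ranking = demo()
print(ranking)  # [(2, 'a.txt'), (1, 'b.txt'), (0, 'c.txt')]
```

The first entry of the returned list is the most relevant file for the query under this scoring rule.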