Question:

Basic Java: Intro to web search course

Elasticsearch Assignment

Assignment 1: indexing for web search

Your task is to apply your IR skills to build a processing pipeline that turns a Web site into structured knowledge. Your system should take HTML pages as input, process them using the kind of techniques that we have been looking at in the module, and output an index of terms identified in the documents.

This assignment comes in stages. Marks are given for each stage. You may choose not to attempt some stages. You might also implement a system that does not strictly follow the stages but will work in the same way. The stages are as follows:

Input/Output (10%) The system must be able to read Web pages (a small number will do here and they can be stored locally) and produce appropriately formatted output. The Web pages should be processed one at a time using the steps outlined below.
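The input/output stage could be sketched as follows. This is a minimal sketch, assuming the pages are stored as `.html` or `.htm` files in one local directory and that a tab-separated line per page is an acceptable output format; the class name `PageReader` and the output layout are illustrative choices, not part of the assignment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for the I/O stage (name and format are assumptions).
class PageReader {

    /** Lists the .html/.htm files of a local directory in sorted order. */
    static List<Path> listPages(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(Files::isRegularFile)
                .filter(p -> {
                    String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                    return name.endsWith(".html") || name.endsWith(".htm");
                })
                .sorted()
                .collect(Collectors.toList());
        }
    }

    /** Reads one page as text, assuming it is UTF-8 encoded. */
    static String readPage(Path page) throws IOException {
        return new String(Files.readAllBytes(page), StandardCharsets.UTF_8);
    }

    /** Formats one output line: file name, a tab, then the comma-separated terms. */
    static String formatEntry(Path page, List<String> terms) {
        return page.getFileName() + "\t" + String.join(", ", terms);
    }
}
```

Calling `listPages` once and then `readPage` inside a loop keeps the processing to one page at a time, as the stage description asks.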

HTML Parsing (10%) Before the text can be analyzed it is necessary to get rid of the HTML tags. The result will be plain text. Note that if you simply delete all HTML tags, you will lose information such as meta tag keywords. Therefore, I strongly suggest that you use some tool to perform this task.
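One way to strip tags while keeping the meta keywords is to use a real HTML parser and collect both the text nodes and the `meta` attributes. The sketch below uses the parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`) only to stay dependency-free; a dedicated library would likely be more robust on modern pages. The class name `HtmlTextExtractor` is an illustrative assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical extractor: collects visible text and <meta name="keywords"> content.
class HtmlTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder keywords = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        // Called for each run of character data between tags.
        text.append(data).append(' ');
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // <meta> has no closing tag, so it arrives here rather than in handleStartTag.
        if (tag == HTML.Tag.META) {
            Object name = attrs.getAttribute(HTML.Attribute.NAME);
            Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
            if (name != null && content != null
                    && "keywords".equalsIgnoreCase(name.toString())) {
                keywords.append(content).append(' ');
            }
        }
    }

    String text() { return text.toString().trim(); }

    String keywords() { return keywords.toString().trim(); }

    /** Parses an HTML string and returns the filled extractor. */
    static HtmlTextExtractor parse(String html) {
        HtmlTextExtractor extractor = new HtmlTextExtractor();
        try {
            // 'true' tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(new StringReader(html), extractor, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return extractor;
    }
}
```

The keywords are kept separate from the body text so that a later stage can decide how much weight to give them.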

Pre-processing: Sentence Splitting, Tokenization and Normalization (10%) The next step should be to transform the input text into a normal form of your choice.
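The pre-processing stage might look like the sketch below, assuming English text. It splits sentences with the JDK's `BreakIterator`, tokenizes with a regular expression, and takes Unicode NFKC normalization plus lowercasing as the "normal form"; that choice of normal form and the class name `Preprocessor` are assumptions, since the assignment leaves them open.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing helper (sentence splitting, tokenization, normalization).
class Preprocessor {
    // A token is a run of letters/digits, optionally joined by internal ' or -.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{Nd}]+(?:['-][\\p{L}\\p{Nd}]+)*");

    /** Splits text into sentences using the locale-aware sentence iterator. */
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    /** Extracts tokens from one sentence and normalizes each of them. */
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(normalize(m.group()));
        }
        return tokens;
    }

    /** Normal form assumed here: Unicode NFKC, then lowercase. */
    static String normalize(String token) {
        return Normalizer.normalize(token, Normalizer.Form.NFKC).toLowerCase(Locale.ROOT);
    }
}
```

Lowercasing before POS tagging can hide proper nouns, so a pipeline may prefer to tag the original tokens and normalize afterwards.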

Part-of-Speech Tagging (10%) The input should be tagged with a suitable part-of-speech tagger, so that the result can then be processed in the next steps.
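A real submission would call a trained part-of-speech tagger from an open-source toolkit. To show only the shape of the stage (tokens in, token/tag pairs out), here is a deliberately naive stand-in that uses a tiny lexicon and suffix rules with Penn Treebank style tag names. The class `ToyTagger`, its lexicon, and its rules are all hypothetical, and its accuracy is far below that of a trained tagger.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy rule-based tagger: a placeholder for a trained tagger, NOT a substitute for one.
class ToyTagger {
    private static final Map<String, String> LEXICON = new HashMap<>();
    static {
        for (String w : new String[] {"the", "a", "an", "this", "that"}) LEXICON.put(w, "DT");
        for (String w : new String[] {"of", "in", "on", "for", "with", "by"}) LEXICON.put(w, "IN");
        for (String w : new String[] {"and", "or", "but"}) LEXICON.put(w, "CC");
        for (String w : new String[] {"is", "has"}) LEXICON.put(w, "VBZ");
        for (String w : new String[] {"are", "have"}) LEXICON.put(w, "VBP");
    }

    /** Tags one token: lexicon lookup first, then crude suffix heuristics. */
    static String tag(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        String known = LEXICON.get(t);
        if (known != null) return known;
        if (t.matches("\\d+")) return "CD";
        if (t.endsWith("ing")) return "VBG";
        if (t.endsWith("ed")) return "VBD";
        if (t.endsWith("ly")) return "RB";
        if (t.endsWith("s") && !t.endsWith("ss") && t.length() > 3) return "NNS";
        return "NN"; // default guess: singular noun
    }

    /** Returns "token/TAG" strings so later stages can filter by tag. */
    static List<String> tagAll(List<String> tokens) {
        List<String> tagged = new ArrayList<>();
        for (String token : tokens) {
            tagged.add(token + "/" + tag(token));
        }
        return tagged;
    }
}
```

Keeping the output as token/tag pairs means the toy tagger can be swapped for a real one without touching the keyword selection stage.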

Selecting Keywords (20%) One aim of your system is to identify the words or phrases in the text that are most useful for indexing purposes. Your system should remove words which are not useful, such as very frequent words or stopwords. You should develop a selection method, possibly using POS tags (e.g. nouns and noun phrases) in combination with statistical/frequency information (e.g. using term frequency).
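A simple selection method along these lines is stopword removal followed by ranking on raw term frequency. The sketch below assumes the tokens are already lowercased; the short stopword list, the minimum token length of two, and the class name `KeywordSelector` are illustrative assumptions, and a POS filter (for example, keeping only nouns) could be applied to the tokens beforehand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical keyword selector: stopword filter plus term-frequency ranking.
class KeywordSelector {
    // Deliberately tiny list; a real system would load a fuller one from a file.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
        "is", "it", "of", "on", "or", "that", "the", "to", "with"));

    /** Counts occurrences of every token that is not a stopword. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            if (token.length() < 2 || STOPWORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns up to k terms, most frequent first, ties broken alphabetically. */
    static List<String> topKeywords(List<String> tokens, int k) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<>(termFrequencies(tokens).entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> keywords = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            keywords.add(entries.get(i).getKey());
        }
        return keywords;
    }
}
```

Raw term frequency favours long pages; weighting by inverse document frequency across the collection would be a natural extension.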

Stemming or Morphological Analysis (10%) Writing word stems to the database rather than words allows various inflected forms of a word to be treated in the same way, e.g. bus and busses refer to exactly the same thing even though they are different words.
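To illustrate what suffix stripping looks like, the sketch below implements only the plural-removal step (step 1a) of the Porter stemming algorithm. It is not a full stemmer, and the class name `PluralStemmer` is an assumption; in practice an existing open-source stemmer or morphological analyzer would be plugged in here.

```java
import java.util.Locale;

// Step 1a of the Porter algorithm only (plural suffixes); not a complete stemmer.
class PluralStemmer {

    /** Applies the first matching rule: SSES -> SS, IES -> I, SS -> SS, S -> (nothing). */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("sses")) {
            return w.substring(0, w.length() - 2); // caresses -> caress
        }
        if (w.endsWith("ies")) {
            return w.substring(0, w.length() - 2); // ponies -> poni
        }
        if (w.endsWith("ss")) {
            return w;                              // caress -> caress
        }
        if (w.endsWith("s")) {
            return w.substring(0, w.length() - 1); // cats -> cat
        }
        return w;
    }
}
```

This step maps pages and page to the same stem, but on its own it turns busses into buss, so it does not conflate every pair; that is one reason to prefer a complete stemmer or a dictionary-based analyzer.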

Engineering a Complete System (10%) The final system should have control over all the individual components so that there is a single call and all the above steps will be performed.
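The "single call" requirement can be met by a small driver that owns every stage and exposes one entry point. The sketch below is one possible design, with each stage passed in as a function so components can be swapped; the class name `IndexingPipeline` and the stage signatures are assumptions, and POS tagging is assumed to happen inside the keyword selection function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical driver that chains the stages behind one call.
class IndexingPipeline {
    private final Function<String, String> htmlToText;
    private final Function<String, List<String>> tokenizer;
    private final Function<List<String>, List<String>> keywordSelector;
    private final Function<String, String> stemmer;

    IndexingPipeline(Function<String, String> htmlToText,
                     Function<String, List<String>> tokenizer,
                     Function<List<String>, List<String>> keywordSelector,
                     Function<String, String> stemmer) {
        this.htmlToText = htmlToText;
        this.tokenizer = tokenizer;
        this.keywordSelector = keywordSelector;
        this.stemmer = stemmer;
    }

    /** Runs every stage on one page and returns its index terms. */
    List<String> process(String html) {
        String text = htmlToText.apply(html);
        List<String> tokens = tokenizer.apply(text);
        List<String> keywords = keywordSelector.apply(tokens);
        List<String> terms = new ArrayList<>();
        for (String keyword : keywords) {
            terms.add(stemmer.apply(keyword));
        }
        return terms;
    }

    /** Single entry point: builds an inverted index (term -> page names). */
    Map<String, Set<String>> index(Map<String, String> htmlByPageName) {
        Map<String, Set<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, String> page : htmlByPageName.entrySet()) {
            for (String term : process(page.getValue())) {
                inverted.computeIfAbsent(term, t -> new TreeSet<>()).add(page.getKey());
            }
        }
        return inverted;
    }
}
```

Passing the stages in as functions also makes it easy to test the driver with trivial stand-ins before the real components exist.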

You will have noticed that the percentages above only add up to 80%. This is because one of the important aspects of the project is that your work should be well documented and your code well commented. 20% of your mark will come from this. You should submit:

A description of your implementation: what the code does, and the software you used

Unedited and commented output from a run of the code submitted using these web pages:

(feel free to submit other runs as well, i.e. using Web pages of your own choice)

A short discussion of your solution focussing on functionality implemented and possible improvements and extensions.

You can implement your system on either the Linux or the Windows machines. Perl, Java, Python, C/C++, and shell scripts are good choices for this project, but you are by no means restricted to those languages. Identify suitable open-source tools that help you build your pipeline.
