Question: Spark Code in Java
You are to develop a batch-based text search and filtering pipeline in Apache Spark. The core goal of this pipeline is to take in a large set of text documents and a set of user-defined queries, then, for each query, rank the text documents by relevance and filter out any overly similar documents in the final ranking. The top 10 documents for each query should be returned as output.

Each document and query should be processed to remove stopwords (words with little discriminative value, e.g. "the") and apply stemming (which converts each word into its stem, a shorter form that helps with term mismatch between documents and queries). Documents should be scored using the DPH ranking model. As a final stage, the ranking of documents for each query should be analysed to remove unneeded redundancy (near-duplicate documents): if any pair of documents is found where their titles have a textual distance (using a comparison function provided) of less than 0.5, you should keep only the more relevant of the two (based on the DPH score). Note that 10 documents should still be returned for each query, even after redundancy filtering.

You will be provided with a Java template project like the tutorials already provided. Your role is to implement the necessary Spark functions to get from a Dataset (the input documents) and a Dataset (the queries to rank for) to a List (a ranking of 10 documents for each query). Your solution should only include Spark transformations and actions, apart from any final processing you choose to do within the driver program. You should not perform any offline computation (e.g. pre-constructing a search index); all processing should happen during the lifecycle of the Spark app.

The template project provides implementations of the following to help you:

- Loading of the query set and converting it to a Dataset.
- Loading of the news articles and converting it to a Dataset.
- A static text pre-processor function that converts a piece of text to its tokenised, stopword-removed and stemmed form. It takes in a String (the input text) and outputs a List (the remaining terms from the input after tokenisation, stemming and stopword removal).
- A static DPH scoring function that calculates a score for a (term, document) pair given the following information:
  - the term frequency (count) of the term in the document;
  - the length of the document (in terms);
  - the average document length in the corpus (in terms);
  - the total number of documents in the corpus;
  - the sum of term frequencies for the term across all documents.
- A static string distance function that takes two strings and calculates a distance value between them (within a 0-1 range).

The DPH score for a (query, document) pair is the average of the DPH scores for each (term, document) pair, i.e. one per term in the query. When designing your solution, you should primarily be thinking about how you can efficiently calculate the statistics needed to score each document for each query using DPH.
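The averaging of per-term DPH scores into a single query-document score can be sketched in plain Java. Note the method names, the signature, and the inline DPH formula below are illustrative assumptions only; the template project ships its own static scorer, which you should use instead:

```java
import java.util.List;

public class DPHQueryScore {

    // Stand-in for the template's static DPH function (name and signature are
    // assumptions). The formula shown follows the parameter-free DPH model
    // from the Divergence From Randomness framework, for illustration only.
    static double dphScore(int termFrequencyInDoc, int docLength,
                           double avgDocLength, long numDocs,
                           int totalTermFrequencyInCorpus) {
        if (termFrequencyInDoc == 0) return 0.0; // absent terms contribute nothing
        double f = (double) termFrequencyInDoc / docLength;
        double norm = Math.pow(1.0 - f, 2) / (termFrequencyInDoc + 1.0);
        return norm * (termFrequencyInDoc
                        * log2((termFrequencyInDoc * avgDocLength / docLength)
                               * ((double) numDocs / totalTermFrequencyInCorpus))
                + 0.5 * log2(2.0 * Math.PI * termFrequencyInDoc * (1.0 - f)));
    }

    static double log2(double x) { return Math.log(x) / Math.log(2.0); }

    // The query-document score is the average of the per-term DPH scores,
    // taken over ALL query terms (terms with a zero count still divide the sum).
    static double scoreQueryDocument(List<Integer> termCountsInDoc, int docLength,
                                     double avgDocLength, long numDocs,
                                     List<Integer> corpusTermCounts) {
        double sum = 0.0;
        for (int i = 0; i < termCountsInDoc.size(); i++) {
            sum += dphScore(termCountsInDoc.get(i), docLength,
                            avgDocLength, numDocs, corpusTermCounts.get(i));
        }
        return sum / termCountsInDoc.size();
    }
}
```

Because the average is over every term of the query, a document matching only some query terms is penalised relative to one matching them all.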
Step by Step Solution
There are three steps involved:
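One plausible three-step decomposition is: (1) pre-process the documents and queries and gather the corpus statistics DPH needs, (2) score and rank each document per query, (3) filter near-duplicate titles from each ranking. The statistics from step 1 can be sketched in plain Java for clarity; in the actual job they would be computed with Spark transformations (e.g. a flatMap over tokenised documents followed by a reduce), and all names here are assumptions:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CorpusStatistics {

    // Corpus-wide statistics required by the DPH scoring function.
    final long numDocs;
    final double avgDocLength;
    final Map<String, Integer> corpusTermFrequencies; // sum of tf across all docs

    CorpusStatistics(List<List<String>> tokenisedDocs) {
        numDocs = tokenisedDocs.size();
        long totalTerms = 0;
        corpusTermFrequencies = new HashMap<>();
        for (List<String> doc : tokenisedDocs) {
            totalTerms += doc.size(); // document length in terms
            for (String term : doc) {
                corpusTermFrequencies.merge(term, 1, Integer::sum);
            }
        }
        avgDocLength = numDocs == 0 ? 0.0 : (double) totalTerms / numDocs;
    }
}
```

Computing these once and broadcasting them to the executors avoids re-deriving them for every (query, document) pair.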
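The final redundancy-filtering stage described in the question can be sketched in plain Java over an already-scored ranking. The distance function here is a stand-in for the one the template provides, and all class and method names are assumptions; note how the loop keeps scanning past discarded near-duplicates so that 10 documents are still returned:

```java
import java.util.ArrayList;
import java.util.List;

public class RedundancyFilter {

    // Stand-in for the template's static string-distance function,
    // which returns a value in the [0, 1] range.
    interface TitleDistance { double distance(String a, String b); }

    // A document title paired with its DPH score for some query.
    static final class ScoredDoc {
        final String title;
        final double score;
        ScoredDoc(String title, double score) { this.title = title; this.score = score; }
    }

    // Walk the ranking in descending score order; drop any document whose title
    // is within distance 0.5 of an already-kept (more relevant) one, and keep
    // going until 10 documents survive or the candidates run out.
    static List<ScoredDoc> filterTopTen(List<ScoredDoc> ranked, TitleDistance dist) {
        List<ScoredDoc> sorted = new ArrayList<>(ranked);
        sorted.sort((a, b) -> Double.compare(b.score, a.score)); // most relevant first
        List<ScoredDoc> kept = new ArrayList<>();
        for (ScoredDoc candidate : sorted) {
            boolean nearDuplicate = false;
            for (ScoredDoc k : kept) {
                if (dist.distance(candidate.title, k.title) < 0.5) {
                    nearDuplicate = true;
                    break;
                }
            }
            if (!nearDuplicate) kept.add(candidate);
            if (kept.size() == 10) break;
        }
        return kept;
    }
}
```

Because the candidates are visited best-first, whenever two titles are near-duplicates the one with the higher DPH score is always the one retained, matching the requirement in the question.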
