Question: Spark Code in Java
You are to develop a batch-based text search and filtering pipeline in Apache Spark. The core goal of this pipeline is to take in a large set of text documents and a set of user-defined queries, then, for each query, rank the text documents by relevance and filter out any overly similar documents in the final ranking. The top 10 documents for each query should be returned as output.

Each document and query should be processed to remove stopwords (words with little discriminative value, e.g. "the") and apply stemming (which converts each word into its stem, a shorter form that helps with term mismatch between documents and queries). Documents should be scored using the DPH ranking model. As a final stage, the ranking of documents for each query should be analysed to remove unneeded redundancy (near-duplicate documents): if any pair of documents is found where their titles have a textual distance (using a comparison function provided) of less than 0.5, you should keep only the more relevant of the two (based on the DPH score). Note that 10 documents should still be returned for each query, even after redundancy filtering.

You will be provided with a Java template project like the tutorials already provided. Your role is to implement the necessary Spark functions to get from a Dataset (the input documents) and a Dataset (the queries to rank for) to a List (a ranking of 10 documents for each query). Your solution should only include Spark transformations and actions, apart from any final processing you choose to do within the driver program. You should not perform any offline computation (e.g. pre-constructing a search index); all processing should happen during the lifecycle of the Spark app.

The template project provides implementations of the following to help you:

- Loading of the query set and converting it to a Dataset.
- Loading of the news articles and converting it to a Dataset.
- A static text pre-processor function that converts a piece of text to its tokenised, stopword-removed and stemmed form. It takes in a String (the input text) and outputs a List (the remaining terms from the input after tokenisation, stemming and stopword removal).
- A static DPH scoring function that calculates a score for a (term, document) pair given the following information:
  - the term frequency (count) of the term in the document;
  - the length of the document (in terms);
  - the average document length in the corpus (in terms);
  - the total number of documents in the corpus;
  - the sum of term frequencies for the term across all documents.
- A static string distance function that takes two strings and calculates a distance value between them (within a 0-1 range).

The DPH score for a (query, document) pair is the average of the DPH scores for each (term, document) pair, i.e. one per term in the query. When designing your solution, you should primarily be thinking about how you can efficiently calculate the statistics needed to score each document for each query using DPH.
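The averaging of per-term DPH scores into a single query-document score can be sketched in plain Java. Note the method names, the signature, and the inline DPH formula below are illustrative assumptions only; the template project ships its own static scorer, which you should use instead:

```java
import java.util.List;

public class DPHQueryScore {

    // Stand-in for the template's static DPH function (name and signature are
    // assumptions). The formula shown follows the parameter-free DPH model
    // from the Divergence From Randomness framework, for illustration only.
    static double dphScore(int termFrequencyInDoc, int docLength,
                           double avgDocLength, long numDocs,
                           int totalTermFrequencyInCorpus) {
        if (termFrequencyInDoc == 0) return 0.0; // absent terms contribute nothing
        double f = (double) termFrequencyInDoc / docLength;
        double norm = Math.pow(1.0 - f, 2) / (termFrequencyInDoc + 1.0);
        return norm * (termFrequencyInDoc
                        * log2((termFrequencyInDoc * avgDocLength / docLength)
                               * ((double) numDocs / totalTermFrequencyInCorpus))
                + 0.5 * log2(2.0 * Math.PI * termFrequencyInDoc * (1.0 - f)));
    }

    static double log2(double x) { return Math.log(x) / Math.log(2.0); }

    // The query-document score is the average of the per-term DPH scores,
    // taken over ALL query terms (terms with a zero count still divide the sum).
    static double scoreQueryDocument(List<Integer> termCountsInDoc, int docLength,
                                     double avgDocLength, long numDocs,
                                     List<Integer> corpusTermCounts) {
        double sum = 0.0;
        for (int i = 0; i < termCountsInDoc.size(); i++) {
            sum += dphScore(termCountsInDoc.get(i), docLength,
                            avgDocLength, numDocs, corpusTermCounts.get(i));
        }
        return sum / termCountsInDoc.size();
    }
}
```

Because the average is over every term of the query, a document matching only some query terms is penalised relative to one matching them all.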
Step by Step Solution
There are three steps involved:
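One plausible three-step decomposition is: (1) pre-process the documents and queries and gather the corpus statistics DPH needs, (2) score and rank each document per query, (3) filter near-duplicate titles from each ranking. The statistics from step 1 can be sketched in plain Java for clarity; in the actual job they would be computed with Spark transformations (e.g. a flatMap over tokenised documents followed by a reduce), and all names here are assumptions:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CorpusStatistics {

    // Corpus-wide statistics required by the DPH scoring function.
    final long numDocs;
    final double avgDocLength;
    final Map<String, Integer> corpusTermFrequencies; // sum of tf across all docs

    CorpusStatistics(List<List<String>> tokenisedDocs) {
        numDocs = tokenisedDocs.size();
        long totalTerms = 0;
        corpusTermFrequencies = new HashMap<>();
        for (List<String> doc : tokenisedDocs) {
            totalTerms += doc.size(); // document length in terms
            for (String term : doc) {
                corpusTermFrequencies.merge(term, 1, Integer::sum);
            }
        }
        avgDocLength = numDocs == 0 ? 0.0 : (double) totalTerms / numDocs;
    }
}
```

Computing these once and broadcasting them to the executors avoids re-deriving them for every (query, document) pair.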
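The final redundancy-filtering stage described in the question can be sketched in plain Java over an already-scored ranking. The distance function here is a stand-in for the one the template provides, and all class and method names are assumptions; note how the loop keeps scanning past discarded near-duplicates so that 10 documents are still returned:

```java
import java.util.ArrayList;
import java.util.List;

public class RedundancyFilter {

    // Stand-in for the template's static string-distance function,
    // which returns a value in the [0, 1] range.
    interface TitleDistance { double distance(String a, String b); }

    // A document title paired with its DPH score for some query.
    static final class ScoredDoc {
        final String title;
        final double score;
        ScoredDoc(String title, double score) { this.title = title; this.score = score; }
    }

    // Walk the ranking in descending score order; drop any document whose title
    // is within distance 0.5 of an already-kept (more relevant) one, and keep
    // going until 10 documents survive or the candidates run out.
    static List<ScoredDoc> filterTopTen(List<ScoredDoc> ranked, TitleDistance dist) {
        List<ScoredDoc> sorted = new ArrayList<>(ranked);
        sorted.sort((a, b) -> Double.compare(b.score, a.score)); // most relevant first
        List<ScoredDoc> kept = new ArrayList<>();
        for (ScoredDoc candidate : sorted) {
            boolean nearDuplicate = false;
            for (ScoredDoc k : kept) {
                if (dist.distance(candidate.title, k.title) < 0.5) {
                    nearDuplicate = true;
                    break;
                }
            }
            if (!nearDuplicate) kept.add(candidate);
            if (kept.size() == 10) break;
        }
        return kept;
    }
}
```

Because the candidates are visited best-first, whenever two titles are near-duplicates the one with the higher DPH score is always the one retained, matching the requirement in the question.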
