Question: Implement an efficient data layout and retrieval strategy for a Hadoop cluster

Big Data Systems, Assignment 2

Overview & background: A multinational financial services company has a large volume of financial transaction data generated by its branches and online services. The data is generated in real time and is too large to be processed and analyzed using traditional methods. The company needs a scalable and flexible big data solution that can handle the volume, velocity and variety of the data. It wants to use big data technologies to store, process and analyze the data in order to identify trends, detect fraud and make informed business decisions.

The company has decided to use a Hadoop cluster with HDFS as its storage system and MapReduce for processing and analysis. HDFS provides a cost-effective way to store and manage large amounts of data, and MapReduce provides the processing and analysis capabilities needed to extract insights from it.

Input: CSV data with a flat schema, containing multiple records and features. The link is given on the main page.

Description:

1. STORAGE: Each storage node stores data according to the conditions below.
a. Mutually exclusive feature data (a column value) that is not common across records (rows): private node
b. Feature data common to two records: 2-way shared node
c. Feature data common to four records: 4-way shared node
d. Feature data common to eight records: 8-way shared node
Note: The private node and the 2-, 4- and 8-way shared nodes are all storage nodes; the shared nodes store feature values that are common to 2, 4 and 8 records respectively.

2. METADATA: Maintain record-ID-wise metadata about the above storage deployment, describing how the feature values of each record are stored across the storage nodes. The metadata can be stored on a specific node.

3. RETRIEVAL: For a provided record ID, retrieval refers to the metadata from step 2 to fetch all the required features (column values) from the respective storage nodes and re-form the original record.

NOTE: You can apply different techniques to assess the similarity of feature values, such as normalization, standardization or vectorization.
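
The three steps above (storage, metadata, retrieval) can be sketched in plain Python to make the intended data flow concrete. This is only an in-memory illustration under stated assumptions, not an HDFS or MapReduce implementation: each "node" is a dictionary standing in for a storage location, the function names (`choose_node`, `build_layout`, `retrieve`) and the column names in the sample CSV are hypothetical, and sharing is decided by exact value matches within a column. The assignment does not say what to do with a value shared by, say, three records, so the sketch assumes it goes to the largest tier that does not exceed the count (here the 2-way node).

```python
import csv
import io
from collections import Counter

TIERS = (8, 4, 2)  # assumed sharing tiers, checked largest first


def choose_node(count):
    """Map the number of records sharing a value to a node name (assumed rule)."""
    for tier in TIERS:
        if count >= tier:
            return f"shared_{tier}"
    return "private"


def build_layout(rows, id_field):
    """Place every feature value on a node and record where it went."""
    features = [f for f in rows[0] if f != id_field]
    # Per feature, count how many records carry each value.
    counts = {f: Counter(r[f] for r in rows) for f in features}

    nodes = {"private": {}, "shared_2": {}, "shared_4": {}, "shared_8": {}}
    metadata = {}  # record ID -> {feature: (node name, key on that node)}
    for row in rows:
        rid = row[id_field]
        metadata[rid] = {}
        for f in features:
            node = choose_node(counts[f][row[f]])
            # Shared values are stored once per (feature, value);
            # private values are keyed by the owning record.
            key = (rid, f) if node == "private" else (f, row[f])
            nodes[node][key] = row[f]
            metadata[rid][f] = (node, key)
    return nodes, metadata


def retrieve(rid, id_field, nodes, metadata):
    """Rebuild one record by following its metadata entries."""
    record = {id_field: rid}
    for f, (node, key) in metadata[rid].items():
        record[f] = nodes[node][key]
    return record


# Hypothetical sample data; the real input is the CSV linked from the main page.
SAMPLE = """txn_id,branch,currency,amount
T1,NY,USD,100
T2,NY,USD,250
T3,LDN,USD,75
T4,TOK,USD,310
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
nodes, metadata = build_layout(rows, "txn_id")
print(metadata["T1"])
print(retrieve("T1", "txn_id", nodes, metadata))
```

In this sample, `USD` is stored once on the 4-way shared node, `NY` once on the 2-way shared node, and every amount on the private node. On a real cluster each dictionary would presumably map to its own HDFS location, with the metadata kept on a designated node as the assignment describes; exact matching could also be replaced by one of the similarity techniques mentioned in the note.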
