1. Random Sampling a data stream - 75 points

The Problem Statement:
Your data is a stream of items of unknown length that we can only iterate over once. The data is also expected to be very large, so while you are developing your program, you'd like to work on a statistically accurate, representative sample. You need to implement an algorithm that randomly chooses an item from the data stream such that each item is equally likely to be selected.

The Algorithm:
The algorithm for this problem is the Reservoir Sampling algorithm: http://en.wikipedia.org/wiki/Reservoir_sampling. A good, simpler explanation and a Python-only implementation are shown here: https://towardsdatascience.com/the-5-sampling-algorithms-every-data-scientist-need-to-know-43c7bc11d17c (if you hit a paywall for this link, retry it in an incognito window).

The Data: data_Q1_2019.zip
This dataset contains the actual daily SMART logs for all hard drives used in a data center during the first quarter of 2019. Note that over the course of the three months, some drives will fail and new ones will come into use. There are over 900k entries in this dataset. SMART: https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology

TODO:
Create a random subset of 50k entries of this data using Reservoir Sampling (there are about 900k entries). Sample size k = 50,000 = 50k.
1. Implement Reservoir Sampling in Hadoop MapReduce - 50 points
2. Implement Reservoir Sampling in Spark - 25 points
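As a starting point, here is a minimal plain-Python sketch of the sampling logic. It is not Hadoop or Spark code. The first function is the classic single-pass reservoir (Algorithm R). The other two show a different but equivalent technique that tends to fit a mapper/reducer split more naturally: give every record an independent uniform random tag and keep the k records with the smallest tags. All function names (reservoir_sample, tagged_sample, merge_tagged) are hypothetical and chosen only for illustration.

```python
import heapq
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Single-pass uniform sample of up to k items (Algorithm R)."""
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def tagged_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[Tuple[float, T]]:
    """Per-partition step: keep the k items with the smallest random tags."""
    # heapq is a min-heap, so tags are stored negated: the root then holds
    # the largest tag currently kept, which is the one to evict.
    heap: list = []
    for seq, item in enumerate(stream):
        tag = rng.random()
        entry = (-tag, seq, item)  # seq breaks ties so items are never compared
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif -tag > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [(-neg_tag, item) for neg_tag, _, item in heap]


def merge_tagged(partials: Iterable[List[Tuple[float, T]]], k: int) -> List[T]:
    """Merge step: keep the k smallest tags across all partial results."""
    pairs = [pair for part in partials for pair in part]
    return [item for _, item in heapq.nsmallest(k, pairs, key=lambda p: p[0])]


if __name__ == "__main__":
    rng = random.Random(42)
    print(len(reservoir_sample(range(900_000), 50_000, rng)))
```

Under these assumptions, each mapper (or Spark partition) would run something like tagged_sample on its own split, and a single reducer would run merge_tagged with k = 50,000. Because the tags are independent and uniform, the k smallest tags overall identify a uniform sample without replacement, and any record in the global top k must also be in its partition's top k.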
