1. Random Sampling of a Data Stream - 75 points

The Problem Statement: Your data is a stream of items of unknown length that we can only iterate over once. The data is also expected to be very large, so while you are developing your program, you'd like to work on a statistically accurate, representative sample. You need to implement an algorithm that randomly chooses items from the data stream such that each item is equally likely to be selected.

The Algorithm: The algorithm for this problem is the Reservoir Sampling algorithm: http://en.wikipedia.org/wiki/Reservoir_sampling. A good, simpler explanation and a Python-only implementation are shown here: https://towardsdatascience.com/the-5-sampling-algorithms-every-data-scientist-need-to-know-43c7bc11d17c (if you hit a paywall for this link, retry it in an incognito window).

The Data: data_Q1_2019.zip. This dataset contains the actual daily SMART logs for all hard drives used in a data center during the first quarter of 2019. Note that over the course of the three months, some drives will fail and new ones will come into use. There are over 900k entries in this dataset. SMART: https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology

TODO: Create a random subset of 50k entries of this data using Reservoir Sampling (there are about 900k entries). Sample size k = 50,000 = 50k.

1. Implement Reservoir Sampling in Hadoop MapReduce - 50 points
2. Implement Reservoir Sampling in Spark - 25 points
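As a starting point before moving to Hadoop or Spark, the single-pass idea can be sketched in plain Python. This is a minimal sketch of the basic reservoir sampling scheme (often called Algorithm R), not the expected assignment solution. The function name `reservoir_sample`, the `seed` parameter, and the use of `range(900_000)` as a stand-in for the SMART log records are assumptions made for illustration only.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a one-pass stream.

    Hypothetical helper for illustration; the stream length does not need
    to be known in advance.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Stand-in for the ~900k log entries; a real run would iterate over file lines.
    sample = reservoir_sample(range(900_000), k=50_000, seed=42)
    print(len(sample))
```

Only the reservoir of k items is held in memory, which is what makes the approach usable on a stream that is too large to load. This sketch is sequential; a MapReduce or Spark version would additionally need a way to combine samples taken on separate partitions, which is not shown here.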
