Question:
Using all the data need not be an enormous task. Big data is not necessarily big in absolute terms, although often it is. Google Flu Trends tunes its predictions on hundreds of millions of mathematical modeling exercises using billions of data points. The full sequence of a human genome amounts to three billion base pairs. But the absolute number of data points alone, the size of the dataset, is not what makes these examples of big data. What classifies them as big data is that instead of using the shortcut of a random sample, both Flu Trends and Steve Jobs's doctors used as much of the entire dataset as feasible.

The discovery of match fixing in Japan's national sport, sumo wrestling, is a good illustration of why using N=all need not mean big. Thrown matches have been a constant accusation bedeviling the sport of emperors, and have always been vigorously denied. Steven Levitt, an economist at the University of Chicago, looked for corruption in the records of more than a decade of past matches: all of them. In a delightful research paper published in the American Economic Review and reprised in the book Freakonomics, he and a colleague described the usefulness of examining so much data. They analyzed 11 years' worth of sumo bouts, more than 64,000 wrestler-matches, to hunt for anomalies. And they struck gold.

Match fixing did indeed take place, but not where most people suspected. Rather than in championship bouts, which may or may not be rigged, the data showed that something funny was happening during the unnoticed end-of-tournament matches. Seemingly little is at stake there, since the wrestlers have no chance of winning a title. But one peculiarity of sumo is that wrestlers need a majority of wins at the 15-match tournaments in order to retain their rank and income. This sometimes leads to asymmetries of interest, as when a wrestler with a 7-7 record faces an opponent at 8-6 or better. The outcome means a great deal to the first wrestler and next to nothing to the second. In such cases, the number-crunching uncovered, the wrestler who needs the victory is very likely to win.

Might the fellows who need the win be fighting more resolutely? Perhaps. But the data suggested that something else is happening as well. The wrestlers with more at stake win about 25 percent more often than normal. It's hard to attribute that large a discrepancy to adrenaline alone. And when the data was parsed further, it showed that the very next time the same two wrestlers met, the loser of the previous bout was much more likely to win than when they sparred in later matches. The first victory, then, appears to be a gift from one competitor to the other, since what goes around comes around in the tight-knit world of sumo.

This information was always apparent. It existed in plain sight. But random sampling of the bouts might have failed to reveal it. Even though the analysis relied on basic statistics, without knowing what to look for one would have had no idea what sample to use. In contrast, Levitt and his colleague uncovered it by using a far larger set of data, striving to examine the entire universe of matches. An investigation using big data is almost like a fishing expedition: it is unclear at the outset not only whether one will catch anything but what one may catch. The dataset need not span terabytes. In the sumo case, the entire dataset contained fewer bits than a typical digital photo does these days. But as big-data analysis, it looked at more than a typical random sample.
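To make the "bubble match" arithmetic concrete, here is a minimal sketch of the kind of full-dataset check the passage describes. The table layout, column names, and toy records below are invented for illustration; they are not the actual data or code from the study.

```python
import pandas as pd

# Each row is one final-day bout; a_wins / b_wins are each wrestler's
# wins *entering* the bout of a 15-match tournament. All values here
# are made up for illustration.
bouts = pd.DataFrame([
    {"a": "W01", "b": "W02", "a_wins": 7, "b_wins": 8, "winner": "a"},
    {"a": "W03", "b": "W04", "a_wins": 7, "b_wins": 9, "winner": "a"},
    {"a": "W05", "b": "W06", "a_wins": 7, "b_wins": 8, "winner": "b"},
    {"a": "W07", "b": "W08", "a_wins": 7, "b_wins": 8, "winner": "a"},
    {"a": "W09", "b": "W10", "a_wins": 6, "b_wins": 6, "winner": "b"},
    # ...the real analysis ran over all ~64,000 wrestler-matches
])

# "Bubble" bouts: wrestler a sits at 7-7 and needs one more win to keep
# rank and income, while wrestler b already has a winning record.
bubble = bouts[(bouts["a_wins"] == 7) & (bouts["b_wins"] >= 8)]

rate = (bubble["winner"] == "a").mean()
print(f"7-7 wrestler wins {rate:.0%} of {len(bubble)} bubble bouts "
      "(roughly 50% would be unremarkable)")
```

Run over every recorded bout rather than a sample, a lopsided win rate in exactly this slice is the anomaly the passage describes; a random sample would rarely contain enough 7-7-versus-8-6 pairings to surface it.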
When we talk about big data, we mean big less in absolute than in relative terms: relative to the comprehensive set of data. For a long time, random sampling was a good shortcut. It made analysis of large data problems possible in the pre-digital era. But much as when converting a digital image or song into a smaller file, information is lost when sampling. Having the full (or close to the full) dataset provides a lot more freedom to explore, to look at the data from different angles, or to look closer at certain aspects of it.

A fitting analogy may be the Lytro camera, which captures not just a single plane of light, as conventional cameras do, but rays from the entire light field, some 11 million of them. The photographer can decide later which element of an image to focus on in the digital file. There is no need to focus at the outset, since collecting all the information makes it possible to do that afterwards. Because rays from the entire light field are included, it is closer to all the data. As a result, the information is more reusable than an ordinary picture, where the photographer has to decide what to focus on before she presses the shutter.

Similarly, because big data relies on all the information, or at least as much of it as possible, it allows us to look at details or explore new analyses without the risk of blurriness. We can test new hypotheses at many levels of granularity. This quality is what lets us see match fixing in sumo wrestling, track the spread of the flu virus by region, and fight cancer by targeting a precise portion of the patient's DNA. It allows us to work at an amazing level of clarity.

To be sure, using all the data instead of a sample isn't always necessary. We still live in a resource-constrained world. But in an increasing number of cases using all the data at hand does make sense, and doing so is feasible now where before it was not.

One of the areas being most dramatically shaken up by N=all is the social sciences. They have lost their monopoly on making sense of empirical social data, as big-data analysis replaces the highly skilled survey specialists of the past. The social science disciplines largely relied on sampling studies and questionnaires. But when the data is collected passively while people do what they normally do anyway, the old biases associated with sampling and questionnaires disappear. We can now collect information that we couldn't before, be it relationships revealed via mobile phone calls or sentiments unveiled through tweets. More important, the need to sample disappears.

Albert-László Barabási, one of the world's foremost authorities on the science of network theory, wanted to study interactions among people at the scale of the entire population. So he and his colleagues examined anonymous logs of mobile phone calls from a wireless operator that served about one-fifth of an unidentified European country's population: all the logs for a four-month period. It was the first network analysis on a societal level, using a dataset that was in the spirit of N=all. Working at such a large scale, looking at all the calls among millions of people over time, produced novel insights that probably couldn't have been revealed in any other way. Intriguingly, in contrast to smaller studies, the team discovered that if one removes people from the network who have many links within their community, the remaining social network degrades but doesn't fail.
When, on the other hand, people with links outside their immediate community are taken off the network, the social net suddenly disintegrates, as if its structure had buckled. It was an important but somewhat unexpected result. Who would have thought that the people with lots of close friends are far less important to the stability of the network structure than the ones who have ties to more distant people? It suggests that there is a premium on diversity within a group and in society at large. (A toy simulation of this contrast appears after this excerpt.)

We tend to think of statistical sampling as some sort of immutable bedrock, like the principles of geometry or the laws of gravity. But the concept is less than a century old, and it was developed to solve a particular problem at a particular moment in time under specific technological constraints. Those constraints no longer exist to the same extent. Reaching for a random sample in the age of big data is like clutching at a horse whip in the era of the motor car. We can still use sampling in certain contexts, but it need not, and will not, be the predominant way we analyze large datasets. Increasingly, we will aim to go for it all.
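Below is a hedged toy simulation of the robustness contrast described above; it is not the study's actual method or data. The community sizes, edge probability, and the ring of "bridge" ties are all invented assumptions, chosen only to show why removing nodes whose ties reach outside their community can shatter a network that survives losing its best-connected insiders.

```python
import random
import networkx as nx

random.seed(0)
N_COMM, SIZE, P_IN = 10, 30, 0.25   # invented parameters, for illustration only

g = nx.Graph()
bridges = []                        # one designated "outward-tie" node per community
for c in range(N_COMM):
    nodes = range(c * SIZE, (c + 1) * SIZE)
    g.add_nodes_from(nodes)
    for i in nodes:                 # dense random ties *within* each community
        for j in nodes:
            if i < j and random.random() < P_IN:
                g.add_edge(i, j)
    bridges.append(c * SIZE)
for c in range(N_COMM):             # sparse ties *between* communities form a ring
    g.add_edge(bridges[c], bridges[(c + 1) % N_COMM])

def biggest_piece(h):
    """Fraction of remaining nodes in the largest connected component."""
    return max(len(comp) for comp in nx.connected_components(h)) / h.number_of_nodes()

# (a) remove the best-connected *within-community* member of each community
hubs = []
for c in range(N_COMM):
    members = [n for n in range(c * SIZE, (c + 1) * SIZE) if n not in bridges]
    hubs.append(max(members, key=g.degree))
ga = g.copy(); ga.remove_nodes_from(hubs)

# (b) remove the few nodes whose ties reach *outside* their community
gb = g.copy(); gb.remove_nodes_from(bridges)

print(f"after removing hubs:    {biggest_piece(ga):.0%} of nodes still connected")
print(f"after removing bridges: {biggest_piece(gb):.0%} of nodes still connected")
```

The point of the toy is only the shape of the result: the network tolerates losing well-connected insiders but buckles into isolated communities when the few outward ties go.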
Will you help me explain what these paragraphs are talking about? What is the issue? Please specify what's happening. Thank you!
