In 1936, Literary Digest magazine made completely wrong predictions about the Presidential election. They did this because they polled from a bad sample: automobile registration lists, telephone directories, and their own subscriber rolls. Enough people who didn’t own cars or telephones or subscribe to Literary Digest voted, and they voted for Roosevelt. The magazine’s editors’ faces were red, and they had the humility to say so on the cover.
This year, the 538 website made completely wrong predictions about the Presidential election, and its editor, Nate Silver, sorta kinda took responsibility. He had put too much trust in polls conducted at the state level. They were not representative of the full spectrum of voter opinion in those states, and this had skewed his predictions.
Silver’s face should be redder, because although he called his conclusions tentative, he did not act that way. When your results are so unreliable and your data so problematic, you have no business appearing on television and in front-page news articles as often as Silver did.
In part this attitude of Silver’s comes from the worldview of sports betting, where the gamblers know they want to bet and the only question is which team they should put their money on. There is some hedging, but not much. Democracy is not a gamble, and people need to be prepared for all outcomes.
But the practice of blithely making grandiose claims based on unrepresentative data, while mouthing insincere disclaimers, goes far beyond election polling. It is widespread in the social sciences, and I see it all the time in linguistics and transgender studies. It is pervasive in the relatively new field of Data Science, and Big Data is frequently Non-representative Data.
At the 2005 meeting of the American Association for Corpus Linguistics there were two interactions that stuck with me and have informed my thinking ever since. The first was a plenary talk by the computer scientist Ken Church. He described in vivid terms the coming era of cheap storage and bandwidth, and the resulting big data boom.
But Church went awry when he claimed that the size of the datasets available, and the computational power to analyze them, would obviate the need for representative samples. It is true that if you can analyze everything you do not need a sample. But that’s not the whole story.
A day before Church’s talk I had had a conversation over lunch with David Lee, who had just written his dissertation on the sampling problems in the British National Corpus. Lee had reiterated what I had learned in statistics class: if you have most of the data, but your data is incomplete in non-random ways, you have a biased sample and you cannot make generalizations about the whole.
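The point Lee was making is easy to demonstrate with a quick simulation. The numbers below are purely illustrative, not drawn from any real poll: imagine a population where 55% support some candidate, but supporters are less likely to end up in your dataset. Even a "big data" sample covering more than half the population then misses badly, while a small random sample lands close to the truth.

```python
import random

random.seed(42)

# Hypothetical population of 100,000 voters; 55% support the candidate (1).
population = [1] * 55_000 + [0] * 45_000
random.shuffle(population)

# Huge but non-random sample: suppose supporters are only half as likely
# to appear in our data (they don't subscribe, don't answer, etc.).
biased_sample = [v for v in population
                 if random.random() < (0.4 if v == 1 else 0.8)]

# Small simple random sample of 1,000 voters.
random_sample = random.sample(population, 1_000)

true_share = sum(population) / len(population)
biased_share = sum(biased_sample) / len(biased_sample)
random_share = sum(random_sample) / len(random_sample)

print(f"true support:                      {true_share:.3f}")
print(f"biased sample of {len(biased_sample)}:  {biased_share:.3f}")
print(f"random sample of {len(random_sample)}:  {random_share:.3f}")
```

The biased sample is tens of thousands of records, yet it puts support well under 50% when the true figure is 55%; the random sample of a thousand comes in near the truth. Size does not cure bias.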
I’ve seen this a lot in the burgeoning field of Data Science. There are too many people performing analyses they don’t understand on data that’s not representative, making unfounded generalizations. As long as these generalizations fit within the accepted narratives, nobody looks twice.
We need to stop making it easier to run through the steps of data analysis, and instead make it easier to get those steps right. Especially sampling. Or our faces are going to be red all the time.