Data science and data technology

The big buzz over the past few years has been Data Science. Corporations are opening Data Science departments and staffing them with PhDs, and universities have started Data Science programs to sell credentials for these jobs. As a linguist I’m particularly interested in this new field, because it includes research practices that I’ve been using for years, like corpus linguistics and natural language processing.

As a scientist I’m a bit skeptical of this field, because frankly I don’t see much science. Sure, the practitioners have labs and cool gadgets. But I rarely see anyone asking hard questions, doing careful observations, creating theories, formulating hypotheses, testing the hypotheses and examining the results.

The lack of careful observation and skeptical questioning is what really bothers me, because that’s what’s at the core of science. Don’t get me wrong: there are plenty of people in Data Science doing both. But these practices should permeate a field with this name, and they don’t.

If there’s so little science, why do we call it “science”? A glance through some of the uses of the term in the Google Books archive suggests that when it was first used in the late twentieth century it did include hypothesis testing. In the early 2000s people began to use it as a synonym for “big data,” and I can understand why. “Big data” was a well-known buzzword associated with Silicon Valley tech hype.

I totally get why people replaced “big data” with “data science.” I’ve spent years doing science (with observations, theories, hypothesis testing, etc.). Occasionally I’ve been paid for doing science or teaching it, but only part time. Even after getting a PhD I had to conclude that science jobs that pay a living wage are scarce and in high demand, and I was probably not going to get one.

It was kind of exciting when I got a job with Scientist in the title. It helped to impress people at parties. At first it felt like a validation of all the time I spent learning how to do science. So I completely understand why people prefer to say they’re doing “data science” instead of “big data.”

The problem with being called a Scientist in that job was that I wasn’t working on experiments. I was just helping people optimize their tools. Those tools could possibly be used for science, but that was not why we were being paid to develop them. We have a word for a practice involving labs and gadgets, without requiring any observation or skepticism. That word is not science, it’s technology.

Technology is perfectly respectable; it’s what I do all day. For many years I’ve been well paid to maintain and expand the technology that sustains banks, lawyers, real estate agents, bakeries and universities. I’m currently building tools that help instructors at Columbia University with things like memorizing the names of their students and sending them emails. It’s okay to do technology. People love it.

If you really want to do science and you’re not one of the lucky ones, you can do what I do: I found a technology job that doesn’t demand all my time. Once in a while they need me to stay late or work on a weekend, but the vast majority of my time outside of 9-5 is mine. I spend a lot of that time taking care of my family and myself, and relaxing with friends. But I have time to do science.

I just have to outrun your theory

The Problem

You’ve probably heard the joke about the two people camping in the woods who encounter a hungry predator. One person stops to put on running shoes. The other says, “Why are you wasting time? Even with running shoes you’re not going to outrun that animal!” The other replies, “I don’t have to outrun the animal, I just have to outrun you.”

For me this joke highlights a problem with the way some people argue about climate change. First of all, spreading uncertainty and doubt against competitors is a common marketing tactic, and as Naomi Oreskes and Erik Conway documented in their book Merchants of Doubt, that same tactic has been used by marketers against concerns about smoking, DDT, acid rain and most recently climate change.

In the case of climate change, as with fundamentalist criticisms of evolution, there is a lot of stress on the idea that the climatic models are “only a theory,” and that they leave room for the possibility of error. The whole idea is to deter a certain number of influential people from taking action.

That Bret Stephens Column

The latest example is Bret Stephens, newly hired as an opinion columnist by New York Times editors who should really have known better. Aside from some factual errors, Stephens’s first column is actually fine on the surface, as far as it goes: never trust anyone who claims to be 100% certain about anything. Most people know this, so if you claim to be 100% certain, you may wind up alienating some potential allies. And he doesn’t go beyond that; I re-read it several times in case I missed anything.

Since all Stephens did was to say those two things, neither of which amounts to an actual critique of climate change or an argument that we should not act, the intensely negative reactions it generated may be a little surprising. But it helps if you look back at Stephens’s history and see that he’s written more or less the same thing over and over again, at the Wall Street Journal and other places.

Many of the responses to Stephens’s column have pointed out that if there’s any serious chance of climate change having the effects that have been predicted, we should do something about it. The logical next step is talking about possible actions. Stephens hasn’t talked about any possible actions in over fifteen years, which is pretty solid evidence of concern trolling: he pretends to be offering constructive criticism while having no interest in actually doing anything constructive. And if you go all the way back to a 2002 column in the Jerusalem Post, you can see that he was much more overtly critical in the past.

Stephens is very careful not to recommend any particular course of action, but sometimes he hints at the potential costs of following recommendations based on the most widely accepted climate change models. Underlying all his columns is the implication that the status quo is just fine: Stephens doesn’t want to do anything to reduce carbon emissions. He wants us to keep mining coal, pumping oil and driving everywhere in single-occupant vehicles.

People are correctly discerning Stephens’s intent: to spread confusion and doubt, disrupting the consensus on climate change and providing cover for greedy polluters and ideologues of happy motoring. But they play into his trap, responding in ways that look repressive, inflexible and intolerant. In other words, Bret Stephens is the Milo Yiannopoulos of climate change.

The weak point of mainstream science

Stephens’s trolling is particularly effective because he exploits a weakness in the way mainstream scientists handle theories. In science, hypotheses are predictions that can be tested and found to be true or false: the hypothesis that you can sail around the world was confirmed when Juan Sebastián Elcano completed Magellan’s expedition.

Many people view scientific theories as similarly either true or false. Those that are true – complete and consistent models of reality – are valid and useful, but those that are false are worthless. For them, Galileo’s measurements of the movements of the planets demonstrated that the heliocentric model of the solar system is true and the model with the earth at the center is false.

In this all-or-nothing view of science, uncertainty is death. If there is any doubt about a theory, it has not been completely proven, and is therefore worthless for predicting the future and guiding us as we decide what to do.

Trolls like Bret Stephens and the Marshall Institute exploit this intolerance of uncertainty by playing up any shred of doubt about climate change. And there are many such doubts, because this is actually the way science is supposed to work: highlighting uncertainty and being cautious about results. Many people respond to them in the most unscientific ways, by downplaying doubts and pointing to the widespread belief in climate change among scientists.

The all-or-nothing approach to theories is actually a betrayal of the scientific method. The caution built into the gathering of scientific evidence was not intended as a recipe for paralysis or preparation for popularity contests. There is a way to use cautious reports and uncertain models as the basis for decisive action.

The instrumental approach

This approach to science is called instrumentalism, and its core principles are simple: theories are never true or false. Instead, they are tools for understanding and prediction. A tool may be more effective than another tool for a specific purpose, but it is not better in any absolute sense.

In an instrumentalist view, when we find fossils that are intermediate between species it does not demonstrate that evolution is true and creation is false. Instead, it demonstrates that evolution is a better predictor of what we will find underground, and produces more satisfying explanations of fossils.

Note that when we evaluate theories from an instrumental perspective, it is always relative to other theories that might also be useful for understanding and predicting the same data. Like the two people running from the wild animal, we are not comparing theories against some absolute standard of truth, but against each other.
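Here is a minimal sketch of what that relative comparison can look like in practice. The observations and the two sets of predictions are invented toy numbers, not measurements of anything, and mean absolute error is just one possible yardstick that I am assuming for the sake of illustration.

```python
# Toy comparison of two theories as prediction tools.
# All numbers are invented for illustration; they are not measurements.

def mean_absolute_error(predictions, observations):
    """Average size of the gap between what a theory predicted and what was observed."""
    return sum(abs(p - o) for p, o in zip(predictions, observations)) / len(observations)

observations = [0.1, 0.3, 0.4, 0.6, 0.9]

# Two hypothetical theories, each reduced to the predictions it made.
theories = {
    "steady rise": [0.1, 0.25, 0.45, 0.65, 0.85],
    "no change": [0.1, 0.1, 0.1, 0.1, 0.1],
}

errors = {name: mean_absolute_error(preds, observations) for name, preds in theories.items()}
best = min(errors, key=errors.get)

for name, err in errors.items():
    print(f"{name}: mean absolute error {err:.2f}")
print(f"More useful tool for these observations: {best}")
```

Nothing in the script asks whether either theory is true. It only asks which one missed by less on the same observations, and a different set of observations or a different purpose could change the answer.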

In climate change, instrumentalism simply says that certain climate models have been better than others at predicting the rising temperature readings and melting glaciers we have seen recently. These models suggest that it is all the driving we’re doing and the dirty power plants we’re running that are causing these rising temperatures, and to reduce the dangers from rising temperatures we need to reconfigure our way of living around walking and reducing our power consumption.

Evaluating theories relative to each other in this way takes all the bite out of Bret Stephens’s favorite weapon. He never makes it explicit, but he does have a theory: that we’re not doing much to raise the temperature of the planet. If we make his theory explicit and evaluate it against the best climate change models, it sucks. It makes no sense of the melting glaciers and rising tides, and has done a horrible job of predicting climate readings.

We can fight against Bret Stephens and his fellow merchants of doubt. But in order to do that, we need to set aside our greatest weakness: the belief that theories can be true, and must be proven true to be the basis for action. We don’t have to outrun Stephens’s uncertainty; we just have to outrun his love of the status quo. And instrumentalism is the pair of running shoes we need to do that.

Is your face red?

In 1936, Literary Digest magazine made completely wrong predictions about the Presidential election. They did this because they polled based on a bad sample: driver’s licenses and subscriptions to their own magazine. Enough people who didn’t drive or subscribe to Literary Digest voted, and they voted for Roosevelt. The magazine’s editors’ faces were red, and they had the humility to put that on the cover.

This year, the 538 website made completely wrong predictions about the Presidential election, and its editor, Nate Silver, sorta kinda took responsibility. He had put too much trust in polls conducted at the state level. They were not representative of the full spectrum of voter opinion in those states, and this had skewed his predictions.

Silver’s face should be redder than that, because he said that his conclusions were tentative, but he did not act like it. When your results are so unreliable and your data is so problematic, you have no business being on television and in front-page news articles as much as Silver has.

In part this attitude of Silver’s comes from the worldview of sports betting, where the gamblers know they want to bet and the only question is which team they should put their money on. There is some hedging, but not much. Democracy is not a gamble, and people need to be prepared for all outcomes.

But the practice of blithely making grandiose claims based on unrepresentative data, while mouthing insincere disclaimers, goes far beyond election polling. It is widespread in the social sciences, and I see it all the time in linguistics and transgender studies. It is pervasive in the relatively new field of Data Science, and Big Data is frequently Non-representative Data.

At the 2005 meeting of the American Association for Corpus Linguistics there were two sets of interactions that stuck with me and have informed my thinking over the years. The first was a plenary talk by the computer scientist Ken Church. He described in vivid terms the coming era of cheap storage and bandwidth, and the resulting big data boom.

But Church went awry when he claimed that the size of the datasets available, and the computational power to analyze them, would obviate the need for representative samples. It is true that if you can analyze everything you do not need a sample. But that’s not the whole story.

A day before Church’s talk I had had a conversation over lunch with David Lee, who had just written his dissertation on the sampling problems in the British National Corpus. Lee had reiterated what I had learned in statistics class: if you simply have most of the data but your data is incomplete in non-random ways, you have a biased sample and you can’t make generalizations about the whole.
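Lee’s point is easy to check with a small simulation. This is only a sketch: the population, the two groups and their rates are hypothetical numbers I made up, and the fixed seed is there so the run is repeatable.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 speakers (invented numbers).
# 1 means the speaker uses the feature being counted, 0 means they do not.
group_a = [1] * 1600 + [0] * 6400   # 8,000 speakers, 20% use the feature
group_b = [1] * 1600 + [0] * 400    # 2,000 speakers, 80% use the feature
population = group_a + group_b
true_rate = sum(population) / len(population)

# "Most of the data": everyone in group A, but only a quarter of group B.
big_biased = group_a + rng.sample(group_b, 500)
biased_rate = sum(big_biased) / len(big_biased)

# A much smaller simple random sample drawn from the whole population.
small_random = rng.sample(population, 500)
random_rate = sum(small_random) / len(small_random)

print(f"true rate:                {true_rate:.3f}")
print(f"85% of the data, biased:  {biased_rate:.3f}")
print(f"5% of the data, random:   {random_rate:.3f}")
```

The biased collection holds 8,500 of the 10,000 speakers and still cannot reach the true rate, because the speakers it is missing are not like the ones it has. The random sample is seventeen times smaller and has no such built-in ceiling.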

I’ve seen this a lot in the burgeoning field of Data Science. There are too many people performing analyses they don’t understand on data that’s not representative, making unfounded generalizations. As long as these generalizations fit within the accepted narratives, nobody looks twice.

We need to stop making it easier to run through the steps of data analysis, and instead make it easier to get those steps right. Especially sampling. Or our faces are going to be red all the time.

Sampling is a labor-saving device

Last month I wrote those words on a slide I was preparing to show to the American Association for Corpus Linguistics, as a part of a presentation of my Digital Parisian Stage Corpus. I was proud of having a truly representative sample of theatrical texts performed in Paris between 1800 and 1815, and thus finding a difference in the use of negation constructions that was not just large but statistically significant. I wanted to convey the importance of this.

I was thinking about Laplace finding the populations of districts “distributed evenly throughout the Empire,” and Student inventing his t-test to help workers at the Guinness plants determine the statistical significance of their results. Laplace was not after accuracy, he was going for speed. Student was similarly looking for the minimum amount of effort required to produce an acceptable level of accuracy. The whole point was to free resources up for the next task.
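To put rough numbers on that trade-off, here is a sketch using the textbook normal-approximation formula for estimating a proportion from a simple random sample. It is not Laplace’s or Student’s own procedure, and the 95% confidence level and worst-case proportion of 0.5 are assumptions I picked for illustration.

```python
import math

def sample_size(margin_of_error, z=1.96, proportion=0.5):
    """Smallest n for which a simple random sample estimates a proportion
    within +/- margin_of_error at roughly 95% confidence (z = 1.96).
    proportion=0.5 is the worst case, so it gives the largest n."""
    return math.ceil(z**2 * proportion * (1 - proportion) / margin_of_error**2)

for margin in (0.05, 0.03):
    print(f"+/-{margin:.0%}: {sample_size(margin)} observations")
```

Under these assumptions, tightening the margin from five points to three nearly triples the number of observations you have to collect. That is the labor the sampling calculation lets you decide not to spend.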

I attended one paper at the conference that gave p-values for all its variables, and they were all 0.000. After that talk, I told the student who presented that those values indicated he had oversampled, and he should have stopped collecting data much sooner. “That’s what my advisor said too,” he said, “but this way we’re likely to get statistical significance for other variables we might want to study.”
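A quick sketch shows why a column of 0.000s points to oversampling. It uses a pooled two-proportion z-test, which I am assuming as a stand-in because I don’t know what test the student actually ran, and the 50% and 55% rates are hypothetical.

```python
import math

def two_proportion_p_value(rate_a, rate_b, n_per_group):
    """Two-sided p-value from a pooled two-proportion z-test with equal group sizes."""
    pooled = (rate_a + rate_b) / 2
    standard_error = math.sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = abs(rate_a - rate_b) / standard_error
    return math.erfc(z / math.sqrt(2))  # two-sided tail of the normal distribution

# Hypothetical rates of 50% and 55%; the gap stays the same while n grows.
for n in (100, 1_000, 100_000):
    p = two_proportion_p_value(0.50, 0.55, n)
    print(f"n = {n:>7} per group: p = {p:.3f}")
```

The difference between the groups never changes, only the amount of data. In this sketch the result is already below 0.05 at a thousand observations per group, so everything collected after that point buys extra zeros rather than new knowledge.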

The student had a point, but it doesn’t seem very – well, “agile” is a word I’ve been hearing a lot lately. In any case, as the conference was wrapping up, it occurred to me that I might have several hours free – on my flight home and before – to work on my research.

My initial impulse was to keep doing what I’ve been doing for the past couple of years: clean up OCRed text and tag it for negation. Then it occurred to me that I really ought to take my own advice. I had achieved statistical significance. That meant it was time to move on!

I have started working on the next chunk of the nineteenth century, from 1816 through 1830. I have also been looking into other variables to examine. I’ve got some ideas, but I’m open to suggestions. Send them if you have them!


At the beginning of June I participated in the Trees Count Data Jam, experimenting with the results of the census of New York City street trees begun by the Parks Department in 2015. I had seen a beta version of the map tool created by the Parks Department’s data team that included images of the trees pulled from the Google Street View database. Those images reminded me of others I had seen in the @everylotnyc Twitter feed.

@everylotnyc is a Twitter bot that explores the City’s property database. It goes down the list in order by taxID number. Every half hour it composes a tweet for a property, consisting of the address, the borough and the Street View photo. It seems like it would be boring, but some people find it fascinating. Stephen Smith, in particular, has used it as the basis for some insightful commentary.

It occurred to me that @everylotnyc is actually a very powerful data visualization tool. When we think of “big data,” we usually think of maps and charts that try to encompass all the data – or an entire slice of it. The winning project from the Trees Count Data Jam was just such a project: identifying correlations between cooler streets and the presence of trees.

Social scientists, and even humanists recently, fight over quantitative and qualitative methods, but the fact is that we need them both. The ethnographer Michael Agar argues that distributional claims like “5.4 percent of trees in New York are in poor condition” are valuable, but primarily as a springboard for diving back into the data to ask more questions and answer them in an ongoing cycle. We also need to examine the world in detail before we even know which distributional questions to ask.

If our goal is to bring down the percentage of trees in Poor condition, we need to know why those trees are in Poor condition. What brought their condition down? Disease? Neglect? Pollution? Why these trees and not others?

Patterns of neglect are often due to the habits we develop of seeing and not seeing. We are used to seeing what is convenient, what is close, what is easy to observe, what is on our path. But even then, we develop filters to hide what we take to be irrelevant to our task at hand, and it can be hard to drop these filters. We can walk past a tree every day and not notice it. We fail to see the trees for the forest.

Privilege filters our experience in particular ways. A Parks Department scientist told me that the volunteer tree counts tended to be concentrated in wealthier areas of Manhattan and Brooklyn, and that many areas of the Bronx and Staten Island had to be counted by Parks staff. This reflects uneven amounts of leisure time and uneven levels of access to city resources across these neighborhoods, as well as uneven levels of walkability.

A time-honored strategy for seeing what is ordinarily filtered out is to deviate from our usual patterns, either with a new pattern or with randomness. This strategy can be traced at least as far as the sampling techniques developed by Pierre-Simon Laplace for measuring the population of Napoleon’s empire, the forerunner of modern statistical methods. Also among Laplace’s cultural heirs are the flâneurs of late nineteenth-century Paris, who studied the city by taking random walks through its crowds, as noted by Charles Baudelaire and Walter Benjamin.

In the tradition of the flâneurs, the Situationists of the mid-twentieth century highlighted the value of random walks, which they called dérives. Here is Guy Debord (1955, translated by Ken Knabb):

The sudden change of ambiance in a street within the space of a few meters; the evident division of a city into zones of distinct psychic atmospheres; the path of least resistance which is automatically followed in aimless strolls (and which has no relation to the physical contour of the ground); the appealing or repelling character of certain places — these phenomena all seem to be neglected. In any case they are never envisaged as depending on causes that can be uncovered by careful analysis and turned to account. People are quite aware that some neighborhoods are gloomy and others pleasant. But they generally simply assume that elegant streets cause a feeling of satisfaction and that poor streets are depressing, and let it go at that. In fact, the variety of possible combinations of ambiances, analogous to the blending of pure chemicals in an infinite number of mixtures, gives rise to feelings as differentiated and complex as any other form of spectacle can evoke. The slightest demystified investigation reveals that the qualitatively or quantitatively different influences of diverse urban decors cannot be determined solely on the basis of the historical period or architectural style, much less on the basis of housing conditions.

In an interview with Neil Freeman, the creator of @everylotbot, Cassim Shepard of Urban Omnibus noted the connections between the flâneurs, the dérive and Freeman’s work. Freeman acknowledged this: “How we move through space plays a huge and under-appreciated role in shaping how we process, perceive and value different spaces and places.”

Freeman did not choose randomness, but as he describes it in a tinyletter, the path of @everylotbot sounds a lot like a dérive:

@everylotnyc posts pictures in numeric order by Tax ID, which means it’s posting pictures in a snaking line that started at the southern tip of Manhattan and is moving north. Eventually it will cross into the Bronx, and in 30 years or so, it will end at the southern tip of Staten Island.

Freeman also alluded to the influence of Alfred Korzybski, who coined the phrase, “the map is not the territory”:

Streetview and the property database are both widely used because they’re big, (putatively) free, and offer a completionist, supposedly comprehensive view of the world. They’re also both products of people working within big organizations, taking shortcuts and making compromises.

I was not following @everylotnyc at the time, but I knew people who did. I had seen some of their retweets and commentaries. The bot shows us pictures of lots that some of us have walked past hundreds of times, but seeing them in our Twitter timelines makes us see them afresh and notice new things. It is the property we know, and yet we realize how much we don’t know it.

When I thought about those Street View images in the beta site, I realized that we could do the same thing for trees for the Trees Count Data Jam. I looked, and discovered that Freeman had made his code available on GitHub, so I started implementing it on a server I use. I shared my idea with Timm Dapper, Laura Silver and Elber Carneiro, and we formed a team to make it work by the deadline.

It is important to make this much clear: @everytreenyc may help to remind us that no census is ever flawless or complete, but it is not meant as a critique of the enterprise of tree counts. Similarly, I do not believe that @everylotnyc was meant as an indictment of property databases. On the contrary, just as @everylotnyc depends on the imperfect completeness of the New York City property database, @everytreenyc would not be possible without the imperfect completeness of the Trees Count 2015 census.

Without even an attempt at completeness, we could have no confidence that our random dive into the street forest was anything even approaching random. We would not be able to say that following the bot would give us a representative sample of the city’s trees. In fact, because I know that the census is currently incomplete in southern and eastern Queens, when I see trees from the Bronx and Staten Island and Astoria come up in my timeline I am aware that I am missing the trees of southeastern Queens, and awaiting their addition to the census.

Despite that fact, the current status of the 2015 census is good enough for now. It is good enough to raise new questions: what about that parking lot? Is there a missing tree in the Street View image because the image is newer than the census, or older? It is good enough to continue the cycle of diving and coming up, of passing through the funnel and back up, of moving from quantitative to qualitative and back again.

Quantitative needs qualitative, and vice versa

Data Science is all the rage these days. But this current craze focuses on a particular kind of data analysis. I conducted an informal poll as an icebreaker at a recent data science party, and most of the people I talked to said that it wasn’t data science if it didn’t include machine learning. Companies in all industries have been hiring “quants” to do statistical modeling. Even in the humanities, “distant reading” is a growing trend.

There has been a reaction to this, of course. Other humanists have argued for the continued value of close reading. Some companies have been hiring anthropologists and ethnographers. Academics, journalists and literary critics regularly write about the importance of nuance and empathy.

For years, my response to both types of arguments has been “we need both!” But this is not some timid search for a false balance or inclusion. We need both close examination and distributional analysis because the way we investigate the world depends on both, and both depend on each other.

I learned this from my advisor Melissa Axelrod, and a book she assigned me for an independent study on research methods. Michael Agar’s The Professional Stranger is a guide to ethnographic field methods, but it also contains some commentary on the nature of scientific inquiry, and mixes its well-deserved criticism of quantitative social science with a frank acknowledgment of the interdependence of qualitative and quantitative methods. On page 134 Agar discusses Labov’s famous study of /r/-dropping in New York City:

The catch, of course, is that he would never have known which variable to look at without the blood, sweat and tears of previous linguists who had worked with a few informants and identified problems in the linguistic structure of American English. All of which finally brings us to the point of this example: traditional ethnography struggles mightily with the existence of pattern among the few.

Labov acknowledges these contributions in Chapter 2 of his 1966 book: Babbitt (1896), Thomas (1932, 1942, 1951), Kurath (1949, based on interviews by Guy S. Lowman), Hubbell (1950) and Bronstein (1962). His work would not be possible without theirs, and their work was incomplete until he developed a theoretical framework to place their analysis in, and tested that framework with distributional surveys.

We’ve all seen what happens when people try to use one of these methods without the other. Statistical methods that are not grounded in close examination of specific examples produce surveys that are meaningless to the people who take them and uninformative to scientists. Qualitative investigations that are not checked with rigorous distributional surveys produce unfounded, misleading generalizations. The worst of both worlds are quantitative surveys that are neither broadly grounded in ethnography nor applied to representative samples.

It’s also clear in Agar’s book that qualitative and quantitative are not a binary distinction, but rather two ends of a continuum. Research starts with informal observations about specific things (people, places, events) that give rise to open-ended questions. The answers to these questions then provoke more focused questions that are asked of a wider range of things, and so on.

The concepts of broad and narrow, general and specific, can be confusing here, because at the qualitative, close or ethnographic end of the spectrum the questions are broad and general but asked about a narrow, specific set of subjects. At the quantitative, distant or distributional end of the spectrum the questions are narrow and specific, but asked of a broad, general range of subjects. Agar uses a “funnel” metaphor to model how the questions narrow during this progression, but he could just as easily have used a showerhead to model how the subjects broaden at the same time.

The progression is not one-way, either. The findings of a broad survey can raise new questions, which can only be answered by a new round of investigation, again beginning with qualitative examination on a small scale and possibly proceeding to another broad survey. This is one of the cycles that increase our knowledge.

Rather than the funnel metaphor, I prefer a metaphor based on seeing. Recently I’ve been re-reading The Omnivore’s Dilemma, and in Chapter 8 Michael Pollan talks about taking a close view of a field of grass:

In fact, the first time I met Salatin he’d insisted that even before I met any of his animals, I get down on my belly in this very pasture to make the acquaintance of the less charismatic species his farm was nurturing that, in turn, were nurturing his farm.

Pollan then gets up from the grass to take a broader view of the pasture, but later bends down again to focus on individual cows and plants. He does this metaphorically throughout the book, as many great authors do: focusing in on a specific case, then zooming out to discuss how that case fits in with the bigger picture. Whether he’s talking about factory-farmed Steer 534, or Budger the grass-fed cow, or even the thousands of organic chickens that are functionally nameless under the generic name of “Rosie,” he dives into specific details about the animals, then follows up by reporting statistics about these farming methods and the animals they raise.

The bottom line is that we need studies from all over the qualitative-quantitative spectrum. They build on each other, forming a cycle of knowledge. We need to fund them all, to hire people to do them all, and to promote and publish them all. If you do it right, the plural of anecdote is indeed data, and you can’t have data without anecdotes.

Why I probably won’t take your survey

I wrote recently that if you want to be confident in generalizing observations from a sample to the entire population, your sample needs to be representative. But maybe you’re skeptical. You might have noticed that a lot of people don’t pay much attention to representativeness, and somehow there are hardly any consequences for them. But that doesn’t mean that there are never consequences, for them or other people.

In the “hard sciences,” sampling can be easier. Unless there is some major impurity, a liter of water from New York usually has the same properties as one from Buenos Aires. If you’re worried about impurities you can distill the samples to increase the chance that they’re the same. Similarly, the commonalities in a basalt column or a wheel often outweigh any variation. A pigeon in New York is the same as one in London, right? A mother in New York is the same as a mother in Buenos Aires?

Well, maybe. As we’ve seen, a swan in New York can be very different from a swan in Sydney. And when we get into the realm of social sciences, things get more complex and the complexity gets hard to avoid. There are probably more differences between a mother in New York and one in Buenos Aires than for pigeons or stones or water, and the differences are more important to more people.

This is not just speculation based on rigid rules about sampling. As Bethany Brookshire wrote last year, psychologists are coming to realize the drawbacks of building so much of their science around WEIRD people. And when she says WEIRD, she means WEIRD like me: Western, Educated and from an Industrialized, Rich, Democratic country. And not just any WEIRD people, but college sophomores. Brookshire points out how much that skews the results in a particular study of virginity, but she also links to a review by Henrich, Heine and Norenzayan (2010) that examines several studies and concludes that “members of WEIRD societies, including young children, are among the least representative populations one could find for generalizing about humans.”

I think about this whenever I get an invitation to participate in a social science study. I get them pretty frequently, probably at least twice a week, on email lists and Twitter, and occasionally Tumblr and even Facebook. Often they’re directly from the researchers themselves: “Native English speakers, please fill out my questionnaire on demonstratives!” That means that they’re going primarily to a population of educated people, most of whom are white and from an industrialized, rich, democratic country.

(A quick reminder, in case you just tuned in: This applies to universal observations – percentages, averages and all-or-none statements. It does not apply to existential statements, where you simply say that you found ten people who say “less apples.” You take those wherever you find them, as long as they’re reliable sources.)

I don’t have a real problem with using non-representative samples for pilot studies. You have a hunch about something, you want to see if it’s not just you before you spend a lot of time sending a survey out to people you don’t know. I have a huge problem with it being used for anything that’s published in a peer-reviewed journal or disseminated in the mainstream media. And yeah, that means I have a huge problem with just about any online dialect survey.

I also don’t like the idea of students generalizing universal observations from non-representative online surveys for their term papers and master’s theses. People learn skills by doing. If they get practice taking representative samples, they’ll know how to do that. If they get practice making qualitative, existential observations, they’ll be able to do those. If they spend their time in school making unfounded generalizations from unrepresentative samples (with a bit of handwaving boilerplate, of course!), most of them will keep doing that after they graduate.

So that’s my piece. I’m actually going to keep relatively quiet about this because some of the people who do those studies (or their friends) might be on hiring committees, but I do want to at least register my objections here. And if you’re wondering why I haven’t filled out your survey, or even forwarded it to all my friends, this is your answer.

You can’t get significance without a representative sample

Recently I’ve talked about the different standards for existential and universal claims, how we can use representative samples to estimate universal claims, and how we know if our representative sample is big enough to be “statistically significant.” But I want to add a word of caution to these tests: you can’t get statistical significance without a representative sample.

If you work in social science you’ve probably seen p-values reported in studies that aren’t based on representative samples. They’re probably there because the authors took one required statistics class in grad school and learned that low p-values are good. It’s quite likely that these p-values were actually expected, if not explicitly requested, by the editors or reviewers of the article, who took a similar statistics class. And they’re completely useless.

P-values tell you whether your observation (often a mean, but not always) is based on a big enough sample that you can be 99% (or whatever) sure it’s not the luck of the draw. You are clear to generalize your representative sample to the entire population. But if your sample is not representative, it doesn’t matter!

Suppose you need 100% pure Austrian pumpkin seed oil, and you tell your friend to make sure he gets only the 100% pure kind. Your friend brings you 100% pure Australian tea tree oil. They’re both oils, and they’re both 100% pure, so your friend doesn’t understand why you’re so frustrated with him. But purity is irrelevant when you’ve got the wrong oil. P-values are the same way.
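To make that concrete, here is a minimal Python sketch with entirely made-up numbers. The population, the two groups and the score values are assumptions for illustration, and the margin of error uses the usual normal approximation (1.96 standard errors). It draws one representative sample and one convenience sample from the same hypothetical population and reports a mean and margin of error for each.

```python
import random
import statistics


def mean_and_margin(sample, z=1.96):
    """Return the sample mean and an approximate 95% margin of error."""
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / len(sample) ** 0.5
    return mean, z * std_err


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 9,000 people scoring around 50 and
# 1,000 easy-to-reach people scoring around 80.
hard_to_reach = [rng.gauss(50, 10) for _ in range(9000)]
easy_to_reach = [rng.gauss(80, 10) for _ in range(1000)]
population = hard_to_reach + easy_to_reach
true_mean = statistics.fmean(population)

# A representative sample draws from everyone; a convenience sample
# draws only from the people who are easy to reach.
rep_mean, rep_margin = mean_and_margin(rng.sample(population, 500))
conv_mean, conv_margin = mean_and_margin(rng.sample(easy_to_reach, 500))

print(f"true mean:      {true_mean:.1f}")
print(f"representative: {rep_mean:.1f} +/- {rep_margin:.1f}")
print(f"convenience:    {conv_mean:.1f} +/- {conv_margin:.1f}")
```

The convenience sample comes back with a margin of error that is just as tight as the representative one, but it is centered on the wrong number. The precision is real; it is just precision about the wrong oil.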

So please, don’t report p-values if you don’t have a representative sample. If the editor or reviewer insists, go ahead and put it in, but please roll your eyes while you’re running your t-tests. But if you are the editor or reviewer, please stop asking people for p-values if they don’t have a representative sample! Oh, and you might want to think about asking them to collect a representative sample…

How big a sample do you need?

In my post last week I talked about the importance of representative samples for making universal statements, including averages and percentages. But how big should your sample be? You don’t need to look at everything, but you probably need to look at more than one thing. How big a sample do you need in order to be reasonably sure of your estimates?

One of the pioneers in this area was a mysterious scholar known only to the public as Student. He took that name because he had been a student of the statistician Karl Pearson, and because he was generally a modest person. After his death, he was revealed to be William Sealy Gosset, Head Brewer for the Guinness Brewery. He had published his findings under a pseudonym so that the competing breweries would not realize the relevance of his work to brewing.

Pearson had connected sampling to probability, because for every item sampled there is a chance that it is not a good example of the population as a whole. He used the probability integral transformation, which required relatively large samples. Pearson’s preferred application was biometrics, where it was relatively easy to collect samples and get a good estimate of the probability integral.

The Guinness brewery was experimenting with different varieties of barley, looking for ones that would yield the most grain for brewing. The cost of sampling barley added up over time, and the number of samples that Pearson used would have been too expensive. Student’s t-test saved his employer money by making it easy to tell whether they had the minimum sample size that they needed for good estimates.

Both Pearson’s and Student’s methods resulted in equations and tables that allowed people to estimate the probability that the mean of their sample is inaccurate. This can be expressed as a margin of error or as a confidence interval, or as the p-value of the mean. The p-value depends on the number of items in your sample, and how much they vary from each other. The bigger the sample and the smaller the variance, the smaller the p-value. The smaller the p-value, the more likely it is that your sample mean is close to the actual mean of the population you’re interested in. For Student, a small p-value meant that the company didn’t have to go out and test more barley crops.
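The relationship between sample size, variation and uncertainty can be sketched in a few lines of Python. This is a simplified illustration, not Student's own calculation: it uses the normal approximation (a multiplier of 1.96 for roughly 95% confidence), which understates the margin for very small samples, and the function name and the numbers are assumptions chosen for the example.

```python
import math


def margin_of_error(std_dev, n, z=1.96):
    """Approximate 95% margin of error for a sample mean."""
    return z * std_dev / math.sqrt(n)


# Same variation, growing sample; then the same sample size, less variation.
for std_dev, n in [(10, 25), (10, 100), (10, 400), (5, 100)]:
    print(f"std dev {std_dev:>2}, n = {n:>3}: +/- {margin_of_error(std_dev, n):.2f}")
```

Because the sample size sits under a square root, you have to quadruple the sample to cut the margin in half, which is exactly why a brewery paying for every barley plot would care about the minimum sample it could get away with.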

Before you gather your sample, you decide how much uncertainty you’re willing to tolerate, in other words, a maximum p-value designated by α (alpha). When a sample’s p-value is lower than the α-value, it is said to be significant. One popular α-value is 0.05, but this is often decided collectively, and enforced by journal editors and thesis advisors who will not accept an article where the results don’t meet their standards for statistical significance.
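Here is a rough Python sketch of how that decision works for a one-sample t-test. Instead of computing a p-value directly, it compares the t statistic with a printed-table critical value, which amounts to the same decision: the p-value is below α = 0.05 exactly when the absolute t statistic exceeds the critical value. The barley yields and the reference yield of 100 are invented for illustration, and the small lookup table covers only a few sample sizes.

```python
import math
import statistics

# Two-sided critical values of Student's t for alpha = 0.05,
# keyed by degrees of freedom (sample size minus one).
T_CRITICAL_05 = {4: 2.776, 9: 2.262, 19: 2.093, 29: 2.045}


def t_statistic(sample, reference_mean):
    """Distance of the sample mean from the reference, in standard errors."""
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.fmean(sample) - reference_mean) / std_err


def is_significant(sample, reference_mean):
    """True when |t| exceeds the tabled critical value for alpha = 0.05."""
    critical = T_CRITICAL_05[len(sample) - 1]
    return abs(t_statistic(sample, reference_mean)) > critical


# Hypothetical yields from ten plots of a new barley variety.
yields = [102, 98, 105, 110, 99, 104, 101, 107, 103, 106]
REFERENCE = 100  # hypothetical yield of the current variety

print(is_significant(yields[:5], REFERENCE))  # first five plots only
print(is_significant(yields, REFERENCE))      # all ten plots
```

With these numbers the first five plots are not enough to clear the threshold, while all ten are. That is the sense in which the test tells you whether your sample is big enough.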

The tests of significance determined by Pearson, Student, Ronald Fisher and others are hugely valuable. In science it is quite common to get false positives, where it looks like you’ve found interesting results but you just happened to sample some unusual items. Achieving statistical significance tells you that the interesting results are probably not just an accident of sampling. These tests protect the public from inaccurate data.

Like every valuable innovation, tests of statistical significance can be overused. I’ll talk about that in a future post.

Estimating universals, averages and percentages

In my previous post, I discussed the differences between existential and universal statements. In particular, the standard of evidence is different: to be sure that an existential statement is correct we only need to see one example, but to be sure a universal is correct we have to have examined everything.

But what if we don’t have the time to examine everything, and we don’t have to be absolutely sure? As it turns out, a lot of times we can be pretty sure. We just need a representative sample of everything. It’s quicker than examining every member of your population, and it may even be more accurate, since there are always measurement errors, and measuring a lot of things increases the chance of an error.

Pierre-Simon Laplace figured that out for the French Empire. In the early nineteenth century, Napoleon had conquered half of Europe, but he didn’t have a good idea of how many subjects he had. Based on the work of Thomas Bayes, Laplace knew that a relatively small sample of data would give him a good estimate. He also figured out that he needed the sample to be representative to get a good estimate.

“The most precise method consists of (1) choosing districts distributed in a roughly uniform manner throughout the Empire, in order to generalize the result, independently of local circumstances,” wrote Laplace in 1814. If you didn’t have a uniform distribution, you might wind up getting all your data from mountainous districts and underestimating the population, or getting data from urban districts and overestimating. Another way to avoid basing your generalizations on unrepresentative data is random sampling.
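A toy version of the problem can be sketched in Python. This is not Laplace's actual calculation, which was more involved. It is a simplified stand-in that scales the average sampled district up to the whole empire, and the district counts and populations are invented for illustration.

```python
import random

# Hypothetical empire of 100 districts, listed in geographic order:
# 20 urban, 50 rural and 30 mountainous.
districts = [50_000] * 20 + [10_000] * 50 + [2_000] * 30
true_total = sum(districts)


def estimate_total(sampled):
    """Scale the mean sampled district up to all districts."""
    return sum(sampled) / len(sampled) * len(districts)


evenly_spread = districts[5::10]   # every tenth district across the map
only_urban = districts[:10]        # ten districts from the urban end
only_mountain = districts[-10:]    # ten districts from the mountains
random_draw = random.Random(1).sample(districts, 10)

print(f"true total:     {true_total:,}")
print(f"evenly spread:  {estimate_total(evenly_spread):,.0f}")
print(f"urban only:     {estimate_total(only_urban):,.0f}")
print(f"mountain only:  {estimate_total(only_mountain):,.0f}")
print(f"random draw:    {estimate_total(random_draw):,.0f}")
```

All of these samples are the same size. The evenly spread one lands on the true total because it includes each kind of district in proportion, while the urban and mountain samples miss in opposite directions. A random draw of ten can still wobble, which is where sample size comes in.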

"Is Our Face Red!" says the Literary Digest

A lot of social scientists, including linguists, understand the value of sampling. But many of them don’t understand that it’s representative sampling that has value. Unrepresentative samples are worse than no samples, because they can give you a false sense of certainty.

A famous example mentioned in Wikipedia is when a Literary Digest poll forecast that Alfred M. Landon would defeat Franklin Delano Roosevelt in the 1936 Presidential election. That poll was biased because the sample was taken from lists of people who owned telephones and automobiles, and those people were not representative of the voters overall. The editors of the Literary Digest were not justified in generalizing those universal statements to the electorate as a whole, and thus failed to predict Roosevelt’s re-election.
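The mechanism is easy to reproduce in a small Python simulation. The electorate below is hypothetical: the share of phone and car owners and the vote splits within each group are made up for illustration, not taken from the 1936 returns or from the Digest's own figures.

```python
import random

# Hypothetical electorate of 10,000 voters. The ownership rate and
# vote splits below are made up for illustration, not historical data.
owners = ["Landon"] * 1710 + ["Roosevelt"] * 1290      # 3,000 phone/car owners
non_owners = ["Landon"] * 2100 + ["Roosevelt"] * 4900  # 7,000 everyone else
electorate = owners + non_owners


def landon_share(voters):
    """Fraction of the given voters who support Landon."""
    return voters.count("Landon") / len(voters)


rng = random.Random(1936)  # fixed seed so the run is repeatable
owner_poll = rng.sample(owners, 1000)       # polled from ownership lists
random_poll = rng.sample(electorate, 1000)  # polled from all voters

print(f"whole electorate: {landon_share(electorate):.1%}")
print(f"owners-only poll: {landon_share(owner_poll):.1%}")
print(f"random poll:      {landon_share(random_poll):.1%}")
```

Both polls ask a thousand people, so the sample size is not the problem. The owners-only poll is an accurate percentage of phone and car owners; it just isn't a percentage of the electorate.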

"Average Italian Female" by Colin Spears

What can be deceiving is that you get things that look like averages and percentages. And they are averages and percentages! But they’re not necessarily averages of the things you want an average of. A striking example comes from a blogger named Colin Spears, who was intrigued by a “facial averaging” site set up by some researchers at the University of Aberdeen (they’ve since moved to Glasgow). Spears uploaded pictures from 41 groups, including “Chad and Cameroonian,” and created “averages.” These pictures were picked up by a number of websites, stripped of their credits, and bundled with all kinds of misleading and inaccurate information, as detailed by Lisa DeBruine, one of the creators of the software used by Spears.

Some bloggers, like Jezebel’s Margaret Hartmann, noted that the “averages” all looked to be around twenty years old, which is not the median age for most countries according to the CIA World Factbook (which presumably relies on better samples). In fact, the median age for Italian women (see image) is 45.6. The face in the image looks to be in its twenties, because that’s the age of the people in the images that Spears uploaded to the Aberdeen site. So we got averages of some Italian women, but nothing that actually represents the average (of all) Italian women. (Some blog posts about this even showed a very light-skinned face for “Average South African Woman,” but that was just a mislabeled “Average Argentine Woman.”)

Keep this in mind the next time you see an average or a percentage. What was their sampling method? If it wasn’t uniform or random, it’s not an average or percentage of anything meaningful. If you trust it, you may wind up spreading inaccuracies, like a prediction for President Landon or a twentysomething average Italian woman. And won’t your face be red!