On this day in Parisian theater

Since I first encountered The Parisian Stage, I’ve been impressed by the completeness of Beaumont Wicks’s life’s work: from 1950 through 1979 he compiled a list of every play performed in the theaters of Paris between 1800 and 1899. I’ve used it as the basis for my Digital Parisian Stage corpus, currently a one percent sample of the first volume (Wicks 1950), available in full text on GitHub.

Last week I had an idea for another project. Science requires both qualitative and quantitative research, and I’ve admired Neil Freeman’s @everylotnyc Twitter bot as a project that conveys the diversity of the underlying data and invites deep, qualitative exploration.

In 2016, with Timm Dapper, Elber Carneiro and Laura Silver I forked Freeman’s everylotbot code to create @everytreenyc, a random walk through the New York City Parks Department’s 2015 street tree census. Every three hours during normal New York active time, the bot tweets information about a tree from the database, in a template written by Laura that may also include topical, whimsical sayings.
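
For the curious, here is a minimal sketch of the random-selection-plus-template idea, with invented tree records and sayings standing in for the census data and Laura's actual templates.

```python
import random

# Invented records and sayings; the real bot draws on the Parks Department
# census and Laura's templates.
TREES = [
    {"species": "pin oak", "address": "123 Example Street", "borough": "Queens"},
    {"species": "honeylocust", "address": "45 Sample Avenue", "borough": "the Bronx"},
]
SAYINGS = ["Stay shady!", "Leaf through your neighborhood.", ""]

def compose_tweet():
    tree = random.choice(TREES)       # random walk: pick any tree in the census
    saying = random.choice(SAYINGS)   # sometimes add a whimsical saying
    text = "A {} at {}, {}. {}".format(
        tree["species"], tree["address"], tree["borough"], saying
    )
    return text.strip()

print(compose_tweet())
```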

Recently I’ve encountered a lot of anniversaries. Many of them are connected to the centenary of the First World War, but some are more random: I just listened to an episode of la Fabrique de l’histoire about François Mitterrand’s letters to his mistress that was promoted with the fact that he was born in 1916, one hundred years before that episode aired, even though he did not start writing those letters until 1962.

There are lots of “On this day” blogs and Twitter feeds, such as the History Channel and the New York Times, and even specialized feeds like @ThisDayInMETAL. There are #OnThisDay and #otd hashtags, and in French #CeJourLà. The “On this day” feeds have two things in common: they tend to be hand-curated, and they jump around from year to year. For April 13, 2014, the @CeJourLa feed tweeted events from 1849, 1997, 1695 and 1941, in that order.

Two weeks ago I was at the Annual Convention of the Modern Language Association, describing my Digital Parisian Stage corpus, and I realized that in the Parisian Stage there were plays being produced exactly two hundred years ago. I thought of the #OnThisDay feeds and @everytreenyc, and realized that I could create a Twitter bot to pull information about plays from the database and tweet them out. A week later, @spectacles_xix sent out its first automated tweet, about the play la Réconciliation par ruse.

@spectacles_xix runs on Pythonanywhere in Python 3.6, and accesses a MySQL database. It uses Mike Verdone’s Twitter API client. The source is open on GitHub.
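
To make the mechanics concrete, here is a minimal sketch of how a bot like this might look up the plays that premièred exactly two hundred years ago today and tweet them. The schema, credentials and table names are invented, and I'm assuming the pymysql driver for illustration; the actual @spectacles_xix code on GitHub differs in its details.

```python
from datetime import date

import pymysql
from twitter import Twitter, OAuth  # Mike Verdone's Python Twitter Tools

def plays_on_this_day(connection, years_ago=200):
    """Return plays that premièred on today's month and day, years_ago years back."""
    today = date.today()
    # (A real bot would need to handle February 29 separately.)
    target = today.replace(year=today.year - years_ago)
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT title, theater FROM plays WHERE premiere_date = %s",
            (target,),
        )
        return cursor.fetchall()

if __name__ == "__main__":
    # Invented credentials and schema, for illustration only.
    db = pymysql.connect(host="localhost", user="bot", password="...", db="spectacles")
    api = Twitter(auth=OAuth("token", "token_secret", "consumer_key", "consumer_secret"))
    for title, theater in plays_on_this_day(db):
        api.statuses.update(
            status="On this day in {}: {}, at the {}.".format(
                date.today().year - 200, title, theater
            )
        )
```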

Unlike other feeds, including this one from the French Ministry of Culture that just tweeted about the anniversary of the première of Rostand’s Cyrano de Bergerac, this one will not be curated, and it will not jump around from year to year. It will tweet every play that premièred in 1818, in order, until the end of the year, and then go on to 1819. If there is a day when no plays premièred, like January 16, @spectacles_xix will not tweet.
I have a couple of ideas about more features to add, so stay tuned!

And we mean really every tree!

When Timm, Laura, Elber and I first ran the @everytreenyc Twitter bot almost a year ago, we knew that it wasn’t actually sampling from a list that included every street tree in New York City. The Parks Department’s 2015 Tree Census was a huge undertaking, and was not complete by the time they organized the Trees Count! Data Jam last June. There were large chunks of the city missing, particularly in Southern and Eastern Queens.

The bot software itself was not a bad job for a day’s work, but it was still a hasty patch job on top of Neil Freeman’s original Everylotbot code. I hadn’t updated the readme file to reflect the changes we had made. It was running on a server in the NYU Computer Science Department, which is currently my most precarious affiliation.

On April 28 I received an email from the Parks Department saying that the census was complete, and the final version had been uploaded to the NYC Open Data Portal. It seemed like a good opportunity to upgrade.

Over the past two weeks I’ve downloaded the final tree database, installed everything on Pythonanywhere, streamlined the code, added a function to deal with Pythonanywhere’s limited scheduler, and updated the readme file. People who follow the bot might have noticed a few extra tweets over the past couple of days as I did final testing, but I’ve removed the cron job at NYU, and @everytreenyc is now up and running in its new home, with the full database, a week ahead of its first birthday. Enjoy the dérive!
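
Pythonanywhere's scheduled tasks run at fixed intervals rather than on an arbitrary cron schedule, so the script itself has to decide whether a given run falls inside the bot's active window. Here is a minimal sketch of that kind of guard, assuming pytz for the New York timezone and an invented set of active hours; the actual function in the repository may differ.

```python
from datetime import datetime

import pytz

# Invented active window: tweet only every third hour between 9am and 9pm
# New York time. The real schedule may differ.
ACTIVE_HOURS = range(9, 22, 3)  # 9, 12, 15, 18, 21

def should_tweet(now=None):
    """Return True if this scheduled run falls inside the bot's active window."""
    now = now or datetime.now(pytz.timezone("America/New_York"))
    return now.hour in ACTIVE_HOURS

if __name__ == "__main__":
    if should_tweet():
        pass  # build and send the tweet here
    # otherwise exit quietly and wait for the next scheduled run
```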

Is your face red?

In 1936, Literary Digest magazine made completely wrong predictions about the Presidential election. They did this because they polled based on a bad sample: driver’s licenses and subscriptions to their own magazine. Enough people who didn’t drive or subscribe to Literary Digest voted, and they voted for Roosevelt. The magazine’s editors’ faces were red, and they had the humility to put that on the cover.

This year, the 538 website made completely wrong predictions about the Presidential election, and its editor, Nate Silver, sorta kinda took responsibility. He had put too much trust in polls conducted at the state level. They were not representative of the full spectrum of voter opinion in those states, and this had skewed his predictions.

Silver’s face should be redder than that, because he said that his conclusions were tentative, but he did not act like it. When your results are so unreliable and your data is so problematic, you have no business being on television and in front-page news articles as much as Silver has.

In part this attitude of Silver’s comes from the worldview of sports betting, where the gamblers know they want to bet and the only question is which team they should put their money on. There is some hedging, but not much. Democracy is not a gamble, and people need to be prepared for all outcomes.

But the practice of blithely making grandiose claims based on unrepresentative data, while mouthing insincere disclaimers, goes far beyond election polling. It is widespread in the social sciences, and I see it all the time in linguistics and transgender studies. It is pervasive in the relatively new field of Data Science, and Big Data is frequently Non-representative Data.

At the 2005 meeting of the American Association for Corpus Linguistics there were two sets of interactions that stuck with me and have informed my thinking over the years. The first was a plenary talk by the computer scientist Ken Church. He described in vivid terms the coming era of cheap storage and bandwidth, and the resulting big data boom.

But Church went awry when he claimed that the size of the datasets available, and the computational power to analyze them, would obviate the need for representative samples. It is true that if you can analyze everything you do not need a sample. But that’s not the whole story.

A day before Church’s talk I had had a conversation over lunch with David Lee, who had just written his dissertation on the sampling problems in the British National Corpus. Lee had reiterated what I had learned in statistics class: if you simply have most of the data but your data is incomplete in non-random ways, you have a biased sample and you can’t make generalizations about the whole.

I’ve seen this a lot in the burgeoning field of Data Science. There are too many people performing analyses they don’t understand on data that’s not representative, making unfounded generalizations. As long as these generalizations fit within the accepted narratives, nobody looks twice.

We need to stop making it easier to run through the steps of data analysis, and instead make it easier to get those steps right. Especially sampling. Or our faces are going to be red all the time.

Sampling is a labor-saving device

Last month I wrote those words on a slide I was preparing to show to the American Association for Corpus Linguistics, as a part of a presentation of my Digital Parisian Stage Corpus. I was proud of having a truly representative sample of theatrical texts performed in Paris between 1800 and 1815, and thus finding a difference in the use of negation constructions that was not just large but statistically significant. I wanted to convey the importance of this.

I was thinking about Laplace finding the populations of districts “distributed evenly throughout the Empire,” and Student inventing his t-test to help workers at the Guinness plants determine the statistical significance of their results. Laplace was not after accuracy, he was going for speed. Student was similarly looking for the minimum amount of effort required to produce an acceptable level of accuracy. The whole point was to free resources up for the next task.

I attended one paper at the conference that gave p-values for all its variables, and they were all 0.000. After that talk, I told the student who presented that those values indicated he had oversampled, and he should have stopped collecting data much sooner. “That’s what my advisor said too,” he said, “but this way we’re likely to get statistical significance for other variables we might want to study.”

The student had a point, but it doesn’t seem very – well, “agile” is a word I’ve been hearing a lot lately. In any case, as the conference was wrapping up, it occurred to me that I might have several hours free – on my flight home and before – to work on my research.

My initial impulse was to keep doing what I’ve been doing for the past couple of years: clean up OCRed text and tag it for negation. Then it occurred to me that I really ought to take my own advice. I had achieved statistical significance. That meant it was time to move on!

I have started working on the next chunk of the nineteenth century, from 1816 through 1830. I have also been looking into other variables to examine. I’ve got some ideas, but I’m open to suggestions. Send them if you have them!

@everytreenyc

At the beginning of June I participated in the Trees Count Data Jam, experimenting with the results of the census of New York City street trees begun by the Parks Department in 2015. I had seen a beta version of the map tool created by the Parks Department’s data team that included images of the trees pulled from the Google Street View database. Those images reminded me of others I had seen in the @everylotnyc twitter feed.

@everylotnyc is a Twitter bot that explores the City’s property database. It goes down the list in order by tax ID number. Every half hour it composes a tweet for a property, consisting of the address, the borough and the Street View photo. It seems like it would be boring, but some people find it fascinating. Stephen Smith, in particular, has used it as the basis for some insightful commentary.

It occurred to me that @everylotnyc is actually a very powerful data visualization tool. When we think of “big data,” we usually think of maps and charts that try to encompass all the data – or an entire slice of it. The winning project from the Trees Count Data Jam was just such a project: identifying correlations between cooler streets and the presence of trees.

Social scientists, and even humanists recently, fight over quantitative and qualitative methods, but the fact is that we need them both. The ethnographer Michael Agar argues that distributional claims like “5.4 percent of trees in New York are in poor condition” are valuable, but primarily as a springboard for diving back into the data to ask more questions and answer them in an ongoing cycle. We also need to examine the world in detail before we even know which distributional questions to ask.

If our goal is to bring down the percentage of trees in Poor condition, we need to know why those trees are in Poor condition. What brought their condition down? Disease? Neglect? Pollution? Why these trees and not others?

Patterns of neglect are often due to the habits we develop of seeing and not seeing. We are used to seeing what is convenient, what is close, what is easy to observe, what is on our path. But even then, we develop filters to hide what we take to be irrelevant to our task at hand, and it can be hard to drop these filters. We can walk past a tree every day and not notice it. We fail to see the trees for the forest.

Privilege filters our experience in particular ways. A Parks Department scientist told me that the volunteer tree counts tended to be concentrated in wealthier areas of Manhattan and Brooklyn, and that many areas of the Bronx and Staten Island had to be counted by Parks staff. This reflects uneven amounts of leisure time and uneven levels of access to city resources across these neighborhoods, as well as uneven levels of walkability.

A time-honored strategy for seeing what is ordinarily filtered out is to deviate from our usual patterns, either with a new pattern or with randomness. This strategy can be traced at least as far as the sampling techniques developed by Pierre-Simon Laplace for measuring the population of Napoleon’s empire, the forerunner of modern statistical methods. Also among Laplace’s cultural heirs are the flâneurs of late nineteenth-century Paris, who studied the city by taking random walks through its crowds, as noted by Charles Baudelaire and Walter Benjamin.

In the tradition of the flâneurs, the Situationists of the mid-twentieth century highlighted the value of random walks, which they called dérives. Here is Guy Debord (1955, translated by Ken Knabb):

The sudden change of ambiance in a street within the space of a few meters; the evident division of a city into zones of distinct psychic atmospheres; the path of least resistance which is automatically followed in aimless strolls (and which has no relation to the physical contour of the ground); the appealing or repelling character of certain places — these phenomena all seem to be neglected. In any case they are never envisaged as depending on causes that can be uncovered by careful analysis and turned to account. People are quite aware that some neighborhoods are gloomy and others pleasant. But they generally simply assume that elegant streets cause a feeling of satisfaction and that poor streets are depressing, and let it go at that. In fact, the variety of possible combinations of ambiances, analogous to the blending of pure chemicals in an infinite number of mixtures, gives rise to feelings as differentiated and complex as any other form of spectacle can evoke. The slightest demystified investigation reveals that the qualitatively or quantitatively different influences of diverse urban decors cannot be determined solely on the basis of the historical period or architectural style, much less on the basis of housing conditions.

In an interview with Neil Freeman, the creator of everylotbot, Cassim Shepard of Urban Omnibus noted the connections between the flâneurs, the dérive and Freeman’s work. Freeman acknowledged this: “How we move through space plays a huge and under-appreciated role in shaping how we process, perceive and value different spaces and places.”

Freeman did not choose randomness, but as he describes it in a tinyletter, the path of @everylotnyc sounds a lot like a dérive:

@everylotnyc posts pictures in numeric order by Tax ID, which means it’s posting pictures in a snaking line that started at the southern tip of Manhattan and is moving north. Eventually it will cross into the Bronx, and in 30 years or so, it will end at the southern tip of Staten Island.

Freeman also alluded to the influence of Alfred Korzybski, who coined the phrase, “the map is not the territory”:

Streetview and the property database are both widely used because they’re big, (putatively) free, and offer a completionist, supposedly comprehensive view of the world. They’re also both products of people working within big organizations, taking shortcuts and making compromises.

I was not following @everylotnyc at the time, but I knew people who did. I had seen some of their retweets and commentaries. The bot shows us pictures of lots that some of us have walked past hundreds of times, but seeing them in our twitter timelines makes us see them fresh again and notice new things. It is the property we know, and yet we realize how much we don’t know it.

When I thought about those Street View images in the beta site, I realized that we could do the same thing for trees for the Trees Count Data Jam. I looked, and discovered that Freeman had made his code available on GitHub, so I started implementing it on a server I use. I shared my idea with Timm Dapper, Laura Silver and Elber Carneiro, and we formed a team to make it work by the deadline.

It is important to make this much clear: @everytreenyc may help to remind us that no census is ever flawless or complete, but it is not meant as a critique of the enterprise of tree counts. Similarly, I do not believe that @everylotnyc was meant as an indictment of property databases. On the contrary, just as @everylotnyc depends on the imperfect completeness of the New York City property database, @everytreenyc would not be possible without the imperfect completeness of the Trees Count 2015 census.

Without even an attempt at completeness, we could have no confidence that our random dive into the street forest was anything even approaching random. We would not be able to say that following the bot would give us a representative sample of the city’s trees. In fact, because I know that the census is currently incomplete in southern and eastern Queens, when I see trees from the Bronx and Staten Island and Astoria come up in my timeline I am aware that I am missing the trees of southeastern Queens, and awaiting their addition to the census.

Despite that fact, the current status of the 2015 census is good enough for now. It is good enough to raise new questions: what about that parking lot? Is there a missing tree in the Street View image because the image is newer than the census, or older? It is good enough to continue the cycle of diving and coming up, of passing through the funnel and back up, of moving from quantitative to qualitative and back again.

Sampling and the digital humanities

I was pleased to have the opportunity to announce some progress on my Digital Parisian Stage project in a lightning talk at the kickoff event for New York City Digital Humanities Week on Tuesday. One theme that was expressed by several other digital humanists that day was the sheer volume of interesting stuff being produced daily, and collected in our archives.

I was particularly struck by Micki McGee’s story of how working on the Yaddo archive challenged her commitment to “horizontality” – flattening hierarchies, moving beyond the “greats” and finding valuable work and stories beyond the canon. The archive was simply too big for her to give everyone the treatment they deserved. She talked about using digital tools to overcome that size, but still was frustrated in the end.

At the KeystoneDH conference this summer I found out about the work of Franco Moretti, who similarly uses digital tools to analyze large corpora. Moretti’s methods seem very useful, but on Tuesday we saw that a lot of people were simply not satisfied with “distant reading.”

I am of the school that sees quantitative and qualitative methods as two ends of a continuum of tools, all of which are necessary for understanding the world. This is not even a humanities thing: from geologists with hammers to psychologists in clinics, all the sciences rely on close observation of small data sets.

My colleague in the NYU Computer Science Department, Adam Myers, uses the same approach to do natural language processing; I have worked with him on projects like this (PDF). We begin with a close reading of texts from the chosen corpus, then decide on a set of interesting patterns to annotate. As we annotate more and more texts, the patterns come into sharper focus, and eventually we use these annotations to train machine learning routines.
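
As a toy illustration of that annotate-then-train loop (not Adam's actual pipeline, and not a claim that these patterns can be tagged reliably by machine), here is a sketch using scikit-learn: a handful of hand-annotated sentences train a classifier that can then suggest labels for new text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-annotated sample, invented for illustration: sentences labeled
# for whether they negate with "ne" alone or with "ne ... pas".
sentences = [
    "je ne sais ce que vous voulez dire",
    "il n'ose le lui avouer",
    "je ne veux pas vous entendre",
    "nous n'avons pas le temps",
]
labels = ["ne_alone", "ne_alone", "ne_pas", "ne_pas"]

# Bag-of-words features feeding a simple classifier; with more annotated
# text, the learned patterns come into sharper focus, as in manual annotation.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(sentences, labels)
print(model.predict(["elle ne comprend pas la question"]))
```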

One question that arises with these methods is what to look at first. There is an assumption of uniformity in physics and chemistry, so that scientists can assume that one milliliter of ethyl alcohol will behave more or less like any other milliliter of ethyl alcohol under similar conditions. People are much less interchangeable, leading to problems like WEIRD bias in psychology. Groups of people and their conventions are even more complex, making it even more unlikely that the easiest texts or images to study are going to give us an accurate picture of the whole archive.

Fortunately, this is a solved problem. Pierre-Simon Laplace figured out in 1814 that he could get a reasonable estimate of the population of the French Empire by looking at a representative sample of its départements, and subsequent generations have improved on his sampling techniques.
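
The logic usually attributed to Laplace is a ratio estimate: take a quantity that is recorded everywhere (births, in his case), measure the ratio of population to that quantity in a representative sample of places, and multiply. A sketch with invented numbers:

```python
# Ratio estimation in the spirit of Laplace, with invented numbers.
# Births are registered everywhere; population is only counted in the sample.
total_births_nationwide = 1_000_000        # known from registers (invented)
sample_population = 2_000_000              # counted in the sampled districts (invented)
sample_births = 70_000                     # registered in those same districts (invented)

ratio = sample_population / sample_births  # people per registered birth
estimated_population = total_births_nationwide * ratio
print(round(estimated_population))         # about 28.6 million with these numbers
```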

We may not be able to analyze all the things, but if we study enough of them we may be able to get a good idea of what the rest are like. William Sealy “Student” Gosset developed his famous t-test precisely to avoid having to analyze all the things. His employers at the Guinness Brewery wanted to compare different strains of barley without testing every plant in the batch. The p-value told them whether they had sampled enough plants.
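
Here is a minimal sketch of the kind of comparison Student's test makes possible, with invented yield figures standing in for the Guinness data:

```python
from scipy import stats

# Invented plot yields (e.g., bushels per acre) for two barley strains.
strain_a = [54.2, 57.1, 52.8, 56.4, 55.0, 53.7]
strain_b = [58.9, 60.3, 57.5, 61.0, 59.2, 58.1]

# Student's two-sample t-test: is the difference in means more than
# we would expect from the luck of the draw at this sample size?
t_stat, p_value = stats.ttest_ind(strain_a, strain_b)
print(t_stat, p_value)
```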

I share McGee’s appreciation of “horizontality” and looking beyond the greats, and in my Digital Parisian Stage corpus I achieved that horizontality with the methods developed by Laplace and Student. The creators of the FRANTEXT corpus chose its texts using the “principle of authority,” in essence just using the greats. For my corpus I built on the work of Charles Beaumont Wicks, taking a random sample from his list of all the plays performed in Paris between 1800 and 1815.

What I found was that characters in the randomly selected plays used a lot less of the conservative ne alone construction to negate sentences than characters in the FRANTEXT plays. This seems to be because the FRANTEXT plays focused mostly on aristocrats making long declamatory speeches, while the randomly selected plays also included characters who were servants, peasants, artisans and bourgeois, often in faster-moving dialogue. The characters from the lower classes tended to use much more of the ne … pas construction, while the aristocrats tended to use ne alone.

Student’s t-test tells me that the difference I found in the relative frequency of ne alone in just four plays was big enough that I could be confident of finding the same pattern in other plays. Even so, I plan to produce the full one percent sample (31 plays) so that I can test for differences that might be smaller.

It’s important for me to point out here that this kind of analysis still requires a fairly close reading of the text. Someone might say that I just haven’t come up with the right regular expression or parser, but at this point I don’t know of any automatic tools that can reliably distinguish the negation phenomena that interest me. I find that to really get an accurate picture of what’s going on I have to not only read several lines before and after each instance of negation, but in fact the entire play. Sampling reduces the number of times I have to do that reading, to bring the overall workload down to a reasonable level.

Okay, you may be saying, but I want to analyze all the things! Even a random sample isn’t good enough. Well, if you don’t have the time or the money to analyze all the things, a random sample can make the case for analyzing everything. For example, I found several instances of the pas alone construction, which is now common but was rare in the early nineteenth century. I also turned up the script for a pantomime about the death of Captain Cook that gave the original Hawaiian characters a surprising level of intelligence and agency, given what little I knew about attitudes of the period.

If either of those findings intrigued you and made you want to work on the project, or fund it, or hire me, that illustrates another use of sampling. (You should also email me.) Sampling gives us a place to start outside of the “greats,” where we can find interesting information that may inspire others to get involved.

One final note: the first step to getting a representative sample is to have a catalog. You won’t be able to generalize to all the things until you have a list of all the things. This is why my Digital Parisian Stage project owes so much to Beaumont Wicks. This “paper and ink” humanist spent his life creating a list of every play performed in Paris in the nineteenth century – the catalog that I sampled for my corpus.

Why I probably won’t take your survey

I wrote recently that if you want to be confident in generalizing observations from a sample to the entire population, your sample needs to be representative. But maybe you’re skeptical. You might have noticed that a lot of people don’t pay much attention to representativeness, and somehow there are hardly any consequences for them. But that doesn’t mean that there are never consequences, for them or other people.

In the “hard sciences,” sampling can be easier. Unless there is some major impurity, a liter of water from New York usually has the same properties as one from Buenos Aires. If you’re worried about impurities you can distill the samples to increase the chance that they’re the same. Similarly, the commonalities in a basalt column or a wheel often outweigh any variation. A pigeon in New York is the same as one in London, right? A mother in New York is the same as a mother in Buenos Aires.

Well, maybe. As we’ve seen, a swan in New York can be very different from a swan in Sydney. And when we get into the realm of social sciences, things get more complex and the complexity gets hard to avoid. There are probably more differences between a mother in New York and one in Buenos Aires than for pigeons or stones or water, and the differences are more important to more people.

This is not just speculation based on rigid rules about sampling. As Bethany Brookshire wrote last year, psychologists are coming to realize the drawbacks of building so much of their science around WEIRD people. And when she says WEIRD, she means WEIRD like me: White, Educated and from an Industrialized, Rich, Democratic country. And not just any WEIRD people, but college sophomores. Brookshire points out how much that skews the results in a particular study of virginity, but she also links to a review by Henrich, Heine and Norenzayan (2010) that examines several studies and concludes that “members of WEIRD societies, including young children, are among the least representative populations one could find for generalizing about humans.”

I think about this whenever I get an invitation to participate in a social science study. I get them pretty frequently, probably at least twice a week, on email lists and Twitter, and occasionally Tumblr and even Facebook. Often they’re directly from the researchers themselves: “Native English speakers, please fill out my questionnaire on demonstratives!” That means that they’re going primarily to a population of educated people, most of whom are white from an industrialized, rich, democratic country.

(A quick reminder, in case you just tuned in: This applies to universal observations – percentages, averages and all or none statements. It does not apply to existential statements, where you simply say that you found ten people who say “less apples.” You take those wherever you find them, as long as they’re reliable sources.)

I don’t have a real problem with using non-representative samples for pilot studies. You have a hunch about something, you want to see if it’s not just you before you spend a lot of time sending a survey out to people you don’t know. I have a huge problem with it being used for anything that’s published in a peer-reviewed journal or disseminated in the mainstream media. And yeah, that means I have a huge problem with just about any online dialect survey.

I also don’t like the idea of students generalizing universal observations from non-representative online surveys for their term papers and master’s theses. People learn skills by doing. If they get practice taking representative samples, they’ll know how to do that. If they get practice making qualitative, existential observations, they’ll be able to do those. If they spend their time in school making unfounded generalizations from unrepresentative samples (with a bit of handwaving boilerplate, of course!), most of them will keep doing that after they graduate.

So that’s my piece. I’m actually going to keep relatively quiet about this because some of the people who do those studies (or their friends) might be on hiring committees, but I do want to at least register my objections here. And if you’re wondering why I haven’t filled out your survey, or even forwarded it to all my friends, this is your answer.

You can’t get significance without a representative sample

Recently I’ve talked about the different standards for existential and universal claims, how we can use representative samples to estimate universal claims, and how we know if our representative sample is big enough to be “statistically significant.” But I want to add a word of caution to these tests: you can’t get statistical significance without a representative sample.

If you work in social science you’ve probably seen p-values reported in studies that aren’t based on representative samples. They’re probably there because the authors took one required statistics class in grad school and learned that low p-values are good. It’s quite likely that these p-values were actually expected, if not explicitly requested, by the editors or reviewers of the article, who took a similar statistics class. And they’re completely useless.

P-values tell you whether your observation (often a mean, but not always) is based on a big enough sample that you can be 99% (or whatever) sure it’s not just the luck of the draw. If the sample is big enough, you are clear to generalize from your representative sample to the entire population. But if your sample is not representative, it doesn’t matter!

Suppose you need 100% pure Austrian pumpkin seed oil, and you tell your friend to make sure he gets only the 100% pure kind. Your friend brings you 100% pure Australian tea tree oil. They’re both oils, and they’re both 100% pure, so your friend doesn’t understand why you’re so frustrated with him. But purity is irrelevant when you’ve got the wrong oil. P-values are the same way.

So please, don’t report p-values if you don’t have a representative sample. If the editor or reviewer insists, go ahead and put it in, but please roll your eyes while you’re running your t-tests. But if you are the editor or reviewer, please stop asking people for p-values if they don’t have a representative sample! Oh, and you might want to think about asking them to collect a representative sample…

How big a sample do you need?

In my post last week I talked about the importance of representative samples for making universal statements, including averages and percentages. But how big should your sample be? You don’t need to look at everything, but you probably need to look at more than one thing. How big a sample do you need in order to be reasonably sure of your estimates?

One of the pioneers in this area was a mysterious scholar known only to the public as Student. He took that name because he had been a student of the statistician Karl Pearson, and because he was generally a modest person. After his death, he was revealed to be William Sealy Gosset, Head Brewer for the Guinness Brewery. He had published his findings (PDF) under a pseudonym so that the competing breweries would not realize the relevance of his work to brewing.

Pearson had connected sampling to probability, because for every item sampled there is a chance that it is not a good example of the population as a whole. He used the probability integral transformation, which required relatively large samples. Pearson’s preferred application was biometrics, where it was relatively easy to collect samples and get a good estimate of the probability integral.

The Guinness brewery was experimenting with different varieties of barley, looking for ones that would yield the most grain for brewing. The cost of sampling barley added up over time, and the number of samples that Pearson used would have been too expensive. Student’s t-test saved his employer money by making it easy to tell whether they had the minimum sample size that they needed for good estimates.

Both Pearson’s and Student’s methods resulted in equations and tables that allowed people to estimate the probability that the mean of their sample is inaccurate. This can be expressed as a margin of error or as a confidence interval, or as the p-value of the mean. The p-value depends on the number of items in your sample, and how much they vary from each other. The bigger the sample and the smaller the variance, the smaller the p-value. The smaller the p-value, the more confident you can be that your sample mean reflects the actual mean of the population you’re interested in, rather than the luck of the draw. For Student, a small p-value meant that the company didn’t have to go out and test more barley crops.
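
A small sketch of that relationship, with invented measurements: repeating the same values holds the sample mean and variance steady while the sample size grows, and the p-value of a one-sample t-test shrinks accordingly.

```python
from scipy import stats

# Invented measurements; repeating them just holds the mean and variance
# steady while the sample size grows, to isolate the effect of n.
base = [4.1, 3.8, 4.4, 3.9, 4.3]
hypothesized_mean = 3.5

for copies in (1, 4, 16):
    sample = base * copies
    t_stat, p_value = stats.ttest_1samp(sample, hypothesized_mean)
    print(len(sample), round(p_value, 4))
```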

Before you gather your sample, you decide how much uncertainty you’re willing to tolerate, in other words, a maximum p-value designated by α (alpha). When a sample’s p-value is lower than the α-value, it is said to be significant. One popular α-value is 0.05, but this is often decided collectively, and enforced by journal editors and thesis advisors who will not accept an article where the results don’t meet their standards for statistical significance.

The tests of significance developed by Pearson, Student, Ronald Fisher and others are hugely valuable. In science it is quite common to get false positives, where it looks like you’ve found interesting results but you just happened to sample some unusual items. Achieving statistical significance tells you that the interesting results are probably not just an accident of sampling. These tests protect the public from inaccurate data.

Like every valuable innovation, tests of statistical significance can be overused. I’ll talk about that in a future post.