You can’t get significance without a representative sample

Recently I’ve talked about the different standards for existential and universal claims, how we can use representative samples to estimate universal claims, and how we know if our representative sample is big enough to be “statistically significant.” But I want to add a word of caution to these tests: you can’t get statistical significance without a representative sample.

If you work in social science you’ve probably seen p-values reported in studies that aren’t based on representative samples. They’re probably there because the authors took one required statistics class in grad school and learned that low p-values are good. It’s quite likely that these p-values were actually expected, if not explicitly requested, by the editors or reviewers of the article, who took a similar statistics class. And they’re completely useless.

P-values tell you whether your observation (often a mean, but not always) is based on a big enough sample that you can be 99% (or whatever) sure it’s not the luck of the draw. If it is, you are clear to generalize from your representative sample to the entire population. But if your sample is not representative, it doesn’t matter!
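Here is a minimal sketch of that problem in Python. Everything in it is made up for illustration: the population of ages, the under-30 cutoff, the sample size of 1,000 and the helper name `mean_and_margin`. It uses a normal-approximation margin of error (z = 1.96) rather than a formal significance test, but the same logic applies to p-values.

```python
import random
import statistics

def mean_and_margin(sample, z=1.96):
    """Return the sample mean and an approximate 95% margin of error."""
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / len(sample) ** 0.5
    return mean, z * standard_error

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: ages spread evenly from 18 to 90.
population = [rng.uniform(18, 90) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# Representative sample: every member has the same chance of being picked.
representative = rng.sample(population, 1000)

# Unrepresentative sample: only people under 30 ever get picked.
under_30 = [age for age in population if age < 30]
biased = rng.sample(under_30, 1000)

rep_mean, rep_margin = mean_and_margin(representative)
biased_mean, biased_margin = mean_and_margin(biased)

print(f"true mean:      {true_mean:.1f}")
print(f"representative: {rep_mean:.1f} +/- {rep_margin:.1f}")
print(f"biased:         {biased_mean:.1f} +/- {biased_margin:.1f}")
```

Under these assumptions the biased sample comes out with the tighter margin of error, because people under 30 resemble each other more than the whole population does, and yet its mean sits decades away from the true one. The statistics look great and tell you nothing about the population you care about.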

Suppose you need 100% pure Austrian pumpkin seed oil, and you tell your friend to make sure he gets only the 100% pure kind. Your friend brings you 100% pure Australian tea tree oil. They’re both oils, and they’re both 100% pure, so your friend doesn’t understand why you’re so frustrated with him. But purity is irrelevant when you’ve got the wrong oil. P-values are the same way.

So please, don’t report p-values if you don’t have a representative sample. If the editor or reviewer insists, go ahead and put it in, but please roll your eyes while you’re running your t-tests. But if you are the editor or reviewer, please stop asking people for p-values if they don’t have a representative sample! Oh, and you might want to think about asking them to collect a representative sample…

Speech role models

John Murphy of Georgia State published an article about using non-native speakers, and specifically the Spanish actor Javier Bardem, as models for teaching English as a Second Language (ESL) or as a foreign language (EFL). Mura Nava tweeted a blog post from Robin Walker connecting Murphy’s work to similar work by Kenworthy and Jenkins, Peter Roach and others. I tried something like this when I taught ESL back in 2010, more or less unaware of all the previous work that Murphy cites, and Mura Nava was interested to know how it went, so here’s the first part of a quick write-up.

When I was asked to teach an ESL Speech class, “Advanced Oral/Aural Communication,” at Saint John’s University in the fall of 2010, I had taught French and Linguistics, but I had only tutored English one-on-one. My wife is an experienced professor of ESL and was a valuable source of advice, but our student populations and our goals were different, so I did not simply copy her methods.

One concept that I introduced was that of a Speech Role Model. When I was learning French, I found it invaluable to imitate entertainers; I’ve never met Jacques Dutronc, but I often say that he was one of my best French teachers because of the clever lyricists he worked with and his clear, wry delivery. He was just one of the many French people that I imitated to improve my pronunciation.

This was all back in the days of television and cassettes, and most of the French culture that we had access to here in the United States was filtered through the wine, Proust and Rohmer tastes of American Francophiles. As a geeky kid with a fondness for comedy I found Edith Piaf and even Gérard Depardieu too alien to emulate. I found out about Dutronc in college through a bootleg tape made for me by a student from France who lived down the hall, and then I had to study abroad in France to find more role models.

With today’s multimedia Internet technology, we have an incredible ability to listen to millions of people from around the world. At Saint John’s I asked my students to choose a Speech Role Model for English: a native speaker that they personally admired and wanted to sound like. I was surprised by the number of students who named President Obama as their role model, including female students from China, but on reflection it was an obvious choice, as he is a clear, forceful and eloquent speaker. Other students chose actresses Meryl Streep and Jennifer Aniston, talk-show host Bill O’Reilly and local newscaster Pat Kiernan.

One notable choice, hip-hop artist Eminem, gave me the opportunity to discuss covert prestige and its challenges. Another, the character of Sheldon Cooper from the television series “The Big Bang Theory,” was too scripted, and I was debating whether to accept it when I discovered that it was just a cover so that the student could plagiarize crowdsourced transcriptions.

In subsequent assignments I asked the students to find a YouTube video of their role model and to transcribe a short excerpt. I then asked the students to record themselves imitating that excerpt from their Speech Role Models. Some of the students were engaged and interested, but others seemed frustrated and discouraged. When I listened to my students and compared their speech to that of their chosen role models, I had an idea why. The students who were engaged were either naturally enthusiastic or good mimics, but the challenge was to motivate the others. There was so much distance between them and the native English speakers, much more than could be covered in a semester. That was when I thought of adding a non-native Second Speech Role Model. I’ll have to leave that for another post.

How big a sample do you need?

In my post last week I talked about the importance of representative samples for making universal statements, including averages and percentages. But how big should your sample be? You don’t need to look at everything, but you probably need to look at more than one thing. How big a sample do you need in order to be reasonably sure of your estimates?

One of the pioneers in this area was a mysterious scholar known only to the public as Student. He took that name because he had been a student of the statistician Karl Pearson, and because he was generally a modest person. After his death, he was revealed to be William Sealy Gosset, Head Brewer for the Guinness Brewery. He had published his findings (PDF) under a pseudonym so that the competing breweries would not realize the relevance of his work to brewing.

Pearson had connected sampling to probability, because for every item sampled there is a chance that it is not a good example of the population as a whole. He used the probability integral transformation, which required relatively large samples. Pearson’s preferred application was biometrics, where it was relatively easy to collect samples and get a good estimate of the probability integral.

The Guinness brewery was experimenting with different varieties of barley, looking for ones that would yield the most grain for brewing. The cost of sampling barley added up over time, and the number of samples that Pearson used would have been too expensive. Student’s t-test saved his employer money by making it easy to tell whether they had the minimum sample size that they needed for good estimates.

Both Pearson’s and Student’s methods resulted in equations and tables that allowed people to estimate the probability that the mean of their sample is inaccurate. This can be expressed as a margin of error or as a confidence interval, or as the p-value of the mean. The p-value depends on the number of items in your sample, and how much they vary from each other. The bigger the sample and the smaller the variance, the smaller the p-value. The smaller the p-value, the more likely it is that your sample mean is close to the actual mean of the population you’re interested in. For Student, a small p-value meant that the company didn’t have to go out and test more barley crops.
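A rough sketch in Python shows how sample size and variation drive the margin of error. The barley yields are invented for illustration, the function name `margin_of_error` is my own, and the multiplier 1.96 comes from the normal approximation; Student’s t would use a somewhat larger multiplier for samples this small.

```python
import math
import statistics

def margin_of_error(sample, z=1.96):
    """Approximate 95% margin of error for the mean of `sample`.

    Normal approximation: z times the standard error of the mean.
    """
    return z * statistics.stdev(sample) / math.sqrt(len(sample))

# Hypothetical barley yields, invented for illustration.
small_sample = [48.0, 52.0, 50.0, 47.0, 53.0]
# The same values repeated four times: same spread, more items.
large_sample = small_sample * 4
# Same number of items as small_sample, but much more variation.
noisy_sample = [30.0, 70.0, 50.0, 35.0, 65.0]

for name, sample in [("small", small_sample),
                     ("large", large_sample),
                     ("noisy", noisy_sample)]:
    print(f"{name}: mean {statistics.fmean(sample):.1f} "
          f"+/- {margin_of_error(sample):.2f}")
```

All three samples have the same mean, but the larger sample gets a narrower margin and the noisier one a much wider margin.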

Before you gather your sample, you decide how much uncertainty you’re willing to tolerate, in other words, a maximum p-value designated by α (alpha). When a sample’s p-value is lower than the α-value, it is said to be significant. One popular α-value is 0.05, but this is often decided collectively, and enforced by journal editors and thesis advisors who will not accept an article where the results don’t meet their standards for statistical significance.
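One common way to make this concrete is to test the sample mean against a reference value fixed in advance. The sketch below is only an illustration under assumptions of my own: the measurements and the reference values are invented, the function names are hypothetical, and the p-value uses a normal approximation instead of Student’s t distribution.

```python
import math
import statistics
from statistics import NormalDist

ALPHA = 0.05  # tolerated uncertainty, chosen before looking at the data

def two_sided_p_value(sample, null_mean):
    """Approximate two-sided p-value for 'the population mean is null_mean'."""
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    z = (statistics.fmean(sample) - null_mean) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def is_significant(sample, null_mean, alpha=ALPHA):
    """True when the p-value falls below the alpha chosen in advance."""
    return two_sided_p_value(sample, null_mean) < alpha

# Invented measurements, for illustration only.
sample = [5.1, 4.9, 5.3, 5.2, 5.0, 5.4, 5.1, 5.2, 4.8, 5.3, 5.0, 5.1]

for reference in (5.0, 5.1):
    p = two_sided_p_value(sample, reference)
    print(f"reference {reference}: p = {p:.3f}, "
          f"significant = {is_significant(sample, reference)}")
```

With these numbers the difference from 5.0 clears the 0.05 threshold and the difference from 5.1 does not. Note that α is set at the top, before any data is examined.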

The tests of significance determined by Pearson, Student, Ronald Fisher and others are hugely valuable. In science it is quite common to get false positives, where it looks like you’ve found interesting results but you just happened to sample some unusual items. Achieving statistical significance tells you that the interesting results are probably not just an accident of sampling. These tests protect the public from inaccurate data.

Like every valuable innovation, tests of statistical significance can be overused. I’ll talk about that in a future post.

A tool for annotating corpora

My dissertation focused on the evolution of negation in French, and I’ve continued to study this change. In order to track the way that negation was used, I needed to collect a corpus of texts and annotate them. I developed a MySQL database to store the annotations (and later the texts themselves) and a suite of PHP scripts to annotate the texts and store them in the database. I then developed another suite of PHP scripts to query the database and tabulate the data in a form that could be imported into Microsoft Excel or a more specialized statistics package like SPSS.

I am continuing to develop these scripts. Since I finished my dissertation, I added the ability to load the entire text into the database, and revamped the front end with AJAX to streamline the workflow. The new front end actually works pretty well on a tablet and even a smartphone when there’s a stable internet connection, but I’d like to add the ability to annotate offline, on a workstation or a mobile device. I also need to redo the scripts that query the database and generate reports. Here’s what the annotation screen currently looks like:

I’ve put many hours of work into this annotation system, and it works so well for me that it’s a shame I’m the only one who uses it. It would take some work to adapt it for other projects, but I’m interested in doing that. If you think this system might work for your project, please let me know and I’ll give you a closer look.

Estimating universals, averages and percentages

In my previous post, I discussed the differences between existential and universal statements. In particular, the standard of evidence is different: to be sure that an existential statement is correct we only need to see one example, but to be sure a universal is correct we have to have examined everything.

But what if we don’t have the time to examine everything, and we don’t have to be absolutely sure? As it turns out, a lot of times we can be pretty sure. We just need a representative sample of everything. It’s quicker than examining every member of your population, and it may even be more accurate, since there are always measurement errors, and measuring a lot of things increases the chance of an error.

Pierre-Simon Laplace figured that out for the French Empire. In the early nineteenth century, Napoleon had conquered half of Europe, but he didn’t have a good idea how many subjects he had. Based on the work of Thomas Bayes, Laplace knew that a relatively small sample of data would give him a good estimate. He also figured out that he needed the sample to be representative to get a good estimate.

“The most precise method consists of (1) choosing districts distributed in a roughly uniform manner throughout the Empire, in order to generalize the result, independently of local circumstances,” wrote Laplace in 1814. If you didn’t have a uniform distribution, you might wind up getting all your data from mountainous districts and underestimating the population, or getting data from urban districts and overestimating. Another way to avoid basing your generalizations on unrepresentative data is random sampling.
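The difference is easy to simulate. This sketch is not Laplace’s actual calculation; the empire of 1,000 districts, their population ranges and the function name `estimate_total` are all invented for illustration. It scales up the average district population from three different samples of districts.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical empire: mountain districts are sparse, urban ones dense.
districts = (
    [("mountain", rng.randint(1_000, 5_000)) for _ in range(300)]
    + [("rural", rng.randint(5_000, 20_000)) for _ in range(600)]
    + [("urban", rng.randint(50_000, 200_000)) for _ in range(100)]
)
true_total = sum(pop for _, pop in districts)

def estimate_total(sampled, n_districts):
    """Scale the sample's mean district population up to all districts."""
    mean_pop = statistics.fmean([pop for _, pop in sampled])
    return mean_pop * n_districts

# 100 districts drawn at random from the whole empire.
random_sample = rng.sample(districts, 100)
# 100 districts drawn from one kind of terrain only.
mountain_only = [d for d in districts if d[0] == "mountain"][:100]
urban_only = [d for d in districts if d[0] == "urban"]

n = len(districts)
random_estimate = estimate_total(random_sample, n)
mountain_estimate = estimate_total(mountain_only, n)
urban_estimate = estimate_total(urban_only, n)

print(f"true total:         {true_total:,}")
print(f"random sample:      {random_estimate:,.0f}")
print(f"mountain districts: {mountain_estimate:,.0f}")
print(f"urban districts:    {urban_estimate:,.0f}")
```

All three samples contain 100 districts, so sample size is not what separates them. The mountain-only estimate is guaranteed to undershoot and the urban-only estimate to overshoot, while the random sample lands much closer to the true total.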

"Is Our Face Red!" says the Literary DigestA lot of social scientists, including linguists, understand the value of sampling. But many of them don’t understand that it’s representative sampling that has value. Unrepresentative samples are worse than no samples, because they can give you a false sense of certainty.

A famous example mentioned in Wikipedia is when a Literary Digest poll forecast that Alfred M. Landon would defeat Franklin Delano Roosevelt in the 1936 Presidential election. That poll was biased because the sample was taken from lists of people who owned telephones and automobiles, and those people were not representative of the voters overall. The editors of the Literary Digest were not justified in generalizing those universal statements to the electorate as a whole, and thus failed to predict Roosevelt’s re-election.

"Average Italian Female" by Colin Spears

What can be deceiving is that you get things that look like averages and percentages. And they are averages and percentages! But they’re not necessarily averages of the things you want an average of. A striking example comes from a blogger named Colin Spears, who was intrigued by a “facial averaging” site set up by some researchers at the University of Aberdeen (they’ve since moved to Glasgow). Spears uploaded pictures from 41 groups, including “Chad and Cameroonian” and created “averages.” These pictures were picked up by a number of websites, stripped of their credits, and bundled with all kinds of misleading and inaccurate information, as detailed by Lisa De Bruine, one of the creators of the software used by Spears.

Some bloggers, like Jezebel’s Margaret Hartmann, noted that the “averages” all looked to be around twenty years old, which is not the median age for most countries according to the CIA World Fact Book (which presumably relies on better samples). In fact, the median age for Italian women (see image) is 45.6. The average look of the image is in the twenties, because that’s the age of the images that Spears uploaded to the Aberdeen site. So we got averages of some Italian women, but nothing that actually represents the average (of all) Italian women. (Some blog posts about this even showed a very light-skinned face for “Average South African Woman,” but that was just a mislabeled “Average Argentine Woman.”)

Keep this in mind the next time you see an average or a percentage. What was their sampling method? If it wasn’t uniform or random, it’s not an average or percentage of anything meaningful. If you trust it, you may wind up spreading inaccuracies, like a prediction for President Landon or a twentysomething average Italian woman. And won’t your face be red!