How big a sample do you need?

In my post last week I talked about the importance of representative samples for making universal statements, including averages and percentages. But how big should your sample be? You don’t need to look at everything, but you probably need to look at more than one thing. How big a sample do you need in order to be reasonably sure of your estimates?

One of the pioneers in this area was a mysterious scholar known only to the public as Student. He took that name because he had been a student of the statistician Karl Pearson, and because he was generally a modest person. After his death, he was revealed to be William Sealy Gosset, Head Brewer for the Guinness Brewery. He had published his findings under a pseudonym so that the competing breweries would not realize the relevance of his work to brewing.

Pearson had connected sampling to probability, because for every item sampled there is a chance that it is not a good example of the population as a whole. He used the probability integral transformation, which required relatively large samples. Pearson’s preferred application was biometrics, where it was relatively easy to collect samples and get a good estimate of the probability integral.

The Guinness brewery was experimenting with different varieties of barley, looking for ones that would yield the most grain for brewing. The cost of sampling barley added up over time, and the number of samples that Pearson’s methods required would have been too expensive. Student’s t-test saved his employer money by making it possible to draw reliable conclusions from small samples, so the brewery could tell when it had the minimum sample size it needed for good estimates.

Both Pearson’s and Student’s methods resulted in equations and tables that allowed people to estimate how likely it is that their sample mean is misleading. This uncertainty can be expressed as a margin of error, as a confidence interval, or as a p-value: the probability of getting a sample mean at least as extreme as yours purely by chance. The p-value depends on the number of items in your sample and on how much they vary from each other: the bigger the sample and the smaller the variance, the smaller the p-value. The smaller the p-value, the more confident you can be that your sample mean reflects the actual mean of the population you’re interested in. For Student, a small p-value meant that the company didn’t have to go out and test more barley crops.

Before you gather your sample, you decide how much uncertainty you’re willing to tolerate: in other words, a maximum p-value, designated by α (alpha). When a sample’s p-value is lower than the α-value, the result is said to be statistically significant. One popular α-value is 0.05, but this threshold is often decided collectively, and enforced by journal editors and thesis advisors who will not accept an article where the results don’t meet their standards for statistical significance.
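
The arithmetic behind Student’s test is simple enough to fit in a few lines. Here’s a minimal sketch in Python: the barley-yield numbers and the baseline of 75 are invented for illustration, and the critical value 2.262 is the standard two-tailed Student’s t cutoff for 9 degrees of freedom at α = 0.05.

```python
import math

# A minimal sketch of Student's one-sample t-test.  The yields,
# the baseline of 75, and alpha are all invented for illustration.
alpha = 0.05                                          # maximum tolerated p-value
yields = [74, 79, 81, 76, 80, 77, 82, 75, 78, 83]     # hypothetical bushels/acre

n = len(yields)
mean = sum(yields) / n
# Sample variance with Bessel's correction (divide by n - 1)
variance = sum((x - mean) ** 2 for x in yields) / (n - 1)
std_err = math.sqrt(variance / n)

# t statistic for the null hypothesis "the true mean yield is 75"
t = (mean - 75) / std_err

# Critical value of Student's t for 9 degrees of freedom, two-tailed,
# alpha = 0.05 (from a standard t table)
t_crit = 2.262
significant = abs(t) > t_crit

print(f"mean = {mean:.1f}, t = {t:.2f}, significant: {significant}")
```

Notice how both factors mentioned above enter the formula: a bigger n shrinks the standard error, and so does a smaller variance, which pushes t up and the p-value down.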

The tests of significance determined by Pearson, Student, Ronald Fisher and others are hugely valuable. In science it is quite common to get false positives, where it looks like you’ve found interesting results but you just happened to sample some unusual items. Achieving statistical significance tells you that the interesting results are probably not just an accident of sampling. These tests protect the public from inaccurate data.

Like every valuable innovation, tests of statistical significance can be overused. I’ll talk about that in a future post.

Estimating universals, averages and percentages

In my previous post, I discussed the differences between existential and universal statements. In particular, the standard of evidence is different: to be sure that an existential statement is correct we only need to see one example, but to be sure a universal is correct we have to have examined everything.

But what if we don’t have the time to examine everything, and we don’t have to be absolutely sure? As it turns out, a lot of times we can be pretty sure. We just need a representative sample of everything. It’s quicker than examining every member of your population, and it may even be more accurate, since there are always measurement errors, and measuring a lot of things increases the chance of an error.

Pierre-Simon Laplace figured that out for the French Empire. In the early nineteenth century, Napoleon had conquered half of Europe, but he didn’t have a good idea of how many subjects he had. Based on the work of Thomas Bayes, Laplace knew that a relatively small sample of data would give him a good estimate. He also figured out that the sample needed to be representative to get a good estimate.

“The most precise method consists of (1) choosing districts distributed in a roughly uniform manner throughout the Empire, in order to generalize the result, independently of local circumstances,” wrote Laplace in 1814. If you didn’t have a uniform distribution, you might wind up getting all your data from mountainous districts and underestimating the population, or getting data from urban districts and overestimating. Another way to avoid basing your generalizations on unrepresentative data is random sampling.
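
To make the idea concrete, here is a small Python sketch of Laplace-style estimation. The population of districts here is entirely made up; only the method, drawing a uniform random sample and generalizing from its mean, is the point.

```python
import random

# A sketch of Laplace-style estimation on an invented population of
# districts; the numbers are made up, and only the method (uniform
# random sampling) is the point.
random.seed(42)
population = [random.randint(500, 5000) for _ in range(10_000)]
true_mean = sum(population) / len(population)

# Sample 500 districts at random and use their mean for the whole Empire.
sample = random.sample(population, 500)
estimate = sum(sample) / len(sample)

print(f"true mean district size: {true_mean:.0f}, estimate from 500: {estimate:.0f}")
```

Because every district has an equal chance of being drawn, the sample can’t systematically favor mountainous or urban districts, which is exactly the failure mode Laplace was guarding against.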

"Is Our Face Red!" says the Literary DigestA lot of social scientists, including linguists, understand the value of sampling. But many of them don’t understand that it’s representative sampling that has value. Unrepresentative samples are worse than no samples, because they can give you a false sense of certainty.

A famous example mentioned in Wikipedia is the Literary Digest poll that forecast Alfred M. Landon would defeat Franklin Delano Roosevelt in the 1936 Presidential election. That poll was biased because the sample was taken from lists of people who owned telephones and automobiles, and those people were not representative of the voters overall. The editors of the Literary Digest were not justified in generalizing from that sample to the electorate as a whole, and so they failed to predict Roosevelt’s re-election.

"Average Italian Female" by Colin Spears

“Average Italian Female” by Colin Spears

What can be deceiving is that you get things that look like averages and percentages. And they are averages and percentages! But they’re not necessarily averages of the things you want an average of. A striking example comes from a blogger named Colin Spears, who was intrigued by a “facial averaging” site set up by some researchers at the University of Aberdeen (they’ve since moved to Glasgow). Spears uploaded pictures from 41 groups, including “Chad and Cameroonian” and created “averages.” These pictures were picked up by a number of websites, stripped of their credits, and bundled with all kinds of misleading and inaccurate information, as detailed by Lisa De Bruine, one of the creators of the software used by Spears.

Some bloggers, like Jezebel’s Margaret Hartmann, noted that the “averages” all looked to be around twenty years old, which is well below the median age for most countries according to the CIA World Fact Book (which presumably relies on better samples). In fact, the median age for Italian women (see image) is 45.6. The image looks twentysomething because that’s roughly the age of the women in the pictures that Spears uploaded to the Aberdeen site. So we got averages of some Italian women, but nothing that actually represents the average of all Italian women. (Some blog posts about this even showed a very light-skinned face for “Average South African Woman,” but that was just a mislabeled “Average Argentine Woman.”)

Keep this in mind the next time you see an average or a percentage. What was their sampling method? If it wasn’t uniform or random, it’s not an average or percentage of anything meaningful. If you trust it, you may wind up spreading inaccuracies, like a prediction for President Landon or a twentysomething average Italian woman. And won’t your face be red!

Black swans exist

Existentials and universals

There’s a famous story about swans that Nassim Taleb used for the title of his recent book. European zoologists had seen swans, and all the swans they had seen had white feathers, so they said that all the swans in the world were in fact white. Then a European went to Australia and saw swans with black feathers. Taleb’s point is that no matter how much we know, we don’t know what we don’t know, and overconfidence in our knowledge can make us rigid and vulnerable.

Sentences like “There are swans,” or “there are black swans” are what logicians call existential statements. “All swans are white” is a universal statement. In science, the two are very different and require different kinds of evidence.

All it takes to make an existential statement is a single observation. I know that there are swans because I’ve seen at least one. But to say that all swans are white with certainty requires us to have seen every swan. The European zoologists made that universal statement without universal observations, and all it took was one observation of a black swan to prove them wrong.

Now here’s a point that seems to be missed: averages are universal statements. Actually, they’re made with some relatively simple arithmetic applied to data from all members of a category, so they entail universal statements. If the Regal Swan Foundation says “Mute Swans’ wingspans average 78 inches,” we take that to mean someone measured all the mute swans (or a representative sample; I’ll get into that in another post). If someone says that and then you find out they only measured swans in their local park, you’d feel deceived, wouldn’t you?
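
A toy calculation shows how a biased sample produces a perfectly real average of the wrong thing. The wingspan numbers below are invented for illustration:

```python
# Invented wingspan numbers to illustrate the point: an average computed
# from a biased subset is still an average, just not of the population
# you claimed to describe.
park_swans  = [70, 71, 72, 73]          # hypothetical swans in one local park
other_swans = [80, 82, 84, 86, 88]      # hypothetical swans everywhere else
all_swans = park_swans + other_swans

park_avg = sum(park_swans) / len(park_swans)    # 71.5
true_avg = sum(all_swans) / len(all_swans)      # about 78.4

print(f"park-only average: {park_avg:.1f}, all-swans average: {true_avg:.1f}")
```

The park-only number is a genuine arithmetic mean; it just isn’t the mean of “all swans,” which is what the universal statement claims.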

Here’s another one: percentages are universal statements. How can it be a universal statement to say that ten percent of people are left-handed? Well, you have to look at all of them to know for sure that the rest are not. And the same thing goes for “most,” “the majority” and other words that entail statements about fractions. I think this is why Robyn Hitchcock said, “the silent majority is the crime of the century”: if the silent majority really is silent, there’s no way for the person making the claim to know what they think.

On the other hand, statements about units and fractions of units are only existential statements. I have two cats, and I ate half a sausage roll for dinner last night. Words like “many,” “some” and “a few” also make existential statements. I’ve seen many white swans, but I claim no knowledge of the ones I haven’t seen. The only way you could falsify that statement is by process of elimination: by showing that there aren’t enough swans in my experience to count as “many.”

So the bottom line is that in order to make a universal statement, including a percentage or an average, you need to have looked at all the members of the group. But in order to make an existential statement, you only need one.

Now here’s why I’m writing this post: this stuff sounds simple, and I feel like I’m writing it for kids, but there are a lot of people who don’t follow it. Either they’ve never been taught about the standards of evidence for universals, or they’ve been taught to ignore them. Most of the time when I’ve read “all,” people seem to get that it’s a universal, and they don’t say it unless they can reasonably claim to have data from all of the things concerned. But a lot of people have trouble with “most” and averages, and especially percentages. You can’t say “most” unless you’ve seen them all.

Well, there actually is a way you can say “most” without seeing them all. It’s called induction, but it’s hard to do, harder than a lot of people seem to think. I’ll talk about that in a future post.

The author, posing in an existential Black Swan T-shirt

(Update, August 19, 2015: I liked the black swan drawing above so much I decided that it should exist on T-shirts. And you too can call into existence T-shirts, throw pillows, travel mugs – all with the Existential Black Swan on them! I’ll get a cut of the money. Click here for details.)

Ernst Mach

I’m an instrumentalist. Are you one too?

Over the past few years I’ve realized that there are a lot of scientists who have a different view of science than I do, and most of them don’t even know about my way of thinking. But my way of thinking about science – Instrumentalism – is cool! I’m writing this post to explain what Instrumentalism is, and why I prefer it to other ways of thinking about science. At the very least I can link back to this from future posts so that you’ll understand why I say certain things. Maybe you’ll agree with me that Instrumentalism is cool. Maybe you’re already an Instrumentalist and you didn’t even know it!

Instrumentalism is the idea that scientific theories can never be proven wrong or right. Instead, theories are tools for understanding and prediction. They can be judged as better or worse tools, but that depends entirely on the context: what’s being explained to who and what’s being predicted, under what circumstances. Scientific models have the same status. In one of the most famous cases, the movement of the sun relative to the Earth, neither Ptolemy’s geocentric model nor Copernicus’s heliocentric model would be considered “true” or “false.”

This view of theories does not mean that there is no truth or falsity in science. Observations can still be accurate or inaccurate. And critically, hypotheses can be confirmed or rejected. These hypotheses are usually based on a theory, and a theory that predicts a lot of falsified hypotheses is not a very useful theory. So the heliocentric model is more useful than the simple geocentric model for predicting the movements of planets because those predictions are more often correct.

On the other hand, the geocentric model must still be useful, because most people continue to use it every day. If you’ve ever talked about the sun “rising,” you’ve used a geocentric model. It’s a lot easier than talking about part of the earth’s surface rotating away from the direct rays of the sun. The geocentric model’s predictions about the sun’s behavior are perfectly adequate for day to day human activity.

Since theories are tools for understanding, they are more useful if they are based on a simple analogy to something familiar. The geocentric model compares the sun to flying birds and jumping horses, or to spheres and hoops. In order to explain the apparent “retrograde motion” of the planets, astronomers added the ugly and counterintuitive “epicycles” to the geocentric model. But the sun does not exhibit retrograde motion, so there are no epicycles to spoil its simplicity.

This means that an astronomer (or any of us watching a movie about space) will likely use both the heliocentric model and the geocentric model on the same day, or even in the same hour. In a view of science which says theories are true or false, what does it say about someone that they use two different theories to model the same phenomenon on the same day? Maybe that person is a hypocrite, or even worse, weak-minded for not having the will to consistently apply the true model.

In contrast, instrumentalism is pluralist, tolerant and understanding. Of course sometimes we all act like the sun goes around the earth. It’s the simplest, most straightforward way to think about it!