African American English has accents too

Diversity is notoriously subjective and difficult to pin down. In particular, we tend to be impressed if we know the names of a lot of categories for something. We might think there are more mammal species than insect species, but biologists tell us that there are hundreds of thousands of species of beetles alone. This is true in language as well: we think of the closely-related Romance and Germanic languages as separate, while missing the incredible diversity of “dialects” of Chinese or Arabic.

This is also true of English. As an undergraduate I was taught that there were four dialects in American English: New England, North Midland, South Midland and Coastal Southern. Oh yeah, and New York and Black English. The picture for all of those is more complicated than it sounds, and when I went to Chicago I discovered that there are regional varieties of African American English.

In 2012 Annie Minoff, a blogger for Chicago public radio station WBEZ, took this oversimplification for truth: “AAE is remarkable for being consistent across urban areas; that is, Boston AAE sounds like New York AAE sounds like L.A. AAE, etc.” Fortunately a commenter, Amanda Hope, challenged her on that assertion. Minoff confirmed in an interview with variationist Walt Wolfram that AAE does vary by region, and posted a correction in 2013.

In 2013 I was preparing to teach a unit on language variation and didn’t want to leave my students as misinformed as I – or Minoff – had been. Many of my students were African American, and I saw no reason to spend most of the unit on white varieties and leave African American English as a footnote. But the documentation is spotty: I know of no good undergraduate-level discussion of variation in African American English.

A few years before I had found a video that some guy took of a party in a parking lot on the West Side of Chicago. It wasn’t ideal, but it sort of gave you an idea. The link was dead, so I typed “Chicago West Side” into Google. The results were not promising, so on a whim I added “accent” and that’s how I found my first accent tag video.

Accent tag videos are an amazing thing, and I could write a whole series of posts about them. Here was a young black woman from Chicago’s West Side, not only talking about her accent but illustrating it, with words and phrases to highlight its differences from other dialects. She even talks (as many people do in these videos) about how other African Americans hear her accent in other places, like North Carolina. You can compare it (as I did in class) with a similar video made by a young black woman from Raleigh (or New York or California), and the differences are impossible to ignore.

In fact, when Amanda Hope challenged Minoff’s received wisdom on African American regional variation, she used accent tag videos to illustrate her point. These videos are amazing, particularly for teaching about language and linguistics, and from then on I made extensive use of them in my courses. There’s also a video made by two adorable young English women, one from London and one from Bolton near Manchester, where you can hear their accents contrasted in conversation. I like that I can go not just around the country but around the world (Nigeria, Trinidad, Jamaica) illustrating the diversity of English just among women of African descent, who often go unheard in these discussions. I’ll talk more about accent tag videos in future posts.

You can also find evidence of regional variation in African American English on Twitter. Taylor Jones has a great post about it that also goes into the history of African American varieties of English.

Is your face red?

In 1936, Literary Digest magazine made completely wrong predictions about the Presidential election. They did this because they polled based on a bad sample: automobile registrations, telephone directories and subscriptions to their own magazine. Enough people who didn’t own a car or a telephone or subscribe to Literary Digest voted, and they voted for Roosevelt. The magazine’s editors’ faces were red, and they had the humility to put that on the cover.

This year, the 538 website made completely wrong predictions about the Presidential election, and its editor, Nate Silver, sorta kinda took responsibility. He had put too much trust in polls conducted at the state level. They were not representative of the full spectrum of voter opinion in those states, and this had skewed his predictions.

Silver’s face should be redder than that, because he said that his conclusions were tentative, but he did not act like it. When your results are so unreliable and your data is so problematic, you have no business being on television and in front-page news articles as much as Silver has.

In part this attitude of Silver’s comes from the worldview of sports betting, where the gamblers know they want to bet and the only question is which team they should put their money on. There is some hedging, but not much. Democracy is not a gamble, and people need to be prepared for all outcomes.

But the practice of blithely making grandiose claims based on unrepresentative data, while mouthing insincere disclaimers, goes far beyond election polling. It is widespread in the social sciences, and I see it all the time in linguistics and transgender studies. It is pervasive in the relatively new field of Data Science, and Big Data is frequently Non-representative Data.

At the 2005 meeting of the American Association for Corpus Linguistics there were two sets of interactions that stuck with me and have informed my thinking over the years. The first was a plenary talk by the computer scientist Ken Church. He described in vivid terms the coming era of cheap storage and bandwidth, and the resulting big data boom.

But Church went awry when he claimed that the size of the datasets available, and the computational power to analyze them, would obviate the need for representative samples. It is true that if you can analyze everything you do not need a sample. But that’s not the whole story.

A day before Church’s talk I had had a conversation over lunch with David Lee, who had just written his dissertation on the sampling problems in the British National Corpus. Lee had reiterated what I had learned in statistics class: if you have most of the data but what you have is incomplete in non-random ways, you have a biased sample and you can’t make generalizations about the whole.
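To make Lee's point concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the two voter groups, their support rates, the 40% coverage of group B, and the function name `compare_samples` are all assumptions, not anything from Lee's dissertation or Church's talk. It compares an estimate from a dataset covering roughly 70% of a population, but missing people in a non-random way, with an estimate from a much smaller simple random sample.

```python
import random


def compare_samples(seed=0, population_size=100_000, random_sample_size=2_000):
    """Compare a large non-random sample with a small random one."""
    rng = random.Random(seed)

    # Hypothetical population: about half the voters are in group A (30%
    # support the candidate) and half in group B (70% support), so true
    # support is close to 50%.
    population = []
    for _ in range(population_size):
        in_group_a = rng.random() < 0.5
        supports = rng.random() < (0.30 if in_group_a else 0.70)
        population.append((in_group_a, supports))
    true_rate = sum(s for _, s in population) / len(population)

    # "Most of the data": everyone in group A, but only about 40% of
    # group B. Roughly 70% of the population, incomplete in a non-random way.
    biased = [s for in_a, s in population if in_a or rng.random() < 0.40]
    biased_rate = sum(biased) / len(biased)

    # A far smaller sample in which every voter is equally likely to appear.
    srs = [s for _, s in rng.sample(population, random_sample_size)]
    random_rate = sum(srs) / len(srs)

    return {
        "true_rate": true_rate,
        "biased_rate": biased_rate,
        "biased_size": len(biased),
        "random_rate": random_rate,
        "random_size": len(srs),
    }


if __name__ == "__main__":
    result = compare_samples()
    for name, value in result.items():
        print(f"{name}: {value}")
```

Under these made-up numbers the big dataset should put support somewhere around 41 percent, well below the true figure of about 50 percent, while the small random sample should usually land within a few points of it. Making the biased dataset bigger does not help unless the missing part of group B gets filled in.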

I’ve seen this a lot in the burgeoning field of Data Science. There are too many people performing analyses they don’t understand on data that’s not representative, making unfounded generalizations. As long as these generalizations fit within the accepted narratives, nobody looks twice.

We need to stop making it easier to run through the steps of data analysis, and instead make it easier to get those steps right. Especially sampling. Or our faces are going to be red all the time.