How Google’s Pixel Buds will change the world!

Scene: a quietly bustling bistro in Paris’s 14th Arrondissement.

SERVER: Oui, vous désirez?
PIXELBUDS: Yes, you desire?
TOURIST: Um, yeah, I’ll have the steak frites.
SERVER: Que les frites?
PIXELBUDS: Than fries?
TOURIST: No, at the same time.
SERVER: Alors, vous voulez le steak aussi?
PIXELBUDS: You want the steak too?
TOURIST: Yeah, I just ordered the steak.
SERVER: Okay, du steak, et des frites, en même temps.
PIXELBUDS: Okay, steak, and fries at the same time.
TOURIST: You got it.

(All translations by Google Translate. Photo: Alain Bachelier / Flickr.)

The Digital Parisian Stage is now on GitHub

For the past five years I’ve been working on a project, the Digital Parisian Stage, that aims to create a representative sample of nineteenth-century Parisian theater. I’ve made really satisfying progress on the first stage, 1800 through 1815, which corresponds to the first volume of Charles Beaumont Wicks’s catalog, the Parisian Stage (1950). Of the initial one-percent sample (31 plays), I have obtained 24, annotated 15 and discarded three for length, for a current total of twelve plays.

At conferences like the Keystone Digital Humanities Conference and the American Association for Corpus Linguistics, I’ve presented results showing that these twelve plays cover a much wider and more innovative range of language than the four theatrical plays from this period in the FRANTEXT corpus, a sample drawn fifty years ago based on a “principle of authority.”

Just looking at declarative sentence negation, I found that in the FRANTEXT corpus the playwrights negate declarative sentences with the ne … pas construction 49 percent of the time. In the twelve randomly sampled plays, the playwrights used ne … pas 75 percent of the time to negate declarative sentences. Because this was a representative sample, I even have a p value below 0.01, based on a chi-square goodness of fit test!
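For readers who want to try this kind of test on their own counts, here is a minimal sketch of a two-category chi-square goodness of fit test. The 49 percent reference share is the FRANTEXT figure quoted above; the raw counts (150 of 200 sentences) are invented purely for illustration and are not the counts from my sample, and the function name is hypothetical. It uses only the Python standard library.

```python
import math

def chi_square_gof_two_categories(observed, expected_proportions):
    """Chi-square goodness of fit for exactly two categories (1 degree of freedom)."""
    total = sum(observed)
    expected = [total * p for p in expected_proportions]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, the upper-tail probability of the chi-square
    # distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# HYPOTHETICAL counts: 150 negations with "ne ... pas" and 50 with anything else.
observed = [150, 50]
# Reference proportions: 49% "ne ... pas" in the comparison corpus, 51% other.
reference = [0.49, 0.51]

chi2, p_value = chi_square_gof_two_categories(observed, reference)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3g}")
```

With more than two categories the closed form for the p value no longer applies, and a statistics package would be the safer choice.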

This seems like a good point to release the twelve texts that I have OCRed and cleaned to the public. I have uploaded them to GitHub as HTML files. In this I have been partly inspired by the work of Alex Gil, now my colleague at Columbia University.

You can read them for your own entertainment (Jocrisse-maître et Jocrisse-valet is my favorite), stage your own production of them (I’ll buy tickets!) or use them as data for your scientific investigations. I hope that you will also consider contributing to the repository, by checking for errors in the existing texts, adding new texts from the catalog, or converting them to a different format like TEI or Markdown.

If you do use them in your own studies, please don’t forget to cite me along the lines given below, or even to contact me to discuss co-authorship!

Grieve-Smith, Angus B. (2016). The Digital Parisian Stage Corpus. GitHub.

“Said” for 2016 Word of the Year

I just got back from the American Association for Corpus Linguistics conference in Ames, Iowa, and I’m calling the Word of the Year: for 2016 it will be said.

You may think you know said. It’s the past participle of say. You’ve said it yourself many times. What’s so special about it?

What’s special was revealed by Jordan Smith, a graduate student at Iowa State, in his presentation on Saturday afternoon. said is becoming a determiner. It is grammaticizing.

In addition to its participial use (“once the words were said”) you’ve probably seen said used as an attributive adjective (“the said property”). It indicates that the noun it modifies refers to a person, place or thing that has been mentioned recently, with the same noun, and that the speaker/writer expects it to be active in the hearer/reader’s memory.

Attributive said is strongly associated with legal documents, as in its first recorded use in the English Parliament in 1327. The Oxford English Dictionary reports that said was used outside of legal contexts as early as 1973, in the English sitcom Steptoe and Son. In this context it was clearly a joke: a word that evoked law courts used in a lower-class colloquial context.

Jordan Smith examined uses of said in the Corpus of Contemporary American English (COCA) and found that attributive said has increasingly been used without the for several years now, and outside the legal domain. He observes that syntactic changes and increased frequency have been named by linguists like Joan Bybee as harbingers of grammaticization.

Grammaticization (also known as grammaticalization; search for both) is when an ordinary lexical item (like a noun, verb or adjective, or even a phrase) becomes a grammatical item (like a pronoun, preposition or auxiliary verb). For example, while is a noun meaning a period of time, but it was grammaticized to a conjunction indicating simultaneity. Used is an adjective meaning accustomed, as in “I was used to being lonely,” but has also become part of an auxiliary indicating habitual aspect as in “I used to be lonely.”

Jordan is suggesting that said is no longer just a verb or even an adjective: it’s our newest determiner in English. Determiners are an exclusive club of short words that modify nouns. They include articles like an and the, but also demonstratives like these and quantifiers like several.

Noun phrases without a determiner tend to refer to generic categories, as I have been doing with phrases like legal documents and grammaticization. That is clearly not what is going on with said girlfriend. Noun phrases with said refer to a specific item or group of items, in some sense even more so than noun phrases with the.

Thanks to the wireless Internet at the AACL, I began searching for of said on Twitter, and found a ton of examples. There are plenty of examples of in said as well.
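A search like this can be roughed out in a few lines of Python. This is only a sketch under loud assumptions: the three sample sentences are invented, the function and pattern names are hypothetical, and a plain regular expression cannot tell determiner said from the verb, so every hit would still need to be checked by hand.

```python
import re

# Preposition directly followed by "said" and another word: "of said property".
BARE_SAID = re.compile(r"\b(?:of|in)\s+said\s+\w+", re.IGNORECASE)
# The older pattern with the article: "of the said property".
THE_SAID = re.compile(r"\b(?:of|in)\s+the\s+said\s+\w+", re.IGNORECASE)

def count_said(texts):
    """Return (bare, with_the) counts of the two patterns across all texts."""
    bare = sum(len(BARE_SAID.findall(t)) for t in texts)
    with_the = sum(len(THE_SAID.findall(t)) for t in texts)
    return bare, with_the

# INVENTED examples, not real tweets or corpus lines.
sample_texts = [
    "Tired of said girlfriend already.",
    "The owner of the said property objected.",
    "He said nothing of it.",
]

bare, with_the = count_said(sample_texts)
print(f"bare said: {bare}, the said: {with_the}")
```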

It’s not just happening in English. The analogous French ledit is also used outside the legal domain. Its reanalysis is a bit different, since it incorporates the article rather than replacing it. Like most noun modifiers in French it is inflected for gender and number. I haven’t found anything similar for Spanish.

In 2013 the American Dialect Society chose because as its Word of the Year. Because is already a conjunction, having grammaticized from the noun cause, but it has been reanalyzed again into a preposition, as in because science. Some theorists consider this to be a further step in grammaticization. And here is a twenty-first century prepositional phrase for you, folks: because (P) said (Det) relationship (N).

After Jordan’s presentation it struck me that said is an excellent candidate for the 2016 Word of the Year. And if the ADS isn’t interested, maybe another organization, like the International Cognitive Linguistics Association, can sponsor a Grammaticization of the Year.

Printing differences and material issues in Google Books

I am looking forward to presenting my Digital Parisian Stage corpus and the exciting results I’ve gotten from it so far at the American Association for Corpus Linguistics at Iowa State in September. In the meantime I’m continuing to process texts, working towards a one percent sample from the Napoleonic period (Volume 1 of the Wicks catalog).

One of the plays in my sample is les Mœurs du jour, ou l’école des femmes, a comedy by Collin-Harleville (also known as Jean-François Collin d’Harleville). I ran the initial OCR on a PDF scanned for the Google Books project. For reasons that will become clear, I will refer to it by its Google Books ID, VyBaAAAAcAAJ. When I went to clean up the OCR text, I discovered that it was missing pages 2-6. I emailed the Google Books team about this, and got the following response:


I’m guessing “a material issue” means that those pages were missing from the original paper copy, but I didn’t even bother emailing until the other day, since I found another copy in the Google Books database, with the ID kVwxUp_LPIoC.

Comparing the OCR text of VyBaAAAAcAAJ with the PDF of kVwxUp_LPIoC, I discovered some differences in spelling. For example, throughout the text, words that end in the old-fashioned spelling -ois or -oit in VyBaAAAAcAAJ are spelled with the more modern -ais in kVwxUp_LPIoC. There is also a difference in the way “Madame” is abbreviated (“Mad.” vs. ““) and in which accented letters preserve their accents when set in small caps, and differences in pagination. Here is the entirety of Act III, Scene X in each copy:


Act III, Scene X in copy VyBaAAAAcAAJ

Act III, Scene X in copy kVwxUp_LPIoC
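The spelling comparison described above could be partly automated. The following is a rough sketch, not the procedure I actually used: it aligns the words of two transcriptions with Python’s difflib and reports pairs that differ only in an -oi-/-ai- ending. The function name is hypothetical, and the two sample sentences are invented, not quoted from the play.

```python
import difflib
import re

def spelling_variants(text_a, text_b):
    """Return (old, new) word pairs that differ only by -ois/-oit/-oient vs. -ais/-ait/-aient."""
    words_a = re.findall(r"\w+", text_a.lower())
    words_b = re.findall(r"\w+", text_b.lower())
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only look at one-for-one substitutions between the two texts.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for old, new in zip(words_a[i1:i2], words_b[j1:j2]):
            # Modernize the ending of the older form and see if it matches.
            if re.sub(r"oi(s|t|ent)$", r"ai\1", old) == new:
                pairs.append((old, new))
    return pairs

# INVENTED sample lines standing in for the two copies.
sample_a = "il avoit raison et je le savois"
sample_b = "il avait raison et je le savais"

for old, new in spelling_variants(sample_a, sample_b):
    print(old, "->", new)
```

Any substitution that is not an -oi-/-ai- pair, such as pas versus point, is skipped by this sketch and would need its own check.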

My first impulse was to look at the front matter and see if the two copies were identified as different editions or different printings. Unfortunately, they were almost identical, with the most notable differences being that VyBaAAAAcAAJ has an œ ligature in the title, while kVwxUp_LPIoC is signed by the playwright and marked as being a personal gift from him to an unspecified recipient. Both copies give the exact same dates: the play was first performed on the 7th of Thermidor in year VIII and published in the same year (1800).

The Google Books metadata indicate that kVwxUp_LPIoC was digitized from the Lyon Public Library, while VyBaAAAAcAAJ came from the Public Library of the Netherlands. The other copies I have found in the Google Books database, OyL1oo2CqNIC from the National Library of Naples and dPRIAAAAcAAJ from Ghent University, appear to be the same printing as kVwxUp_LPIoC, as does the copy from the National Library of France.

Since the -ais and spellings are closer to the forms used in France today, we might expect that kVwxUp_LPIoC and its cousins are from a newer printing. But in Act III, Scene XI I came across a difference that concerns negation, the variable that I have been studying for many years. The decadent Parisians Monsieur Basset and Madame de Verdie question whether marriage should be eternal. Our hero Formont replies that he has no reason not to remain with his wife forever. In VyBaAAAAcAAJ he says, “je n’ai pas de raisons,” while in kVwxUp_LPIoC he says “je n’ai point de raisons.”

Act III, Scene XI (page 75) in VyBaAAAAcAAJ

Act III, Scene XI (page 78) in kVwxUp_LPIoC

In my dissertation study I found that the relative use of ne … point had already peaked by the nineteenth century, and was being overtaken by ne … pas. If this play fits the pattern, the use of the more conservative pattern in kVwxUp_LPIoC goes against the more innovative -ais and spellings.

I am not an expert in French Revolutionary printing (if anyone knows a good reference or contact, please let me know!). My best guess is that kVwxUp_LPIoC is from a limited early run, some copies of which were given to the playwright to give away, while VyBaAAAAcAAJ and the other -ais/ … point copies are from a larger, slightly later, printing.

In any case, it is clear that I should pick one copy and make my text consistent with it. Since VyBaAAAAcAAJ is incomplete, I will try dPRIAAAAcAAJ. I will try to double-check all the spellings and wordings, but at the very least I will check all of the examples of negation against dPRIAAAAcAAJ as I annotate them.

Speech role models

John Murphy of Georgia State published an article about using non-native speakers, and specifically the Spanish actor Javier Bardem, as models for teaching English as a Second Language (ESL) or as a foreign language (EFL). Mura Nava tweeted a blog post from Robin Walker connecting Murphy’s work to similar work by Kenworthy and Jenkins, Peter Roach and others. I tried something like this when I taught ESL back in 2010, more or less unaware of all the previous work that Murphy cites, and Mura Nava was interested to know how it went, so here’s the first part of a quick write-up.

When I was asked to teach a class in ESL Speech “Advanced Oral/Aural Communication” at Saint John’s University in the fall of 2010, I had taught French and Linguistics, but I had only tutored English one-on-one. My wife is an experienced professor of ESL and was a valuable source of advice, but our student populations and our goals were different, so I did not simply copy her methods.

One concept that I introduced was that of a Speech Role Model. When I was learning French, I found it invaluable to imitate entertainers; I’ve never met Jacques Dutronc, but I often say that he was one of my best French teachers because of the clever lyricists he worked with and his clear, wry delivery. He was just one of the many French people that I imitated to improve my pronunciation.

This was all back in the days of television and cassettes, and most of the French culture that we had access to here in the United States was filtered through the wine, Proust and Rohmer tastes of American Francophiles. As a geeky kid with a fondness for comedy I found Edith Piaf and even Gérard Depardieu too alien to emulate. I found out about Dutronc in college through a bootleg tape made for me by a student from France who lived down the hall, and then I had to study abroad in France to find more role models.

With today’s multimedia Internet technology, we have an incredible ability to listen to millions of people from around the world. At Saint John’s I asked my students to choose a Speech Role Model for English: a native speaker that they personally admired and wanted to sound like. I was surprised by the number of students who named President Obama as their role model, including female students from China, but on reflection it was an obvious choice, as he is a clear, forceful and eloquent speaker. Other students chose actresses Meryl Streep and Jennifer Aniston, talk-show host Bill O’Reilly and local newscaster Pat Kiernan.

One notable choice, hip-hop artist Eminem, gave me the opportunity to discuss covert prestige and its challenges. Another, the character of Sheldon Cooper from the television series “The Big Bang Theory,” was too scripted, and I was debating whether to accept it when I discovered that it was just a cover so that the student could plagiarize crowdsourced transcriptions.

In subsequent assignments I asked the students to find a YouTube video of their role model and to transcribe a short excerpt. I then asked the students to record themselves imitating that excerpt from their Speech Role Models. Some of the students were engaged and interested, but others seemed frustrated and discouraged. When I listened to my students and compared their speech to that of their chosen role models, I had an idea why. The students who were engaged were either naturally enthusiastic or good mimics, but the challenge was to motivate the others. There was so much distance between them and the native English speakers, much more than could be covered in a semester. That was when I thought of adding a non-native Second Speech Role Model. I’ll have to leave that for another post.

On advising descriptively

Some nice people retweeted my post about being a humble prescriptivist, and I had some interesting reactions in the comments and on Twitter, but Peter Sokolowski had one that I wasn’t prepared for.

Jonathon Owen held up Robert Hall’s Leave Your Language Alone as an example of the kind of pure descriptivist that I was referring to, and Sokolowski tweeted:

After thinking it over, I’ve come to the conclusion that there really are two ideas of “descriptivism.” When writing my post I was thinking of the Robert Hall kind, which is the kind that most linguists talk about and aspire to – although I would agree with Sokolowski that we only wind up as hypocrites, loudly declaiming against prescriptivism as we prescribe left and right. I think Sokolowski was thinking of a different kind of descriptivism, as described by Jesse Sheidlower in an article that Sokolowski tweeted last year:

Descriptivism involves the objective description of the way a language works as observed in actual examples of the language. Descriptive advice — almost an oxymoron — about the acceptability of a word or construction is based solely on usage. If a word or expression is not found in careful or formal speech or writing, good descriptive practice requires the reporting of this information.

This kind of “descriptive advice” (I saw how you ducked “prescription” there, Sheidlower) is a venerable tradition with a long history in second language instruction. Most second language learners aspire to speak and write like native speakers, so it makes sense for their teachers to study the speech and writing of native speakers. As Battye, Hintze and Rowlett tell us, it was applied to instructing native speakers on “good usage” by Claude Favre de Vaugelas in 1647:


These are not just laws I made for our language based on some personal prerogative of mine. That would be reckless, some would say insane, because what authority, what basis do I have for claiming a privilege that is the sole right of Usage – the power that everyone recognizes as the Lord and Master of modern languages?

Vaugelas’ point – the reason people bought his book – was not to base these laws on all usage, but on “good usage,” le bon Vsage, which he explicitly defined as the usage of the members of King Louis XIV’s court. His book contained “descriptive advice” for people who were already literate in French – and thus presumably upwardly mobile – and wanted to write like courtiers so that they would fit in better, and maybe even be admired, at court. Write like these people and you’ll get ahead.

Somewhere along the line Vaugelas’ bon Vsage became Sokolowski’s “standards of good English.” The goal is still to write like these people and get ahead – Sokolowski tweeted, “I bet [Hall’s] kids speak good English.” I bet, but I doubt they needed any descriptive advice to do it. They spoke good English because they were raised as members of the elite. Sokolowski’s job as an editor at Merriam-Webster is to describe the writing of the elites and make prescriptions (aka descriptive advice) that upwardly mobile people can follow when they want to fit in.

The main difference between France in 1647 and the United States in 2013 is that there’s no explicit reference to a court. There are still elites, and people are still striving to fit in with them, but the old court all went to the guillotine, so nobody wants to name the new court. Instead they just handwave in the direction of “standards.”

If we’re using this definition of “descriptivist” – someone who describes the way elites talk and sells that descriptive advice to strivers – then my descriptivist chemist is not accurate. I think that’s a perfectly valid definition of “descriptivist” and I’m not judging (even if I am teasing a little) – I may be looking for a job doing that at some point.

I think it is important for linguists to be clear when we are actually attempting to describe language objectively as scientists, when we are advising descriptively, when we are humbly prescribing language with a political goal in mind, and when we’re being the kind of crotchety traditionalists that Vaugelas thought were insane back in 1647.

Illustration: Joseph M. Gleeson

Just so stories in French negation

Just So stories were named by Rudyard Kipling in his book of the same name, which contained stories like “How the Rhinoceros Got his Skin.” In that one, the rhino’s skin starts out tight, but after he takes it off to swim, a man puts crumbs in it to take revenge for the rhino eating his cake. When the rhino puts his skin back on, it itches so much that he loosens it up with all his scratching. Presumably something similar happened with basset hounds.

These stories can be fun, especially for kids who ask “why?” and won’t take “I don’t know” for an answer. They’re entertaining, but they’re not science and they’re not history. Even if they’re broadly consistent with a scientific theory, if they’re not based on actual data, they’re just fiction.

This is different from the normal simplification that happens in scientific explanations. We know that the Earth is not a perfectly round sphere, that it bulges out a little at the equator. Sometimes it’s enough to think of the world as round, and nobody needs to worry about oblate spheroids.

The main difference is that scientific simplification removes distracting detail from the raw data to allow the bigger picture to be seen more clearly, but Just So stories add detail that doesn’t exist in the data, and may actually create a picture that doesn’t exist. This is why, as science, they are so dangerous.

Linguistics is certainly no stranger to Just So stories. The most famous may be the old chestnut that the Eskimos have a hundred (or a thousand, or…) words for snow. This has long been used to illustrate the effect of environment on language, even though Geoffrey Pullum famously showed it to be false in 1989.

Just So stories are also found in the history of French negation, the subject of my dissertation. There is a story that you will find in almost every article or book discussing the evolution of negation. Here’s the version from Detges and Waltereit (2002):

As a standard example of grammaticalization, consider the French negation ne … pas. A lexical item, the Latin full noun passus ‘step’, has turned into a grammatical item, the Modern French negation marker pas.

(3) a. Before grammaticalization: Latin
non vado   passum
NEG go:1SG step:ACC
'I don't go a step'

b. After grammaticalization: Modern French
je ne vais   pas
'I don't go'

Reading this, I assumed that Detges and Waltereit had some attestations of non vado passum in Latin. That’s the way science works, and history. We do experiments to collect data, and we base our stories of the past on documents and artifacts. In historical linguistics we have what people wrote, and we have reconstructions. Because the reconstructions are less reliable as evidence, we mark them with asterisks.

I was all ready to repeat this story as I told the history of French negation. In fact, one of my professors suggested that I look for evidence of pas being initially restricted to verbs of motion, then gradually used with a broader and broader range of verbs. I did look, but I discovered that it’s just a story. We don’t have any evidence that anyone ever wrote non vado passum, other than linguists talking about grammaticization.

What I did find was this excellent three-part opus on Romance negation by Alfred Schweighäuser, published in 1851-52, digitized to PDF by Google Books and extracted for your convenience here (section 1, section 2, section 3). In section 3 (Part 2), he takes you on a very thorough tour of all the expressions that have been used to “supplement” negation in Latin and its descendants over the years. After spending some time discussing ne … pas, he concludes:

Observons toutefois que cette modification apportée au sens du mot pas est antérieure aux plus anciens monuments de la langue. Si haut que nous remontions dans le cours des siècles, les textes ne nous montrent jamais cette négation explétive que privée de l’article, et jointe indiféremment à des verbes de toute signification.

Let us note in any case that this modification made to the sense of the word pas is earlier than the most remote works of the language. No matter how far back we look across the centuries, the texts only show us that negation shorn of its article and combined indifferently with verbs from any semantic field.

One thing I find remarkable about this is that these aspects of language change were known and studied 161 years ago. And yet it was only a year later, in 1853, that P.L.J.B. Gaussin gave us our first citation of non vado passum:

Nous avons encore à parler d’une dernière modification que quelques mots subissent : elle a lieu lorsque, par suite d’un emploi très-fréquent, ils ne deviennent que de simples formes grammaticales. C’est un fait que nous aurons l’occasion de vérifier en polynésien ; nous en trouvons d’ailleurs de nombreux exemples dans nos langues d’Europe : on connaît l’origine des négations françaises pas et point ; on a d’abord dit non vado passum ou passu, je ne vais d’un pas ; non video punctum, je ne vois un point. Pas et point, par un usage devenu de plus en plus général, n’ont plus été par la suite que de simples signes grammaticaux.

We have yet to discuss one last modification that certain words undergo. It happens when, in the course of very frequent usage, they are transformed into simple grammatical forms. This is a fact that we will have the opportunity to confirm in Polynesian; we also find many examples in our European languages. We know the origin of the French negations pas and point: people first said non vado passum or passu, I am not going one step, non video punctum, I do not see one point. Pas and point, by virtue of more and more general usage, have become nothing more than simple grammatical signs.

Schweighäuser and Gaussin perfectly illustrate the difference between history and Just So stories. Schweighäuser combs through Latin and Old French texts in detail to find all the different ways that the words are used. His wealth of detail is perfectly appropriate for his task, but the story could be told to outsiders in a compelling way by simply omitting some of that detail. There are many examples of this kind of semantic broadening with other constructions; those could have been used instead. But Gaussin doesn’t do that. He just makes stuff up.

It is obviously silly to single out Detges and Waltereit for this Just So story, since it came from Gaussin, and has been handed down ever since. But other than a brief mention in 1907, it was dormant until Lüdtke (1980) revived it. It seems to have been most widely propagated by Paolo Ramat in 1987.

Looking back on this, I appreciate my professor’s invitation to re-examine this story rather than simply repeating it. We should do that with all of our standard stories, to find out which ones are supported by the data, and which are Just So.