The Corpus de la scène parisienne

It is the year 1810, and you are strolling along the Grands Boulevards of Paris. You get the impression that the whole city, indeed the whole of France, has had the same idea and has come out to stroll, to see people and to be seen. What do you hear?

You arrive at a theater, show your ticket for a new play, and go in. The play begins. What do you hear from the stage? What voices, what language?

The Corpus de la scène parisienne project seeks to answer that last question, with the idea that the answer will also tell us something about the first. It draws on the work of the scholar Beaumont Wicks and on resources such as Google Books and the Gallica project of the Bibliothèque Nationale de France to create a truly representative corpus of the language of Parisian theater.

Some corpora are built on a “principle of authority,” which tends to foreground the voices of aristocrats and the upper bourgeoisie. The Corpus de la scène parisienne corrects this bias by relying on a randomly drawn sample. By including popular theater in this way, it allows the language of the working classes, as represented on stage, to take its place in the linguistic picture of the period.

The first phase of construction, covering the years 1800 to 1815, has already contributed to some interesting findings. For example, in the CSP the ne … pas construction is used in 75% of sentence negations, whereas in the four plays from the same period that are part of the FRANTEXT corpus, ne … pas is used in only 49% of sentence negations.

In 2016 I created a repository on GitHub and began adding the texts of the first phase in HTML format. You can read them for fun (I found Jocrisse-Maître et Jocrisse-Valet especially amusing), stage them (I will buy tickets) or use them in your own research. You might also like to contribute to the repository by correcting errors in the texts, adding new texts from the catalog, or converting the texts to new formats such as TEI or Markdown.

In January 2018 I created the spectacles_xix bot on Twitter. Each day it posts descriptions of the plays that premiered on that day exactly two hundred years earlier.

Feel free to use this corpus in your research, but please remember to cite me, or even contact me to discuss possible collaborations!

On this day in Parisian theater

Since I first encountered The Parisian Stage, I’ve been impressed by the completeness of Beaumont Wicks’s life’s work: from 1950 through 1979 he compiled a list of every play performed in the theaters of Paris between 1800 and 1899. I’ve used it as the basis for my Digital Parisian Stage corpus, currently a one percent sample of the first volume (Wicks 1950), available in full text on GitHub.

Last week I had an idea for another project. Science requires both qualitative and quantitative research, and I’ve admired Neil Freeman’s @everylotnyc Twitter bot as a project that conveys the diversity of the underlying data and invites deep, qualitative exploration.

In 2016, with Timm Dapper, Elber Carneiro and Laura Silver I forked Freeman’s everylotbot code to create @everytreenyc, a random walk through the New York City Parks Department’s 2015 street tree census. Every three hours during normal New York active time, the bot tweets information about a tree from the database, in a template written by Laura that may also include topical, whimsical sayings.

Recently I’ve encountered a lot of anniversaries. A lot of them are connected to the centenary of the First World War, but some are more random: I just listened to an episode of la Fabrique de l’histoire about François Mitterrand’s letters to his mistress that was promoted with the fact that he was born in 1916, one hundred years before that episode aired, even though he did not start writing those letters until 1962.

There are lots of “On this day” blogs and Twitter feeds, such as the History Channel and the New York Times, and even specialized feeds like @ThisDayInMETAL. There are #OnThisDay and #otd hashtags, and in French #CeJourLà. The “On this day” feeds have two things in common: they tend to be hand-curated, and they jump around from year to year. For April 13, 2014, the @CeJourLa feed tweeted events from 1849, 1997, 1695 and 1941, in that order.

Two weeks ago I was at the Annual Convention of the Modern Language Association, describing my Digital Parisian Stage corpus, and I realized that in the Parisian Stage there were plays being produced exactly two hundred years ago. I thought of the #OnThisDay feeds and @everytreenyc, and realized that I could create a Twitter bot to pull information about plays from the database and tweet them out. A week later, @spectacles_xix sent out its first automated tweet, about the play la Réconciliation par ruse.

@spectacles_xix runs on PythonAnywhere in Python 3.6, and accesses a MySQL database. It uses Mike Verdone’s Twitter API client. The source is open on GitHub.

Unlike other feeds, including this one from the French Ministry of Culture that just tweeted about the anniversary of the première of Rostand’s Cyrano de Bergerac, this one will not be curated, and it will not jump around from year to year. It will tweet every play that premièred in 1818, in order, until the end of the year, and then go on to 1819. If there is a day when no plays premièred, like January 16, @spectacles_xix will not tweet.
I have a couple of ideas about more features to add, so stay tuned!
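
The selection rule is simple enough to sketch in a few lines of Python. This is only an illustration under my own assumptions, not the bot's actual source: the `PLAYS` list stands in for the MySQL table, its rows are invented placeholders rather than entries from the Wicks catalog, and the function names and tweet wording are hypothetical.

```python
from datetime import date

# Hypothetical stand-in for the database table: (premiere date, title).
# These rows are invented placeholders, not real catalog entries.
PLAYS = [
    (date(1818, 1, 15), "Placeholder Play A"),
    (date(1818, 1, 17), "Placeholder Play B"),
    (date(1818, 1, 17), "Placeholder Play C"),
]


def two_hundred_years_before(today):
    """Return the same calendar day 200 years earlier, or None if it does not exist."""
    try:
        return today.replace(year=today.year - 200)
    except ValueError:
        # A February 29 may have no counterpart in the earlier year.
        return None


def tweets_for(today, plays=PLAYS):
    """Build one message per play that premiered exactly 200 years before `today`."""
    target = two_hundred_years_before(today)
    if target is None:
        return []
    return [
        f"{title} premiered on {target.isoformat()}, 200 years ago today."
        for premiere, title in plays
        if premiere == target
    ]


# A day with no premieres produces no tweets at all.
quiet_day = tweets_for(date(2018, 1, 16))
busy_day = tweets_for(date(2018, 1, 17))
```

Because the list is filtered by exact date, the feed stays in chronological order and simply falls silent on days like January 16.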

How Google’s Pixel Buds will change the world!

Scene: a quietly bustling bistro in Paris’s 14th Arrondissement.

SERVER: Oui, vous désirez?
PIXELBUDS: Yes, you desire?
TOURIST: Um, yeah, I’ll have the steak frites.
SERVER: Que les frites?
PIXELBUDS: Than fries?
TOURIST: No, at the same time.
SERVER: Alors, vous voulez le steak aussi?
PIXELBUDS: You want the steak too?
TOURIST: Yeah, I just ordered the steak.
SERVER: Okay, du steak, et des frites, en même temps.
PIXELBUDS: Okay, steak, and fries at the same time.
TOURIST: You got it.

(All translations by Google Translate. Photo: Alain Bachelier / Flickr.)

The Digital Parisian Stage is now on GitHub

For the past five years I’ve been working on a project, the Digital Parisian Stage, that aims to create a representative sample of nineteenth-century Parisian theater. I’ve made really satisfying progress on the first stage, 1800 through 1815, which corresponds to the first volume of Charles Beaumont Wicks’s catalog, the Parisian Stage (1950). Of the initial one-percent sample (31 plays), I have obtained 24, annotated 15 and discarded three for length, for a current total of twelve plays.

At conferences like the Keystone Digital Humanities Conference and the American Association for Corpus Linguistics, I’ve presented results showing that these twelve plays cover a much wider and more innovative range of language than the four theatrical plays from this period in the FRANTEXT corpus, a sample drawn fifty years ago based on a “principle of authority.”

Just looking at declarative sentence negation, I found that in the FRANTEXT corpus the playwrights negate declarative sentences with the ne … pas construction 49 percent of the time. In the twelve randomly sampled plays, the playwrights used ne … pas 75 percent of the time to negate declarative sentences. Because this was a representative sample, I even have a p value below 0.01, based on a chi-square goodness of fit test!
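
For readers who want to see the shape of that test, here is a minimal sketch in plain Python. The raw token counts are not given in this post, so the sample size of 100 negated sentences is purely an assumption for illustration, and the function names are my own; only the 75 percent and 49 percent proportions come from the text above. With the real counts the statistic would be different.

```python
import math


def chi_square_gof(observed, expected_props):
    """Chi-square goodness-of-fit statistic for observed counts vs expected proportions."""
    total = sum(observed)
    expected = [p * total for p in expected_props]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df1(chi2):
    """Upper-tail p value for a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# ASSUMED counts: 100 negated sentences in the random sample, 75 with ne ... pas.
observed = [75, 25]
# Expected proportions taken from the FRANTEXT plays: 49% ne ... pas, 51% other.
expected_props = [0.49, 0.51]

chi2 = chi_square_gof(observed, expected_props)
p = p_value_df1(chi2)
```

With two categories there is one degree of freedom, which is why the p value can be computed from the complementary error function without any statistics library.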

This seems like a good point to release the twelve texts that I have OCRed and cleaned to the public. I have uploaded them to GitHub as HTML files. In this I have been partly inspired by the work of Alex Gil, now my colleague at Columbia University.

You can read them for your own entertainment (Jocrisse-maître et Jocrisse-valet is my favorite), stage your own production of them (I’ll buy tickets!) or use them as data for your scientific investigations. I hope that you will also consider contributing to the repository, by checking for errors in the existing texts, adding new texts from the catalog, or converting them to a different format like TEI or Markdown.

If you do use them in your own studies, please don’t forget to cite me along the lines given below, or even to contact me to discuss co-authorship!

Grieve-Smith, Angus B. (2016). The Digital Parisian Stage Corpus. GitHub. https://github.com/grvsmth/theatredeparis

“Said” for 2016 Word of the Year

I just got back from the American Association for Corpus Linguistics conference in Ames, Iowa, and I’m calling the Word of the Year: for 2016 it will be said.

You may think you know said. It’s the past participle of say. You’ve said it yourself many times. What’s so special about it?

What’s special was revealed by Jordan Smith, a graduate student at Iowa State, in his presentation on Saturday afternoon. said is becoming a determiner. It is grammaticizing.

In addition to its participial use (“once the words were said”) you’ve probably seen said used as an attributive adjective (“the said property”). It indicates that the noun it modifies refers to a person, place or thing that has been mentioned recently, with the same noun, and that the speaker/writer expects it to be active in the hearer/reader’s memory.

Attributive said is strongly associated with legal documents, as in its first recorded use in the English Parliament in 1327. The Oxford English Dictionary reports that said was used outside of legal contexts as early as 1973, in the English sitcom Steptoe and Son. In this context it was clearly a joke: a word that evoked law courts used in a lower-class colloquial context.

Jordan Smith examined uses of said in the Corpus of Contemporary American English (COCA) and found that attributive said has increasingly been used without the for several years now, and outside the legal domain. He observes that syntactic changes and increased frequency have been named by linguists like Joan Bybee as harbingers of grammaticization.

Grammaticization (also known as grammaticalization; search for both) is when an ordinary lexical item (like a noun, verb or adjective, or even a phrase) becomes a grammatical item (like a pronoun, preposition or auxiliary verb). For example, while is a noun meaning a period of time, but it was grammaticized to a conjunction indicating simultaneity. Used is an adjective meaning accustomed, as in “I was used to being lonely,” but has also become part of an auxiliary indicating habitual aspect as in “I used to be lonely.”

Jordan is suggesting that said is no longer just a verb or even an adjective: it’s our newest determiner in English. Determiners are an exclusive club of short words that modify nouns. They include articles like an and the, but also demonstratives like these and quantifiers like several.

Noun phrases without a determiner tend to refer to generic categories, as I have been doing with phrases like legal documents and grammaticization. That is clearly not what is going on with said girlfriend. Noun phrases with said refer to a specific item or group of items, in some sense even more so than noun phrases with the.

Thanks to the wireless Internet at the AACL, I began searching for of said on Twitter, and found a ton of examples. There are plenty of in said examples as well.

It’s not just happening in English. The analogous French ledit is also used outside the legal domain. Its reanalysis is a bit different, since it incorporates the article rather than replacing it. Like most noun modifiers in French it is inflected for gender and number. I haven’t found anything similar for Spanish.

In 2013 the American Dialect Society chose because as its Word of the Year. Because is already a conjunction, having grammaticized from the noun cause, but it has been reanalyzed again into a preposition, as in because science. Some theorists consider this to be a further step in grammaticization. And here is a twenty-first century prepositional phrase for you, folks: because (P) said (Det) relationship (N).

After Jordan’s presentation it struck me that said is an excellent candidate for the 2016 Word of the Year. And if the ADS isn’t interested, maybe another organization, like the International Cognitive Linguistics Association, can sponsor a Grammaticization of the Year.

Printing differences and material issues in Google Books

I am looking forward to presenting my Digital Parisian Stage corpus and the exciting results I’ve gotten from it so far at the American Association for Corpus Linguistics at Iowa State in September. In the meantime I’m continuing to process texts, working towards a one percent sample from the Napoleonic period (Volume 1 of the Wicks catalog).

One of the plays in my sample is les Mœurs du jour, ou l’école des femmes, a comedy by Collin-Harleville (also known as Jean-François Collin d’Harleville). I ran the initial OCR on a PDF scanned for the Google Books project. For reasons that will become clear, I will refer to it by its Google Books ID, VyBaAAAAcAAJ. When I went to clean up the OCR text, I discovered that it was missing pages 2-6. I emailed the Google Books team about this, and got the following response:


I’m guessing “a material issue” means that those pages were missing from the original paper copy, but I didn’t even bother emailing until the other day, since I found another copy in the Google Books database, with the ID kVwxUp_LPIoC.

Comparing the OCR text of VyBaAAAAcAAJ with the PDF of kVwxUp_LPIoC, I discovered some differences in spelling. For example, throughout the text, words that end in the old fashioned spelling -ois or -oit in VyBaAAAAcAAJ are spelled with the more modern -ais in kVwxUp_LPIoC. There is also a difference in the way “Madame” is abbreviated (“Mad.” vs. “M.me“) and in which accented letters preserve their accents when set in small caps, and differences in pagination. Here is the entirety of Act III, Scene X in each copy:


Act III, Scene X in copy VyBaAAAAcAAJ

Act III, Scene X in copy kVwxUp_LPIoC

My first impulse was to look at the front matter and see if the two copies were identified as different editions or different printings. Unfortunately, they were almost identical, with the most notable differences being that VyBaAAAAcAAJ has an œ ligature in the title, while kVwxUp_LPIoC is signed by the playwright and marked as being a personal gift from him to an unspecified recipient. Both copies give the exact same dates: the play was first performed on the 7th of Thermidor in year VIII and published in the same year (1800).

The Google Books metadata indicate that kVwxUp_LPIoC was digitized from the Lyon Public Library, while VyBaAAAAcAAJ came from the Public Library of the Netherlands. The other copies I have found in the Google Books database, OyL1oo2CqNIC from the National Library of Naples and dPRIAAAAcAAJ from Ghent University, appear to be the same printing as kVwxUp_LPIoC, as does the copy from the National Library of France.

Since the -ais and M.me spellings are closer to the forms used in France today, we might expect that kVwxUp_LPIoC and its cousins are from a newer printing. But in Act III, Scene XI I came across a difference that concerns negation, the variable that I have been studying for many years. The decadent Parisians Monsieur Basset and Madame de Verdie question whether marriage should be eternal. Our hero Formont replies that he has no reason not to remain with his wife forever. In VyBaAAAAcAAJ he says, “je n’ai pas de raisons,” while in kVwxUp_LPIoC he says “je n’ai point de raisons.”

Act III, Scene XI (page 75) in VyBaAAAAcAAJ

Act III, Scene XI (page 78) in kVwxUp_LPIoC

In my dissertation study I found that the relative use of ne … point had already peaked by the nineteenth century, and was being overtaken by ne … pas. If this play fits the pattern, the use of the more conservative pattern in kVwxUp_LPIoC goes against the more innovative -ais and M.me spellings.

I am not an expert in French Revolutionary printing (if anyone knows a good reference or contact, please let me know!). My best guess is that kVwxUp_LPIoC and the other -ais/M.me/ne … point copies are from a limited early run, some copies of which were given to the playwright to give away, while VyBaAAAAcAAJ is from a larger, slightly later, printing.

In any case, it is clear that I should pick one copy and make my text consistent with it. Since VyBaAAAAcAAJ is incomplete, I will try dPRIAAAAcAAJ. I will try to double-check all the spellings and wordings, but at the very least I will check all of the examples of negation against dPRIAAAAcAAJ as I annotate them.
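
This kind of checking can be partly automated. The following is a rough sketch, assuming the two copies are available as plain-text transcriptions; the function name is hypothetical, and the example strings are the two versions of Formont's line quoted above. It uses the standard library's `difflib` to list the places where the wording differs.

```python
import difflib


def word_differences(text_a, text_b):
    """List (wording in A, wording in B) pairs wherever two transcriptions differ."""
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return differences


# Formont's line as it appears in each copy.
line_in_first_copy = "je n'ai pas de raisons"
line_in_second_copy = "je n'ai point de raisons"

diffs = word_differences(line_in_first_copy, line_in_second_copy)
# diffs == [('pas', 'point')]
```

On real OCR output the list would also be full of recognition errors, so the results would still need to be checked by eye against the page images.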

Speech role models

John Murphy of Georgia State published an article about using non-native speakers, and specifically the Spanish actor Javier Bardem, as models for teaching English as a Second Language (ESL) or as a foreign language (EFL). Mura Nava tweeted a blog post from Robin Walker connecting Murphy’s work to similar work by Kenworthy and Jenkins, Peter Roach and others. I tried something like this when I taught ESL back in 2010, more or less unaware of all the previous work that Murphy cites, and Mura Nava was interested to know how it went, so here’s the first part of a quick write-up.

When I was asked to teach a class in ESL Speech “Advanced Oral/Aural Communication” at Saint John’s University in the fall of 2010, I had taught French and Linguistics, but I had only tutored English one-on-one. My wife is an experienced professor of ESL and was a valuable source of advice, but our student populations and our goals were different, so I did not simply copy her methods.

One concept that I introduced was that of a Speech Role Model. When I was learning French, I found it invaluable to imitate entertainers; I’ve never met Jacques Dutronc, but I often say that he was one of my best French teachers because of the clever lyricists he worked with and his clear, wry delivery. He was just one of the many French people that I imitated to improve my pronunciation.

This was all back in the days of television and cassettes, and most of the French culture that we had access to here in the United States was filtered through the wine, Proust and Rohmer tastes of American Francophiles. As a geeky kid with a fondness for comedy I found Edith Piaf and even Gérard Depardieu too alien to emulate. I found out about Dutronc in college through a bootleg tape made for me by a student from France who lived down the hall, and then I had to study abroad in France to find more role models.

With today’s multimedia Internet technology, we have an incredible ability to listen to millions of people from around the world. At Saint John’s I asked my students to choose a Speech Role Model for English: a native speaker that they personally admired and wanted to sound like. I was surprised by the number of students who named President Obama as their role model, including female students from China, but on reflection it was an obvious choice, as he is a clear, forceful and eloquent speaker. Other students chose actresses Meryl Streep and Jennifer Aniston, talk-show host Bill O’Reilly and local newscaster Pat Kiernan.

One notable choice, hip-hop artist Eminem, gave me the opportunity to discuss covert prestige and its challenges. Another, the character of Sheldon Cooper from the television series “The Big Bang Theory,” was too scripted, and I was debating whether to accept it when I discovered that it was just a cover so that the student could plagiarize crowdsourced transcriptions.

In subsequent assignments I asked the students to find a YouTube video of their role model and to transcribe a short excerpt. I then asked the students to record themselves imitating that excerpt from their Speech Role Models. Some of the students were engaged and interested, but others seemed frustrated and discouraged. When I listened to my students and compared their speech to that of their chosen role models, I had an idea why. The students who were engaged were either naturally enthusiastic or good mimics, but the challenge was to motivate the others. There was so much distance between them and the native English speakers, much more than could be covered in a semester. That was when I thought of adding a non-native Second Speech Role Model. I’ll have to leave that for another post.