Online learning and intellectual honesty

In January I wrote that I believe online learning is possible, but I have doubts about whether online courses are an adequate substitute for in-person college classes, let alone an improvement. One of those doubts concerns trust and intellectual honesty.

Any course is an exchange. The students pay money to the college, the instructor gets a cut, and the students get something of value in return. What that something is can be disputed. In theory, the teacher gives the students knowledge: information and skills.

In practice, some of the students actually expect to receive knowledge in exchange for their tuition. Some of them want knowledge but have gotten discouraged. Some wouldn’t mind a little knowledge, but that’s not what they’re there for. Others just have no time for actual learning.

If they’re not there for knowledge, why are they there? For credentials. They want a degree, and the things that go with a degree and make it more valuable for getting a good job: a major, a course list, good grades, letters of recommendation, connections.

If learning is not important, or if the credentials are urgent enough, it is tempting to skip the learning, just going through the motions. That means pretending to learn, or pretending that you learned more than you did. Most teachers have encountered this attitude at some point.

I have seen various manifestations of the impulse to cheat in every class I’ve taught over the years. Some people might be tempted to treat it like any other transaction. It is hard to make a living while being completely ethical. I fought it for several reasons.

First, I genuinely enjoy learning and I love studying languages, and I want to share that enjoyment and passion with my students. Second, many of my students have been speech pathology majors. I have experienced speech pathology that was not informed by linguistics, and I know that a person who doesn’t take linguistics seriously is not fit to be a speech pathologist.

If that wasn’t enough, I was simply not getting paid enough to tolerate cheating. At the wages of an adjunct professor, I wasn’t in it for the money. I was doing it to pass on my knowledge and gain experience, and looking the other way while students cheated was not the kind of experience I signed up for.

I’ve seen varying degrees of dishonesty in my years of teaching. In one French class, a student tried to hand in an essay in Spanish; in his haste he had chosen the wrong option on the machine translation app. I developed strategies for deterring cheating, such as multiple drafts and a focus on proper citation. But I was not prepared for how much cheating I would find when I taught an online course.

The most effective deterrent was simply to get multiple examples of a student’s work: in class discussions, in small group work, in homeworks and on exams. That allowed me to spot inconsistent quality that might turn out to be plagiarism.

In these introductory linguistics courses, the homeworks themselves were minor exercises, mainly for the students to get feedback on whether they had understood the reading. If a student skipped a reading and plagiarized the homework assignment, it would usually be obvious to both of us when we went over the material in class. That would give the student feedback so that they could change their habits before the first exam.

The first term that I taught this course online, I noticed that some students were getting all the answers right on the homeworks. I was suspicious, but I gave the students the benefit of the doubt. Maybe they had taken linguistics in high school, or read some good books.

Then I noticed that the answers were all the same, and I began to notice quirks of language that didn’t fit my students. One day I saw that the answers were all in an unusual font. I googled one of the quirky phrases and immediately found a file of answers to the questions for that chapter.

I started searching around and found answers to every homework in the textbook. These students were simply googling the questions, copying the answers, and pasting them into Blackboard. They weren’t reading and they weren’t discussing the material. And it showed in their test results. But because this was a summer course, they didn’t have time to recover, and they all got bad grades.

I understood where they were coming from. They needed to knock out this requirement for their degree. They didn’t care about linguistics, or if they did, they didn’t have time for it. They wanted to get the work out of the way for this class and then go to their job or their internship or their other classes. Maybe they wanted to go drinking, but I knew these Speech Pathology students well enough to know that they weren’t typically party animals.

I’ve had jobs where I saw shady practices and just went along with it, but in this case I couldn’t do that, for the reasons I gave above. My compensation for this work wasn’t the meager adjunct pay that was deposited in my checking account every two weeks. It was the knowledge that I had passed on some ideas about language to these students. It was also the ability to say that I had taught linguistics, and even online.

The only solution I had to the problem was to write my own homework questions, ones that could be answered online, but where the appropriate answers couldn’t be found with a simple Google search.

The next term I taught the course online I had to deal with students sharing answers. This was not collaboration in the groups I had carefully constructed so that the student finishing her degree in another state could learn through peer discussion; it was one student simply copying the homework her friend had done. They did it on exams too, where they were supposed to be answering the questions alone. This meant that I also had to come up with questions where the answers were individual and couldn’t be copied.

I worked hard at it. My student evaluations for the online courses were pretty bad for that first summer, and for the next term, and the one after that. But the term after that they were almost as good as the ones for my in-person courses.

Unfortunately, that’s when I had to tell my coordinator that I couldn’t teach any more online courses. Because to teach them right required a lot of time – especially if every assignment has to be protected against students googling the answers or shouting them to each other across the room.

The good news is that in this whole process I learned a ton of interesting things about language and linguistics, and how to teach them. I’ve found that many of the strategies I developed for online teaching are helpful for in-person classes. I’m planning to post about some of them in the near future.

The Photo Roster, a web app for Columbia University faculty

Since July 2016 I have been working as Associate Application Systems in the Teaching and Learning Applications group at Columbia University. I have developed several apps, including this Photo Roster, an LTI plugin to the Canvas Learning Management System.

The back end of the Photo Roster is written in Python and Flask. The front end uses JavaScript with jQuery to filter the student listings and photos, and to create a flash card app to help instructors learn their students’ names.

This is the third generation of the Photo Roster tool at Columbia. The first generation, for the Prometheus LMS, was famously scraped by Mark Zuckerberg when he extended Facebook to Columbia. To prevent future release of private student information, this version uses SAML and OAuth2 to authenticate users and securely retrieve student information from the Canvas API, and Oracle SQL to store and retrieve the photo authorizations.

It would be a release of private student information if I showed you the Roster live, so I created a demo class with famous Columbia alumni, and used a screen recorder to make this demo video. Enjoy!

I just have to outrun your theory

The Problem

You’ve probably heard the joke about the two people camping in the woods who encounter a hungry predator. One person stops to put on running shoes. The other says, “Why are you wasting time? Even with running shoes you’re not going to outrun that animal!” The other replies, “I don’t have to outrun the animal, I just have to outrun you.”

For me this joke highlights a problem with the way some people argue about climate change. First of all, spreading uncertainty and doubt against competitors is a common marketing tactic, and as Naomi Oreskes and Erik Conway documented in their book Merchants of Doubt, that same tactic has been used by marketers against concerns about smoking, DDT, acid rain and most recently climate change.

In the case of climate change, as with fundamentalist criticisms of evolution, there is a lot of stress on the idea that the climatic models are “only a theory,” and that they leave room for the possibility of error. The whole idea is to deter a certain number of influential people from taking action.

That Bret Stephens Column

The latest example is Bret Stephens, newly hired as an opinion columnist by New York Times editors who should really have known better. Stephens’s first column is actually fine on the surface, as far as it goes, aside from some factual errors. Its message is simple: never trust anyone who claims to be 100% certain about anything. Most people know this, so if you claim to be 100% certain, you may wind up alienating some potential allies. And he doesn’t go beyond that; I re-read it several times in case I missed anything.

Since all Stephens did was to say those two things, neither of which amounts to an actual critique of climate change or an argument that we should not act, the intensely negative reactions the column generated may be a little surprising. But it helps if you look back at Stephens’s history and see that he’s written more or less the same thing over and over again, at the Wall Street Journal and other places.

Many of the responses to Stephens’s column have pointed out that if there’s any serious chance of climate change having the effects that have been predicted, we should do something about it. The logical next step is talking about possible actions. Stephens hasn’t talked about any possible actions in over fifteen years, which is pretty solid evidence of concern trolling: he pretends to be offering constructive criticism while having no interest in actually doing anything constructive. And if you go all the way back to a 2002 column in the Jerusalem Post, you can see that he was much more overtly critical in the past.

Stephens is very careful not to recommend any particular course of action, but sometimes he hints at the potential costs of following recommendations based on the most widely accepted climate change models. Underlying all his columns is the implication that the status quo is just fine: Stephens doesn’t want to do anything to reduce carbon emissions. He wants us to keep mining coal, pumping oil and driving everywhere in single-occupant vehicles.

People are correctly discerning Stephens’s intent: to spread confusion and doubt, disrupting the consensus on climate change and providing cover for greedy polluters and ideologues of happy motoring. But they play into his trap, responding in ways that look repressive, inflexible and intolerant. In other words, Bret Stephens is the Milo Yiannopoulos of climate change.

The weak point of mainstream science

Stephens’s trolling is particularly effective because he exploits a weakness in the way mainstream scientists handle theories. In science, hypotheses are predictions that can be tested and found to be true or false: the hypothesis that you can sail around the world was confirmed when Juan Sebastián Elcano completed Magellan’s expedition.

Many people view scientific theories as similarly either true or false. Those that are true – complete and consistent models of reality – are valid and useful, but those that are false are worthless. For them, Galileo’s measurements of the movements of the planets demonstrated that the heliocentric model of the solar system is true and the model with the earth at the center is false.

In this all-or-nothing view of science, uncertainty is death. If there is any doubt about a theory, it has not been completely proven, and is therefore worthless for predicting the future and guiding us as we decide what to do.

Trolls like Bret Stephens and the Marshall Institute exploit this intolerance of uncertainty by playing up any shred of doubt about climate change. And there are many such doubts, because this is actually the way science is supposed to work: highlighting uncertainty and being cautious about results. Many people respond to them in the most unscientific ways, by downplaying doubts and pointing to the widespread belief in climate change among scientists.

The all-or-nothing approach to theories is actually a betrayal of the scientific method. The caution built into the gathering of scientific evidence was not intended as a recipe for paralysis or preparation for popularity contests. There is a way to use cautious reports and uncertain models as the basis for decisive action.

The instrumental approach

This approach to science is called instrumentalism, and its core principles are simple: theories are never true or false. Instead, they are tools for understanding and prediction. A tool may be more effective than another tool for a specific purpose, but it is not better in any absolute sense.

In an instrumentalist view, when we find fossils that are intermediate between species it does not demonstrate that evolution is true and creation is false. Instead, it demonstrates that evolution is a better predictor of what we will find underground, and produces more satisfying explanations of fossils.

Note that when we evaluate theories from an instrumental perspective, it is always relative to other theories that might also be useful for understanding and predicting the same data. Like the two people running from the wild animal, we are not comparing theories against some absolute standard of truth, but against each other.

In climate change, instrumentalism simply says that certain climate models have been better than others at predicting the rising temperature readings and melting glaciers we have seen recently. These models suggest that it is all the driving we’re doing and the dirty power plants we’re running that are causing these rising temperatures, and to reduce the dangers from rising temperatures we need to reconfigure our way of living around walking and reducing our power consumption.

Evaluating theories relative to each other in this way takes all the bite out of Bret Stephens’s favorite weapon. He never makes it explicit, but he does have a theory: that we’re not doing much to raise the temperature of the planet. If we make his theory explicit and evaluate it against the best climate change models, it sucks. It makes no sense of the melting glaciers and rising tides, and has done a horrible job of predicting climate readings.

We can fight against Bret Stephens and his fellow merchants of doubt. But in order to do that, we need to set aside our greatest weakness: the belief that theories can be true, and must be proven true to be the basis for action. We don’t have to outrun Stephens’s uncertainty; we just have to outrun his love of the status quo. And instrumentalism is the pair of running shoes we need to do that.

Online learning: Definitely possible

There’s been a lot of talk over the past several years about online learning. Some people sing its praises without reservation. Others claim that it doesn’t work at all. I have successfully learned over the internet and I have successfully taught over the internet. It can work very well, but it requires a commitment on the part of the teacher and the learner that is not always present. In this series of posts I will discuss what has worked well and what hasn’t in my experience, specifically in teaching linguistics to undergraduate speech pathology majors.

Online learning is usually contrasted with an ideal classroom model where the students engage in two-way oral conversation, exercises and assessment with the instructor and each other, face to face in real time. In practice there are already deviations from this model: one-way lectures, independent and group exercises, asynchronous homeworks, take-home exams. The questions are really whether the synchronous or face-to-face aspects can be completely eliminated, and whether the internet can provide a suitable medium for instruction.

The first question was answered hundreds of years ago, when the first letter was exchanged between scholars. Since then people have learned a great deal from each other, via books and through the mail. My great-uncle Doc learned embalming through a correspondence course, and made a fortune as one of the few providers of Buddhist funerals in San Jose. So we know that people can learn without face-to-face, synchronous or two-way interaction with teachers.

What about the internet? People are learning a lot from each other over the internet. I’ve learned how to assemble a futon frame and play the cups over the internet. A lot of the core ideas about social science that inform my work today I learned in a single independent study course I took over email with Melissa Axelrod in 1999.

My most dramatic exposure to online learning was from 2003 through 2006. I read the book My Husband Betty, and discovered that the author, Helen Boyd, had an online message board for readers to discuss her book (set up by Betty herself). The message board would send me emails whenever someone posted, and I got drawn into a series of discussions with Helen and Betty, as well as Diane S. Frank, Caprice Bellefleur, Donna Levinsohn, Sarah Steiner and a number of other thoughtful, creative, knowledgeable people.

A lot of us knew a thing or two about gender and sexuality already, but Helen, having read widely and done lots of interviews on those topics, was our teacher, and would often start a discussion by posting a question or a link to an article. Sometimes the discussion would get heated, and eventually I was kicked off and banned. But during those three years I learned a ton, and I feel like I got a Master’s level education in gender politics. Of course, we didn’t pay Helen for this besides buying her books, so I’m glad she eventually got a full-time job teaching this stuff.

So yes, we can definitely learn things over the internet. But are official online courses an adequate substitute for – or even an improvement over – in-person college classes? I have serious doubts, and I’ll cover them in future posts.

Category fights: Splitting

Imagine that you belong to a category, like “tourist.” You fit all the necessary conditions for membership in that category: you are traveling to another part of the world for recreation. But that category has a bad reputation – literally a bad name. What do you do? You split the category.

In the past I’ve talked about other kinds of category fights: watchdogging alleged bait-and-switch tactics, or gatekeeping to prevent free-riding. Tonight I’m going to talk about splitting.

I grew up in the lovely arts colony of Woodstock, New York, which is crowded every summer and fall with tourists. They never bothered me too much, and they bought lots of stuff so that the merchants could afford to hire my parents, but my family and neighbors liked to complain about them. They drove too fast on our country roads, possibly contributing to the death of some of our dogs over the years. They filled up the parking lots and caused traffic jams on Mill Hill Road. They asked annoying questions – where was Yasgur’s farm? They were demanding and unreasonable to my sister and friends who worked in retail.

In terms of non-Platonic categories, there is a wide diversity of actual tourists, but the category is dominated in people’s minds by a stereotype of the Tourist, who is entitled, disrespectful, and lacks a proper appreciation for the people they are visiting and their culture. All tourists are tainted by the stereotype of the Tourist, but some people do pride themselves on being respectful, humble, open and curious. What can they do to advertise that to others?

As Lara Week documented in a study of several blogs in 2012, and described to Laurie Taylor on his Thinking Allowed podcast, one thing you can do is to split the category. A number of people have chosen to call themselves “travelers” instead of “tourists.” Week reports that they distinguish themselves by “doing what the locals do,” “respecting local cultures” and “being frugal,” and have added features like “seeking authenticity” and “going to ‘untraveled’ places.” She goes on to summarize critiques that argue that the self-styled travelers “fail to address all of the problems created by tourism,” but that is not directly relevant to the linguistic issues here.

The travelers, notably, split the category of “tourist” so that they are outside of it. They have concluded that the category is irredeemably contaminated, and their only hope is to escape it. In contrast, as Ben Zimmer reported last year, a number of people have tried to split the category of “pedestrian,” keeping the stereotype of pedestrians clean by placing people who text while walking into subcategories of “petextrians,” or “wexters.”

The cleanliness of the stereotype is one factor in determining whether people choose to split themselves off into another category or to split others off. It also determines whether people try to split themselves (or others) into a subcategory or into a completely new category. Another factor is how rigidly the category is defined. It is very hard to leave the category of “men,” so some men who feel that the stereotype is contaminated have responded with the #notallmen hashtag, trying to reclaim it by splitting the bad men into a subcategory.

African American English has accents too

Diversity is notoriously subjective and difficult to pin down. In particular, we tend to be impressed if we know the names of a lot of categories for something. We might think there are more mammal species than insect species, but biologists tell us that there are hundreds of thousands of species of beetles alone. This is true in language as well: we think of the closely-related Romance and Germanic languages as separate, while missing the incredible diversity of “dialects” of Chinese or Arabic.

This is also true of English. As an undergraduate I was taught that there were four dialects in American English: New England, North Midland, South Midland and Coastal Southern. Oh yeah, and New York and Black English. The picture for all of those is more complicated than it sounds, and when I went to Chicago I discovered that there are regional varieties of African American English.

In 2012 Annie Minoff, a blogger for Chicago public radio station WBEZ, took this oversimplification for truth: “AAE is remarkable for being consistent across urban areas; that is, Boston AAE sounds like New York AAE sounds like L.A. AAE, etc.” Fortunately a commenter, Amanda Hope, challenged her on that assertion. Minoff confirmed the pattern in an interview with variationist Walt Wolfram, and posted a correction in 2013.

In 2013 I was preparing to teach a unit on language variation and didn’t want to leave my students as misinformed as I – or Minoff – had been. Many of my students were African American, and I saw no reason to spend most of the unit on white varieties and leave African American English as a footnote. But the documentation is spotty: I know of no good undergraduate-level discussion of variation in African American English.

A few years before I had found a video that some guy took of a party in a parking lot on the West Side of Chicago. It wasn’t ideal, but it sort of gave you an idea. The link was dead, so I typed “Chicago West Side” into Google. The results were not promising, so on a whim I added “accent” and that’s how I found my first accent tag video.

Accent tag videos are an amazing thing, and I could write a whole series of posts about them. Here was a young black woman from Chicago’s West Side, not only talking about her accent but illustrating it, with words and phrases to highlight its differences from other dialects. She even talks (as many people do in these videos) about how other African Americans hear her accent in other places, like North Carolina. You can compare it (as I did in class) with a similar video made by a young black woman from Raleigh (or New York or California), and the differences are impossible to ignore.

In fact, when Amanda Hope challenged Minoff’s received wisdom on African American regional variation, she used accent tag videos to illustrate her point. These videos are amazing, particularly for teaching about language and linguistics, and from then on I made extensive use of them in my courses. There’s also a video made by two adorable young English women, one from London and one from Bolton near Manchester, where you can hear their accents contrasted in conversation. I like that I can go not just around the country but around the world (Nigeria, Trinidad, Jamaica) illustrating the diversity of English just among women of African descent, who often go unheard in these discussions. I’ll talk more about accent tag videos in future posts.

You can also find evidence of regional variation in African American English on Twitter. Taylor Jones has a great post about it that also goes into the history of African American varieties of English.

Is your face red?

In 1936, Literary Digest magazine made completely wrong predictions about the Presidential election. They did this because they polled based on a bad sample: driver’s licenses and subscriptions to their own magazine. Enough people who didn’t drive or subscribe to Literary Digest voted, and they voted for Roosevelt. The magazine’s editors’ faces were red, and they had the humility to put that on the cover.

This year, the 538 website made completely wrong predictions about the Presidential election, and its editor, Nate Silver, sorta kinda took responsibility. He had put too much trust in polls conducted at the state level. They were not representative of the full spectrum of voter opinion in those states, and this had skewed his predictions.

Silver’s face should be redder than that, because he said that his conclusions were tentative, but he did not act like it. When your results are so unreliable and your data is so problematic, you have no business being on television and in front-page news articles as much as Silver has.

In part this attitude of Silver’s comes from the worldview of sports betting, where the gamblers know they want to bet and the only question is which team they should put their money on. There is some hedging, but not much. Democracy is not a gamble, and people need to be prepared for all outcomes.

But the practice of blithely making grandiose claims based on unrepresentative data, while mouthing insincere disclaimers, goes far beyond election polling. It is widespread in the social sciences, and I see it all the time in linguistics and transgender studies. It is pervasive in the relatively new field of Data Science, and Big Data is frequently Non-representative Data.

At the 2005 meeting of the American Association for Corpus Linguistics there were two sets of interactions that stuck with me and have informed my thinking over the years. The first was a plenary talk by the computer scientist Ken Church. He described in vivid terms the coming era of cheap storage and bandwidth, and the resulting big data boom.

But Church went awry when he claimed that the size of the datasets available, and the computational power to analyze them, would obviate the need for representative samples. It is true that if you can analyze everything you do not need a sample. But that’s not the whole story.

A day before Church’s talk I had had a conversation over lunch with David Lee, who had just written his dissertation on the sampling problems in the British National Corpus. Lee had reiterated what I had learned in statistics class: if you simply have most of the data but your data is incomplete in non-random ways, you have a biased sample and you can’t make generalizations about the whole.
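Lee's point can be made concrete with a small simulation. What follows is only a sketch with invented numbers: the two groups of writers, their rates of using some feature, and the sizes of the datasets are all hypothetical, chosen to show how a dataset holding most of the population can still mislead when the gaps in it are not random.

```python
import random

rng = random.Random(2005)  # fixed seed so the run is repeatable

# Hypothetical population of 100,000 texts from two groups of writers.
# Group "a" uses some feature 80% of the time, group "b" 40% of the time.
population = (
    [("a", True)] * 24_000 + [("a", False)] * 6_000
    + [("b", True)] * 28_000 + [("b", False)] * 42_000
)

def feature_rate(texts):
    """Share of texts in which the feature is present."""
    return sum(has_feature for _, has_feature in texts) / len(texts)

group_a = [text for text in population if text[0] == "a"]
group_b = [text for text in population if text[0] == "b"]

# "Most of the data", but incomplete in a non-random way:
# only 20% of group a made it in, against 95% of group b.
big_biased = rng.sample(group_a, 6_000) + rng.sample(group_b, 66_500)

# A far smaller dataset, drawn at random from the whole population.
small_random = rng.sample(population, 2_000)

true_rate = feature_rate(population)
biased_rate = feature_rate(big_biased)
random_rate = feature_rate(small_random)

print(f"whole population:         {true_rate:.3f}")
print(f"72,500 texts, biased:     {biased_rate:.3f}")
print(f"2,000 texts, random draw: {random_rate:.3f}")
```

The biased dataset is more than thirty times the size of the random one, yet its estimate is pulled toward group b because group a is underrepresented in it. Collecting more data the same way would not fix that.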

I’ve seen this a lot in the burgeoning field of Data Science. There are too many people performing analyses they don’t understand on data that’s not representative, making unfounded generalizations. As long as these generalizations fit within the accepted narratives, nobody looks twice.

We need to stop making it easier to run through the steps of data analysis, and instead make it easier to get those steps right. Especially sampling. Or our faces are going to be red all the time.

The Digital Parisian Stage is now on GitHub

For the past five years I’ve been working on a project, the Digital Parisian Stage, that aims to create a representative sample of nineteenth-century Parisian theater. I’ve made really satisfying progress on the first stage, 1800 through 1815, which corresponds to the first volume of Charles Beaumont Wicks’s catalog, The Parisian Stage (1950). Of the initial one-percent sample (31 plays), I have obtained 24, annotated 15 and discarded three for length, for a current total of twelve plays.
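For readers who want to try something similar with another catalog, here is a minimal sketch of drawing a one-percent simple random sample. It does not describe the procedure actually used for this project. The entry count of 3,100 is an assumption inferred from "one percent = 31 plays", and the entry labels, the `draw_sample` function and the seed are all hypothetical.

```python
import random

# Hypothetical stand-in for a catalog volume. The count of 3,100 entries is
# an assumption inferred from "one percent = 31 plays", not a catalog figure.
catalog_entries = [f"entry-{number:04d}" for number in range(1, 3101)]

def draw_sample(entries, fraction, seed):
    """Draw a simple random sample without replacement.

    Every entry has the same chance of selection, which is what lets
    the sample stand in for the whole catalog.
    """
    sample_size = round(len(entries) * fraction)
    return sorted(random.Random(seed).sample(entries, sample_size))

sample = draw_sample(catalog_entries, 0.01, seed=1800)
print(len(sample), "entries drawn, starting with", sample[:3])
```

Recording the seed means that anyone can regenerate the same list and check that no play was added or dropped for convenience.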

At conferences like the Keystone Digital Humanities Conference and the American Association for Corpus Linguistics, I’ve presented results showing that these twelve plays cover a much wider and more innovative range of language than the four theatrical plays from this period in the FRANTEXT corpus, a sample drawn fifty years ago based on a “principle of authority.”

Just looking at declarative sentence negation, I found that in the FRANTEXT corpus the playwrights negate declarative sentences with the ne … pas construction 49 percent of the time. In the twelve randomly sampled plays, the playwrights used ne … pas 75 percent of the time to negate declarative sentences. Because this was a representative sample, I even have a p value below 0.01, based on a chi-square goodness of fit test!
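For anyone curious about the mechanics, a chi-square goodness of fit test on a two-way outcome like this can be sketched in a few lines of Python. The counts below are hypothetical, since only the percentages are given above. Treating the FRANTEXT rate of 49 percent as the expected proportion is also an assumption made for illustration, as is the function name.

```python
import math

def chi_square_gof_two_categories(observed_hits, observed_total, expected_rate):
    """Chi-square goodness of fit for a two-category outcome (1 degree of freedom).

    Returns the test statistic and its p value.
    """
    observed = [observed_hits, observed_total - observed_hits]
    expected = [observed_total * expected_rate,
                observed_total * (1 - expected_rate)]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom the chi-square tail probability has a
    # closed form in terms of the complementary error function.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: 400 negated declarative sentences in the random sample,
# 300 of them (75 percent) using ne … pas, tested against an expected 49 percent.
statistic, p_value = chi_square_gof_two_categories(300, 400, 0.49)
print(f"chi-square = {statistic:.1f}, p < 0.01: {p_value < 0.01}")
```

The closed form only holds for two categories. With more categories, a statistics library such as SciPy would be the safer route.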

This seems like a good point to release the twelve texts that I have OCRed and cleaned to the public. I have uploaded them to GitHub as HTML files. In this I have been partly inspired by the work of Alex Gil, now my colleague at Columbia University.

You can read them for your own entertainment (Jocrisse-maître et Jocrisse-valet is my favorite), stage your own production of them (I’ll buy tickets!) or use them as data for your scientific investigations. I hope that you will also consider contributing to the repository, by checking for errors in the existing texts, adding new texts from the catalog, or converting them to a different format like TEI or Markdown.

If you do use them in your own studies, please don’t forget to cite me along the lines given below, or even to contact me to discuss co-authorship!

Grieve-Smith, Angus B. (2016). The Digital Parisian Stage Corpus. GitHub. https://github.com/grvsmth/theatredeparis

Nobody’s Boy

I got a paper rejected from a generativist conference a few years ago. A generativist friend of mine said, “Why did you bother submitting your paper to that conference? You knew they were going to reject it.” I said, “Well, the conference was in town, so I figured I’d send something in anyway.”

My friend proceeded to tell me a story from her early grad school days about reviewing papers for her school’s signature conference. She sat down one evening with Professor Big Deal, who glanced through the stack of anonymous submissions and sorted them one by one into piles. “This is from one of Professor X’s students, and this is from one of Professor Y’s students. Here’s another from Professor X’s group. This must be Professor Z.” She continued like this until all the papers were sorted, and then as I recall she had some formula for allocating time to each professor and their students.

I think about this a lot, because I’m not a Student Of anyone in particular. On paper I may look like a student of Professor Bigshot, and that’s probably how my paper got accepted to a conference where Professor Bigshot was a keynote speaker. But I’m not really a Student Of Professor Bigshot. I didn’t ask her to be on my committee. And I know she doesn’t think of me as a Student Of hers, because she was sitting in front of me later in that conference, and walked out of the room right before it was my turn to present my paper.

My relationship with my actual advisor is Complicated, but suffice it to say that we don’t work in the same subfield of linguistics, and I’m tied to the New York area, where she doesn’t have the pull to get me a job anyway. My relationships with my other committee members are problematic in various ways. I’m on good terms with plenty of other linguists, but since I’m not their Student their loyalty to me is always secondary.

Even if my friend’s story about Professor Big Deal is an egregious outlier, it is still a regular occurrence to see professors co-authoring and co-presenting papers with their students, making introductions and writing letters. If you know me professionally, I can pretty much guarantee that we were not introduced by Professor Bigshot, or by any member of my committee. If you’ve seen me present my research, or read it anywhere, or hired me, it’s entirely through my own hard work. I have not had any of the advantages that come with being a Student Of anyone.

You could say that it’s my fault for not choosing the right advisors, or for the problems in my relationships with my advisors. In my defense I would argue that most of the problems in these relationships came from my putting my wife’s progress on the tenure track, and keeping my kid out of daycare ten hours a day, ahead of my own progress on the PhD. But even if you disagree, does that mean that I deserve to be a second-class citizen in the field?

I know I’m not the only academic orphan out there. Maybe we should get together and found a Home for Orphaned Linguists, where we can hope to someday be adopted by professors with generous allocations of reassigned time, who will co-author with us and introduce us and attend our talks. Some day…

Sampling is a labor-saving device

Last month I wrote those words on a slide I was preparing to show to the American Association for Corpus Linguistics, as a part of a presentation of my Digital Parisian Stage Corpus. I was proud of having a truly representative sample of theatrical texts performed in Paris between 1800 and 1815, and thus finding a difference in the use of negation constructions that was not just large but statistically significant. I wanted to convey the importance of this.

I was thinking about Laplace finding the populations of districts “distributed evenly throughout the Empire,” and Student inventing his t-test to help workers at the Guinness plants determine the statistical significance of their results. Laplace was not after accuracy; he was going for speed. Student was similarly looking for the minimum amount of effort required to produce an acceptable level of accuracy. The whole point was to free resources up for the next task.
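As a rough illustration of that trade-off, the Python sketch below shows how the p-value of a two-category chi-square goodness-of-fit test shrinks as the sample grows while the proportions stay the same. It rests on loud assumptions: the observed rate is pinned at exactly 75 percent and the expected rate at 49 percent (the figures reported for the corpus), the sample sizes are made up, and the function name is hypothetical. Real samples would wobble around those rates, so this is a toy, not a power analysis.

```python
import math


def gof_p_value(sample_size, observed_rate, expected_rate):
    """p-value of a two-category chi-square goodness-of-fit test (1 df)."""
    hits = round(sample_size * observed_rate)
    observed = [hits, sample_size - hits]
    expected = [sample_size * expected_rate,
                sample_size * (1 - expected_rate)]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical sample sizes, chosen as multiples of 4 so that 75 percent
# of each is a whole number of sentences.
sample_sizes = [8, 16, 32, 64, 128]
p_values = {n: gof_p_value(n, 0.75, 0.49) for n in sample_sizes}

for n, p in p_values.items():
    verdict = "significant" if p < 0.01 else "not yet significant"
    print(f"n = {n:4d}  p = {p:.2g}  ({verdict} at 0.01)")
```

Under these assumptions the p-value crosses the 0.01 threshold somewhere between 16 and 32 sentences. Every observation collected after that point only makes an already significant result more significant, at least for this one variable.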

I attended one paper at the conference that gave p-values for all its variables, and they were all 0.000. After that talk, I told the student who presented that those values indicated he had oversampled, and he should have stopped collecting data much sooner. “That’s what my advisor said too,” he said, “but this way we’re likely to get statistical significance for other variables we might want to study.”

The student had a point, but it doesn’t seem very – well, “agile” is a word I’ve been hearing a lot lately. In any case, as the conference was wrapping up, it occurred to me that I might have several hours free – on my flight home and before – to work on my research.

My initial impulse was to keep doing what I’ve been doing for the past couple of years: clean up OCRed text and tag it for negation. Then it occurred to me that I really ought to take my own advice. I had achieved statistical significance. That meant it was time to move on!

I have started working on the next chunk of the nineteenth century, from 1816 through 1830. I have also been looking into other variables to examine. I’ve got some ideas, but I’m open to suggestions. Send them if you have them!