The story of SignSynth

December 24, 2025 Angus Andrea Grieve-Smith

The beginning

Every morning in the fall of 1997 I would wake up at 6AM and immediately jump out of bed. I went straight to my Fujitsu laptop in the home office, leaving my then-girlfriend sleeping in the bedroom. I didn’t have anywhere to be; nobody was paying me, and this wasn’t for a class. I was working on a new project: a sign-language synthesis prototype.

This project was the culmination of several threads in my life. I had been inspired to go back to grad school by spending time talking about language with my girlfriend, a linguist who had just earned her PhD. For the past two years I had been making my living in information technology, working my way up from Microsoft Office trainer to LAN operations tech. I had also been taking night classes at the American Sign Language Institute and participating in discussions on SLLING-L, a sign linguistics email list, and the SignWriting List.

I had chosen the linguistics doctoral program at the University of New Mexico because I knew it was strong in sign linguistics, and I had just arrived in August. I was taking a course in psycholinguistics with the sign linguist Jill Morford, and a course in ASL.

Several people had asked me, “So you’re interested in language, and you’re interested in computers. Is there a way you could combine these interests?” I replied, “Well, there’s speech synthesis,” but that technology was already pretty well established. I realized that there wasn’t much in the area of technology for sign languages. What would sign synthesis look like?

When I arrived in Albuquerque I mentioned this idea to Sean Burke, a linguist and programmer I had made friends with on an earlier visit. Sean suggested using Virtual Reality Modeling Language, a standard for describing three-dimensional objects and their movements. People could install a plugin in their web browsers, and point it to a VRML site, and the plugin would display the 3D animation. (VRML has since been succeeded by X3D, maintained by the Web3D Consortium.)

I followed Sean’s tip and discovered that there was a standard for specifying and animating humanoids in VRML, and a way to control the plugin through JavaScript. I created a basic animation of a sign using JavaScript, but after numerous frustrations I concluded that it was best to assemble the VRML through a custom Perl templating system controlled by CGI forms.
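For readers who have never seen VRML, here is a minimal sketch of what templating an animation can look like. It is written in Python rather than Perl, and it is not the original SignSynth template: the template text, the function names and the joint name `hanim_r_elbow` are all assumptions made for illustration. It also assumes the joint has been declared elsewhere in the same VRML file with a matching `DEF` name.

```python
from string import Template

# Hypothetical template: rotate one joint from a start pose to an end pose.
# TimeSensor, OrientationInterpolator and ROUTE are standard VRML97 constructs;
# the DEF names are invented for this example.
VRML_TEMPLATE = Template("""\
DEF ${name}Timer TimeSensor { cycleInterval $seconds loop TRUE }
DEF ${name}Rot OrientationInterpolator {
  key [ 0, 1 ]
  keyValue [ $start, $end ]
}
ROUTE ${name}Timer.fraction_changed TO ${name}Rot.set_fraction
ROUTE ${name}Rot.value_changed TO $joint.set_rotation
""")


def rotation(axis, angle):
    """Format a VRML rotation: an axis (x y z) followed by an angle in radians."""
    x, y, z = axis
    return f"{x} {y} {z} {angle:.3f}"


def joint_animation(name, joint, axis, start_angle, end_angle, seconds=1.0):
    """Fill in the template for a single joint movement."""
    return VRML_TEMPLATE.substitute(
        name=name,
        joint=joint,
        seconds=seconds,
        start=rotation(axis, start_angle),
        end=rotation(axis, end_angle),
    )


# Bend a (hypothetical) right elbow joint by about 90 degrees over one second.
snippet = joint_animation("Elbow", "hanim_r_elbow", (1, 0, 0), 0.0, 1.571)
print(snippet)
```

In a setup like the one described above, the values passed to `joint_animation` would come from a web form rather than being hard-coded.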

Getting the word out

Once my system was working to the point that it could create intelligible signs, I showed it to the faculty and students who were studying sign languages. The strongest interest came from Jill Morford, my psycholinguistics professor, who recognized that I had created a sign-language analog of the early speech synthesizers that were used at Haskins Labs to demonstrate the categorical nature of speech perception and its acquisition in language learners.

Jill realized that my system, which I had dubbed SignSynth, could be used to test whether perception of sign languages is similarly categorical, and how it is acquired. The following year she obtained a grant from the National Institutes of Health to study the question, and included some money to support me part time and get an office that I shared with a couple other students.

Around that time I had a meeting with my Committee on Studies. They liked my work, but told me that simply creating a sign synthesis system wasn’t theoretical enough for a Linguistics dissertation. If I used it, I would have to use it to demonstrate an answer to some theoretical question.

Doubts

That year I also took a course in literacy education called Teaching Reading to the ESL Student. The professor, Leila Flores-Dueñas, was happy to support me applying these lessons to Deaf students, but throughout the course she stressed the importance of grounding our teaching in the priorities of our students, and orienting our research to the goals of the populations we were studying. She pointed out further that we should be working to help lift up people who were oppressed or disadvantaged, so when we are working with people in those situations we have a particular obligation to fit our work to their priorities.

I realized that the same principle applied to developing applications: an app that touches disadvantaged people should help them to fulfill their goals. This brought back to mind the fact that I had never seen a Deaf person ask for a sign synthesis or natural language processing application.

I actually hadn’t talked much with Deaf people about language technology. This was in part because there is a general suspicion of hearing people in Deaf communities, particularly of hearing people who come bearing language technology. The suspicion is well earned; check out what Deaf people have to say about Alexander Graham Bell.

I had difficulty overcoming this suspicion because I had only been studying American Sign Language for a few years. I could communicate with Deaf people, but only if they were patient, and many of them had no reason to be patient with some random hearing person. I realized that I didn’t necessarily have any technology that could help them accomplish their goals, and if I did, they weren’t necessarily going to try it.

Challenges

That year I ran into two major challenges in sign synthesis, both relating to the placement of hands in space. The higher-level challenge relates to signs that are called depicting or iconic verbs, or classifier predicates. In these signs, the relative location, orientation and movement of the signer’s hands are used to refer to a similar spatial relationship or movement. In a classic example, an ASL signer can depict the movement of a car by making the “3” vehicle classifier shape with one hand and moving that hand in a scaled-down version of the car’s movements.

Iconic verbs are very difficult to specify in a user interface, because the level of detail of the location and movement is only constrained by the signer’s control of their hands, and the audience’s visual perception abilities. Specifying, representing and transmitting such signs is, at a minimum, a vastly different task than that for lexical signs, where there is a relatively small set of locations and movements.

The lower-level challenge involved figuring out how to bend the shoulder, elbow and wrist on a humanoid figure so that the figure’s hand winds up in a particular place, facing in a particular direction. This is a well-known problem, called inverse kinematics, that we humans solve intuitively every time we make a gesture or pick up an object.

There are computer libraries for inverse kinematics, and I tried to connect my animation code to those libraries, but they were written in C++, not Perl, and after several months I still hadn’t figured it out.
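To make the problem concrete, here is a minimal sketch of inverse kinematics for a much simpler case than a full humanoid arm: two rigid segments (upper arm and forearm) moving in a flat plane, with the shoulder at the origin. This is an illustration only, not SignSynth code or any particular library's method. The function names, the segment lengths and the planar simplification are all assumptions, and it ignores the wrist orientation that a real sign would also need.

```python
import math


def two_link_ik(x, y, upper=0.30, fore=0.25):
    """Return (shoulder, elbow) angles in radians that place the wrist at (x, y).

    The shoulder angle is measured from the x axis; the elbow angle is the
    bend of the forearm relative to the upper arm. Lengths are hypothetical.
    """
    dist_sq = x * x + y * y
    dist = math.sqrt(dist_sq)
    if dist > upper + fore or dist < abs(upper - fore):
        raise ValueError("target is out of reach")
    # The law of cosines gives the elbow bend from the shoulder-to-wrist distance.
    cos_elbow = (dist_sq - upper ** 2 - fore ** 2) / (2 * upper * fore)
    cos_elbow = max(-1.0, min(1.0, cos_elbow))  # guard against rounding error
    elbow = math.acos(cos_elbow)
    # Aim at the target, then correct for the offset the bent forearm introduces.
    shoulder = math.atan2(y, x) - math.atan2(
        fore * math.sin(elbow), upper + fore * math.cos(elbow)
    )
    return shoulder, elbow


def forward(shoulder, elbow, upper=0.30, fore=0.25):
    """Forward kinematics: where the wrist ends up for the given joint angles."""
    wx = upper * math.cos(shoulder) + fore * math.cos(shoulder + elbow)
    wy = upper * math.sin(shoulder) + fore * math.sin(shoulder + elbow)
    return wx, wy


# Check the solution by running it back through the forward calculation.
angles = two_link_ik(0.30, 0.20)
print(angles, forward(*angles))
```

Even this toy version has two valid answers for most targets (elbow bent one way or the other); the sketch always picks one. A three-dimensional arm with shoulder, elbow and wrist joints has many more, which is part of what makes the full problem hard.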

I described these challenges of inverse kinematics and representing classifier predicates to my committee, but they did not consider them to be theoretical enough for a dissertation. I was able to create the videos for Jill Morford’s categorical perception experiments, and she suggested that I could study categorical perception for my dissertation. In retrospect I should maybe have taken her up on it, but instead I decided to study a cosmopolitan, largely privileged community where the speakers were all long dead: the history of Parisian French through theater.

Legacy

My papers on SignSynth are routinely cited as one of the pioneering works in text-to-sign, mostly by developers who didn’t have a Professor Flores-Dueñas to teach them the importance of serving the language community, and who don’t follow my lead in separating synthesis from translation.

The paper that Jill Morford wrote with me and my fellow students about categorical perception has also had lasting influence; we failed to find a strong effect of categorical perception to match those found in speech, and the implications of that are still being discussed.

I never wanted to commercialize SignSynth, knowing that average incomes for Deaf people tend to be significantly lower than those of the general population, so I made it publicly available as a web application. The original application relied on third-party browser plugins to display the VRML, but over time, web browser makers dropped support for those plugins.

In 2022 I rewrote SignSynth from scratch in JavaScript, using a relatively new 3D graphics library, Three.js, and made the code available on GitHub. I updated the interface to follow “character pickers” like Richard Ishida’s IPA Character Picker, replacing the old interface that was heavy on drop-down menus. I also created a standalone character picker for the Stokoe notation that it used for the original text.

What SignSynth meant to me

Thinking back to 1997, what got me so excited, what inspired me to move across the country and start a doctoral program with no promise of funding, living off of part time jobs and student loans, and spend so many unpaid hours every day, was the idea that I was creating something new, something that could be useful to people.

When I was young, I admired inventors, both real-life inventors like Thomas Edison and fictional ones like Professor Bullfinch from the Danny Dunn books. My father, my stepfather and my mother’s boyfriends in between were all tinkerers, and they made useful things. I didn’t have money to buy a lot of tools and equipment, or space to keep it in, but I did have access to computers, and the skills to program them.

Looking back, there was some ego in this. Why was it important for me to make these things, and not someone else? Why not be satisfied fixing the Novell servers for Chase Manhattan, or even sending faxes for John Hancock?

To be fair to myself, I have never been a competitive inventor. While I was working on SignSynth I met several people who were working on similar systems, and I always tried to be positive, supportive and cooperative. I let the quality of my work speak for itself, and trusted people to judge its value for themselves.

I’ve never felt like I was the best inventor, programmer or researcher in the world, but even in 1997 I had a fair amount of experience and education in research and technology. I felt like those skills were being wasted when I was taking telephone messages at John Hancock, and even when I was reinstalling network drivers. Of course, there are a lot of highly skilled people out there, maybe more than there’s a need for.

After putting SignSynth on the shelf and focusing on French negation for my dissertation, I didn’t feel the same level of excitement; the theoretical advancements that my committee demanded felt small by comparison, although I’m still proud of them.

I have felt a bit more excitement for my recent work for the New School. It’s a simpler project, just retrieving class listings or final grades from the student information system and presenting them in a table, but it helps save time and effort for students and faculty. Less exciting, but still satisfying.

Photo of a modern woman dressed as a seventeenth-century English Puritan colonist, wearing a gray skirt and yellow sweater with white headscarf, collar and apron, in a reconstructed seventeenth-century kitchen with period cookware made of wood and ceramics.

I don’t want to chat with a machine

October 15, 2025 Angus Andrea Grieve-Smith

I’ve always been uncomfortable around historical re-enactors at museums – people who are walking around in period costume while visitors in modern clothes are browsing the exhibits. I’m sure most of them are nice people, I get the idea that seeing people in period costume can help bring the past to life for visitors, and I’m happy if they talk to me as presenters (“I’m dressed as so-and-so, who was enslaved…”). But I can’t have a conversation with them in character.

I understand performance; I did my PhD on theater. I can watch actors on screen delivering lines and learn from them, or be moved and entertained by their performances. I’ve performed on stage a bit. I’m very familiar with the suspension of disbelief that comes with performance.

I even understand that there’s a bit of performance in many interactions, from museum guides to teaching to customer service, even with friends and family. And often there’s suspension of disbelief involved.

So why do I get so uncomfortable talking to historical re-enactors? Because it’s not clear who I’m performing for. I don’t have a role, unless it’s time traveler from the twenty-first century. If there are other visitors to the museum, and I think for some reason they’d benefit from seeing the historical character interact with a time traveler, sure I’d consider it. But if it’s just me and the presenter, and maybe a friend or family member of mine? The presenter has seen these performances hundreds of times. How does that fit in their job description?

But this post is not about historical re-enactors, it’s about “AI” – chatbots, large language models. And just as I don’t want to talk to a historical re-enactor in character, I don’t want to pretend I’m having a conversation with a chatbot.

I program computers for a living, and one thing I do is make chatbots. The most successful one doesn’t pretend to be a customer service agent, or a personal assistant, or an omniscient oracle. It just takes commands, like “summary for 123456” where 123456 is the number of a ticket in our customer service tracking system.

This is because the program has a function: it shows you summaries for tickets. It doesn’t think, it doesn’t have feelings, it doesn’t care if you’re nice or clever or concise. It just waits for input and provides an output.
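A command-style bot of this kind can be sketched in a few lines. This is a hypothetical illustration, not the code of the bot described above: the `TICKETS` dictionary stands in for a real ticket tracking system, and the reply wording is invented.

```python
import re

# Hypothetical stand-in for the customer service tracking system.
TICKETS = {"123456": "Customer cannot log in after password reset."}

# The one command the bot understands: "summary for <ticket number>".
COMMAND = re.compile(r"^\s*summary\s+for\s+(\d+)\s*$", re.IGNORECASE)


def handle(message, tickets=TICKETS):
    """Wait for input, provide an output. No persona, no small talk."""
    match = COMMAND.match(message)
    if match is None:
        return "Usage: summary for <ticket number>"
    number = match.group(1)
    summary = tickets.get(number)
    if summary is None:
        return f"No ticket found with number {number}."
    return f"Ticket {number}: {summary}"


print(handle("summary for 123456"))
print(handle("Hi there, how are you today?"))
```

Anything that is not a recognized command gets the usage line back, so there is nothing to role-play with.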

And I know, because I used to work on language models, that all chatbots are programs like the ones I write every day. They wait for input and provide an output. The difference with large language models is that they’re “fancy autocomplete” – as Emily Bender calls it. LLMs like ChatGPT are programmed to examine a chat and provide the most likely response.

So why do chatbots often write things that look like they care? Why do they sometimes seem to be impressed or have other feelings about what people write to them? Because those are the most common responses they find in their training corpora. Just like the historical re-enactors are paid (or volunteer) to perform, the chatbots are programmed to perform.
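The "most common response" idea can be shown with a toy model that only counts which word most often follows another in a tiny corpus. This is a deliberate oversimplification: real large language models use neural networks over much longer contexts, and the corpus and function names here are invented for illustration.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Count, for each word, how often each other word follows it."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def most_likely_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# A made-up training corpus full of polite phrases.
corpus = [
    "thank you so much",
    "thank you for asking",
    "thank you so much for sharing",
]
model = train(corpus)
print(most_likely_next(model, "thank"))
print(most_likely_next(model, "you"))
```

The model produces "thank you so" because that sequence is frequent in its corpus, not because it is grateful.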

And similar to the way historical re-enactors in museums aren’t paid to care about other people’s performances, chatbots aren’t programmed to care about the performance of the people who type things at them. Anything that looks like caring is itself a performance.

So what is the point of pretending that the chatbot is a customer service agent, or a personal assistant? If anyone ever reads my chat transcripts, they’re not going to do it to appreciate how well I play the role of a customer, or a VIP. If I want to perform for myself, I’ve got better ways of doing it.

I’m aware that there’s a whole industry of “prompt engineering” that’s developed in the past few years. It’s more like prompt guesswork: trial and error of various combinations of input text with the goal of getting the language model to produce a particular output, and no guarantee that the prompts will work on any other models, or even other versions of the same model. If I thought that there was some great value to be gotten from the model’s output I might try it, but I haven’t heard about anything worth performing this awkward pantomime for.

Cropped cover from the Who's album Who's Next

Everybody fronts

April 4, 2025 Angus Andrea Grieve-Smith

I could write – and have written – the past twenty years as a string of successes for my linguistics career: I have presented my work at many conferences on linguistics, digital humanities and literature, in several countries. My pioneering work in computational sign linguistics continues to be cited on a regular basis. In 2008 I was hired to teach linguistics at Saint John’s University, where I developed an introductory linguistics curriculum. In 2009 I received my doctorate in linguistics. In 2012 I was hired to work on natural language processing at New York University. In 2016 I released the first segment of the Digital Parisian Stage. In 2019 I published my first book, Building a Representative Theater Corpus: A Broader View of Nineteenth-Century French. In 2021 I released LanguageLab, an app for audio mimicry exercises.

But of course, on the internet people put up a front. I could just as easily write a history of failure; in fact I hear it in my head on a regular basis: In 2008 my dissertation advisor refused to write letters of recommendation for me; my entire committee has shown very little interest in my work and has barely cited me. In 2013 the grant I was hired to work on at NYU ended, and with a brief exception I haven’t been hired to do any further language modeling research. I’ve applied to dozens of full-time teaching jobs, but have not even been invited to an interview. I couldn’t make a living as an adjunct instructor in New York, so in 2015 I stopped teaching and in 2016 took a job as a computer programmer, with no linguistics focus. As far as I know, hardly anyone has used my sign language software, my French theater corpus, my book or LanguageLab.

The truth is somewhere in between glorious success and abject failure. A lot of my difficulty has been due to circumstances beyond my control. The professor who I thought would be my dissertation chair was not actively publishing by the time I started my studies, retired right before I proposed my thesis, and sustained severe brain damage in a bicycle crash shortly after I defended my dissertation. I graduated into an incredibly difficult job market, due to cuts to liberal arts education and general research funding and a glut of underemployed people with humanities PhDs.

I have also made principled choices that have affected my career. The professor who I cited extensively in my dissertation has never publicly acknowledged my work, I suspect because I declined to participate in her patronage system. I intentionally chose not to work on projects that I suspected were extractive, exploitative or overhyped. In order to take care of my young child and prioritize my wife’s career, I moved thousands of miles away from my university, suspended and delayed my doctorate, and limited my job search to opportunities near her job. In the past ten years I’ve similarly limited my job options in order to care for my mother, who is severely disabled with Parkinson’s disease.

For decades I’ve been a prolific poster on the social media of the times – email lists and USENET in the nineties, bulletin boards and blogs in the early 2000s, Twitter and Tumblr in the teens. I’ve mostly projected the first image of myself – the authoritative expert – while fearing that people would find out about my failures.

That confident persona often came off as patronizing or condescending, but I felt it was necessary to assert my authority and stake out my territory in order to be taken seriously, to advance my career. I tried to do it with as much compassion and generosity as I could.

I’m still proud of my knowledge and my accomplishments, and I want to be recognized for them. But now that my academic career appears to be stalled at best, there doesn’t seem to be much point in being brash the way I was.

The thing is, we’re all putting up a front. Academic competition seems to demand it. It’s very hard to do it without being cruel or indifferent to others.

Recognizing that everyone fronts means that it’s important to have compassion for the ways that other people front. Judging others for being cruel or indifferent is reasonable. Judging them for fronting at all is not.

Your athletic voice

September 24, 2023 Angus Andrea Grieve-Smith

This is the sixth post in a series inspired by Lake Bell’s audiobook chapter “Sexy Baby Voice.” In previous posts last year, I’ve covered the three key features she uses to define this vocal style – bright resonance (which Bell refers to as “high pitch”), creaky voice (“vocal fry”) and legato articulation (“slurring”), and discussed the various ways that we can manipulate our vocal tracts to create or amplify bright or dark resonances.

Elizabeth Holmes, wearing a black jacket over a black turtleneck with her hair in a bun, gestures in an explanatory way

Photo: Glen Davis

In my most recent post I talked about Bell’s disingenuous use of the phrase “here’s my voice” to suggest that bright resonance is fake, and mentioned a similar scene in the sitcom Loudermilk where the title character contradicts a young woman when she says, “this is my voice,” based on her use of creaky voice. Conversely, Elizabeth Holmes’s use of low pitch was seen as evidence that she wasn’t using her “real voice” – that her voice was as fraudulent as her business.

On an individual level, these accusations of fakery are simply false – these people are all using their own voices, and not pretending to be other people. The accusation is not that they’re trying to use someone else’s voice, but that they’re adopting vocal qualities that aren’t “really theirs.”

People like Bell, or Loudermilk, or the people who excitedly shared clips of Holmes speaking in a normal pitch range for American women, have strong feelings about this. Why do they care? I’ve heard three arguments: that producing speech with these qualities takes effort, that it covers up the speaker’s “real voice,” and, as an accusation of motive, that these false voices are false pretenses to try and get something.

First, the effort: the biggest difference between “vocal fry” by itself and “sexy baby voice” is bright resonance. As I discussed in my post on youth, bright resonance is generally associated with youth and femininity, because it’s usually caused by small vocal folds in small vocal tracts, and women and children tend to be smaller and have thinner vocal folds. Even younger women tend to have brighter resonance than older women, primarily because of the effect of hormonal changes during childbirth and menopause.

It’s worth noting that Bell’s main target is older women who retain that combination of bright resonance and creak (or maybe even adopt it) in their middle age and beyond. Her claim is that these women’s “real voices” have darker resonances, so the bright resonance is fake, and it requires effort. So when she says that “sexy baby voice” is “athletic,” that only applies to older women.

I can confirm that point, at least from my own experience, having testosterone-thickened vocal folds and a relatively large vocal tract: producing bright resonance takes a fair amount of work, especially if you try to adopt the habits all at once later in life. But if the women in question have adopted these habits gradually, as their vocal tracts change, then it may not take a lot of conscious effort.

Bell’s criticism of older women with “sexy baby voice” reminds me of similar criticism of older women for “dressing young,” in ways that may or may not be overtly sexualized, or taking on other features of the appearance of young women, like hairstyle, hair color and mannerisms, or even getting cosmetic surgery with the goal of looking younger. She is essentially accusing these older women of faking youth – or trying to – with their voices.

Bell’s message to these older women – you don’t have to try so hard! You can be accepted and valued for your maturity! Chasing youthful appearance is a trap! – seems benign on the surface. It’s not clear she’s getting through, just as it’s not clear others are getting through with similar messages about older women wearing short skirts or getting facial surgery. And if the effort is not that much once the habits are established, then there’s not much to the argument in the case of voice.

In a future post I’ll talk about the next argument, that older women who produce creaky voice with bright resonance and legato articulation are covering up their “real” voices, which of course use modal phonation, dark resonance and staccato articulation.

The supposed successes of AI

May 30, 2023 Angus Andrea Grieve-Smith

The author and cat, with pirate hats and eye patches superimposed by a Snapchat filter
I’m a regular watcher of Last Week Tonight with John Oliver, so in February I was looking forward to his take on “AI” and the large language models and image generators that many people have been getting excited about lately. I was not disappointed: Oliver heaped a lot of much-deserved criticism on these technologies, particularly for the ways they replicate prejudice and are overhyped by their developers.

What struck me was the way that Oliver contrasted large language models with more established applications of machine learning, portraying those as uncontroversial and even unproblematic. He’s not unusual in this: I know a lot of people who accept these technologies as a fact of life, and many who use them and like them.

But I was struck by how many of these technologies I myself find problematic and avoid, or even refuse to use. And I’m not some know-nothing: I’ve worked on projects in information retrieval and information extraction. I developed one of the first sign language synthesis systems, and one of the first prototype English-to-American Sign Language machine translation systems.

When I buy a new smartphone or desktop computer, one of the first things I do is to turn off all the spellcheck, autocorrect and autocomplete functions. I don’t enable the face or handprint locks. When I open an entertainment app like YouTube, Spotify or Netflix I immediately navigate away from the recommended content, going to my own playlist or the channels I follow. I do the same for shopping sites like Amazon or Zappos, and for social media like Twitter. I avoid sites like TikTok where the barrage of recommended content begins before you can stop it.

It’s not that I don’t appreciate automated pattern recognition. Quite the contrary. I’ve been using it for years – one of my first jobs in college was cleaning up a copy of the Massachusetts Criminal Code that had been scanned in and run through optical character recognition. For my dissertation I compiled a corpus from scanned documents, and over the past ten years I’ve developed another corpus using similar methods.

I feel similarly about synonym expansion – modifying a search engine to return results including “bicycle” when someone searches for “bike,” for example. I worked for a year for a company whose main product was synonym expansion, and I was really glad a few years later when Google rolled it out to the public.
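Here is a minimal sketch of synonym expansion over a handful of documents. It is not how Google or any commercial product implements the feature: the synonym table, the whole-word matching and the `expand` switch are assumptions made for illustration.

```python
# Hypothetical synonym table; a real system would have a far larger one.
SYNONYMS = {"bike": {"bicycle"}, "bicycle": {"bike"}}


def expand_query(query, synonyms=SYNONYMS, expand=True):
    """Turn each query term into a set of acceptable alternatives."""
    groups = []
    for term in query.lower().split():
        group = {term}
        if expand:
            group |= synonyms.get(term, set())
        groups.append(group)
    return groups


def search(query, documents, expand=True):
    """Return documents containing at least one alternative for every query term."""
    groups = expand_query(query, expand=expand)
    results = []
    for doc in documents:
        words = set(doc.lower().split())
        if all(group & words for group in groups):
            results.append(doc)
    return results


docs = ["Bicycle repair shop", "Bike lanes in Queens", "Subway map"]
print(search("bike", docs))
print(search("bike", docs, expand=False))
```

With expansion on, a search for "bike" also finds the "Bicycle repair shop" entry; with `expand=False` it only matches the literal word.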

There are a couple of other things that I find useful, like suggested search terms, image matching for attribution and Shazam for saving songs I hear in cafés. Snapchat filters can be fun. Machine translation is often cheaper than a dictionary lookup.

Using these technologies as fun toys or creative inspiration is fine. Using them as unreliable tools that need to be thoroughly checked and corrected is perfectly appropriate. The problem begins when people don’t check the output of their tools, releasing it as completed work. This is where we get the problems documented by sites like Damn You Auto Correct: often humorous, but occasionally harmful.

My appreciation for automated pattern recognition is one of the reasons I’m so disturbed when I see people taking it for granted. I think it’s the years of immersion in all the things that automated recognizers got wrong, garbled or even left out completely that makes me concerned when people ignore the possibility of any such errors. I feel like an experienced carpenter watching someone nailing together a skyscraper out of random pieces of wood, with no building inspectors in sight.

When managers make the use of pattern recognition or generation tools mandatory, it goes from being potentially harmful to actively destructive. Search boxes that won’t let users turn off synonym expansion, returning wildly inaccurate results to avoid saying “nothing found,” make a mockery of the feature. I am writing this post on Google Docs, which is fine on a desktop computer, but the Android app does not let me turn off spell check. To correct a word without choosing one of the suggested corrections requires an extra tap every time.

Now let’s take the example of speech recognition. I have never found an application of speech recognition technology that personally satisfied me. I suppose if something happened to my hands that made it impossible for me to type I would appreciate it, but even then it would require constant attention to correct its output.

A few years ago I was trying to report a defective street condition to the New York City 311 hotline. The system would not let me talk to a live person until I’d exhausted its speech recognition system, but I was in a noisy subway car. Not only could the recognizer not understand anything I said, but the system was forcing me to disturb my fellow commuters by shouting selections into my phone.

I’ve attended conferences on Zoom with “live captioning” enabled, and at every talk someone commented on major inaccuracies in the captions. For people who can hear the speech it can be kind of amusing, but if I had to depend on those captions to understand the talks I’d be missing so much of the content.

I know some deaf people who regularly insist on automated captions as an equity issue. They are aware that the captions are inaccurate, and see them as better than nothing. I support that position, but in cases where the availability of accurate information is itself an equity issue, like political debates for example, I do not feel that fully automated captions are adequate. Human-written captions or human sign language interpreters are the only acceptable forms.

Humans are, of course, far from perfect, but for anything other than play, where accuracy is required, we cannot depend on fully automated pattern recognition. There should always be a human checking the final output, and there should always be the option to do without it. It should never be mandatory. The pattern recognition apps that are already all around us show us that clearly.

In a captioned scene from Loudermilk, a salesclerk says to Loudermilk, "I can't help it. This is my voice."

That is not your voice

December 25, 2022 Angus Andrea Grieve-Smith

This is the fifth post in a series inspired by Lake Bell’s audiobook chapter “Sexy Baby Voice.” In previous posts I’ve covered the three key features she uses to define this vocal style – bright resonance (which Bell refers to as “high pitch”), creaky voice (“vocal fry”) and legato articulation (“slurring”), and discussed the various ways that we can manipulate our vocal tracts to create or amplify bright or dark resonances. Now I want to talk about your voice.

Not your voice, but what people mean when they say “your voice.” A friend who’s a vocal coach and read my earlier posts sent me a not-very-funny opening scene from a sitcom called Loudermilk, where the title character (played by Ron Livingston of Office Space) mocks and insults a young woman who takes his order at a coffee bar. The salesclerk is friendly, prompt and thorough; Loudermilk has no cause for complaint. His abuse is entirely based on his dislike for the sound of her voice.

Anyone who’s read this series or listened to the “Sexy Baby Voice” chapter will recognize three particular features of the salesclerk’s voice: bright resonance, creaky voice and legato articulation. The Loudermilk scene could have been inspired by the scenes about “sexy baby voice” in Lake Bell’s 2013 film about the voice-over industry, In a World…

Loudermilk mocks the salesclerk’s creaky voice by using creaky voice in his own responses, and the salesclerk asks “why are you talking like that?” Loudermilk responds, “This is my voice,” and she says, “No, it’s not.” After mocking her voice more and ranting a bit, he says, “just stop doing that.” Her response mirrors the earlier exchange: “I can’t help it, this is my voice,” to which he responds, “No, it’s not.”

As Loudermilk receives his coffee and leaves, the salesclerk, infuriated by his abuse, shouts at his back, “You’re a total dick!” Surprise! She doesn’t use legato articulation or creaky voice – because it’s really fucking hard to shout with either of those features. He turns back and says, “There, there you go, you’re talking!” as though she’d proven his point.

Loudermilk’s insistence that the salesclerk’s use of creaky voice is not “your voice” echoes a deleted scene from In a World… that Lake Bell includes in the audiobook chapter. In the scene, Bell’s character conducts “a vocal experiment” on another character who habitually uses “sexy baby voice.” She asks the other character to count to ten, alternating “the lowest point in your register” (i.e. with dark resonance) on odd numbers with “the highest point in your register” (bright resonance) on even numbers, and then say “Here’s my voice.”

Of course, “Here’s my voice” is the eleventh utterance in the sequence. As an odd-numbered utterance, Bell’s character pronounces it with relatively dark resonance, and the other character follows suit. As with the Loudermilk scene, we’re meant to marvel at the transformation: this woman’s True Voice, stripped of all that sexy baby junk! The message of both scenes is the same: that “sexy baby voice” is fake and women only use it because they’re insecure, but maybe they can be tricked into experiencing the power of their True Voices.

I don’t know about you, but when I first heard the deleted scene with the “vocal experiment,” the first thing I thought of was Elizabeth Holmes, the business executive who is currently in prison for selling a fake technology to investors. In addition to amassing wealth and power through lies and hype, Holmes is famous for having an unusually low voice for a woman – not just dark resonances, but when she speaks publicly, her fundamental frequency is in the range more typically used by American men.

During the height of Holmes’s success, several people felt that her claims were too good to be true, and they suspected her voice of being fake too. When recordings surfaced of Holmes speaking in a more typical pitch range for an American woman, that was presented as casting doubt on her honesty in general. Is her voice as big a fraud as her company?

I’ll have more to say about the notion of “your voice” and what it means to accuse someone of habitually using a fake voice, but astute observers may note that this double bind – don’t talk too “high-pitched,” but don’t talk too low-pitched either! – echoes the double binds put on women in all kinds of areas – be assertive but not bossy! be attractive but not slutty!

Slurring sexy babies

Leave a comment December 16, 2022 Angus Andrea Grieve-Smith

Recently I’ve written a few posts in response to the notion of “sexy baby voice” in Lake Bell’s latest audiobook. Bell identifies “sexy baby voice” with three characteristic features: “high pitch” (which I argue is actually bright resonance), “vocal fry” (what phoneticians call creaky voice) and “slurring.” I’ve argued that while bright resonance can be controlled to some degree, it is characteristic of youth and femininity, and that creaky voice is the only way that some young women can add darker resonance (and hence a bit of gravitas) without sounding tomboyish or fussy.

I wanted to write a quick post about Bell’s third criterion, “slurring,” which Gladwell summarizes as “running some words together,” and “sentences without spaces.” Bell’s caricature of slurring gets to the point where she sounds like she’s doing an impression of a drunk sorority girl, but in moderation this is a well-documented pattern of speech variation: some people are noted for short, quick transitions from one speech segment to the next and from one intonational pitch to the next, known as “staccato” articulation, while others take these transitions more gradually, designated by the Italian word “legato.”

Guess what the legato vs. staccato articulation patterns are associated with? Gender. I learned it from my voice teachers, Kristy Bissell and Erin Carney, as part of lessons on developing gender expression in the voice. I’m not familiar with research on this in phonetics, if any has been done.

Basically, staccato articulation is stereotypically associated with men barking orders, while legato articulation is associated with women discussing things in soft, flowing ways. Yes, these are stereotypes, and we can all think of women who bark orders and men with soft, legato articulation. But those women are perceived as acting masculine when they speak with staccato articulation, and men speaking legato are perceived as speaking in feminine ways.

It’s understandable why the use of legato articulation bothers Lake Bell so much: it’s the antithesis of a particular voice-over style that she admires. In her chapter she includes an audio clip of a film she made in 2013, In a World… Before listening to this chapter I had never heard of her or the film, but I discovered that it was seen by a fairly large number of people, and generally well appreciated. That film introduced the general public to her idea of “sexy baby voice,” and was discussed by Mark Liberman in a series of Language Log posts.

The name of the film references the famous phrase “In a world…” used in voice-over tracks to introduce trailers for science-fiction action films. In the film, Bell’s character is competing to be the first woman to voice these kinds of macho trailers. The thesis of the film is that women are just as capable as men of delivering this punchy, aggressive style of speech, and are being held back from that success by what else? “Sexy baby voice.”

Even without going to the hypermasculine extent of action film voice-overs, Bell is implicitly endorsing the management-consultant approach to voice and gender that treats any bias against women’s speech as evidence of a deficiency in the women’s speech itself, a deficiency that can be remedied with enough courses in proper speaking. This is extensively debunked by linguists like Deborah Cameron and Lisa Davidson in articles that I linked from previous posts.

So there we have the three features of “sexy baby voice”: bright resonance, which is an indicator of youth and femininity; creaky voice, which is one of a handful of strategies available to young women to darken their resonance, and legato articulation, which is also an indicator of femininity. If we find this in women who are actually young, it basically means that they want to get away from girlish voices without sounding like tomboys or fussy older women. Judging young women for this strikes me as unfair and mean-spirited.

I have to point out, however, that young women are not the main target of Bell’s “sexy baby voice” tirades. Her ire is directed at older women who, she argues, have other ways of accessing dark resonance but use bright resonance with creaky voice anyway. I’ll address that in another post!

Youth, authority, gender and creaky voice

Leave a comment December 9, 2022 Angus Andrea Grieve-Smith

Recently I’ve written two posts about bright resonance in response to Lake Bell’s audiobook chapter, “Sexy Baby Voice.” Bell describes “sexy baby voice” as having three characteristic features: “high pitch”, “vocal fry” and “slurring.” My first post supported Byron Ahn’s analysis that found that Bell’s “sexy baby voice” samples didn’t have reliably higher pitch than the non-“sexy baby voice” samples, and suggested that she’s probably talking about bright resonance. My second post drew on phonetic and pedagogical research to confirm Bell’s claim that while resonance is constrained by the size and shape of our vocal tracts, it can be consciously controlled to a certain degree.

In this post I want to connect bright resonance (what Bell calls “high pitch”) with creaky voice (“vocal fry”). The original reason they’re used together is youth.

Bell’s argument is that “sexy baby voice” keeps women from being taken seriously, so let’s imagine a young woman who wants to be taken seriously when she talks. Let’s say it’s 1990, and this woman is named Heather, and she has important things to say, whether it’s in a speech or in conversation. And importantly for our purposes, Heather is trendy and feminine.

On some level Heather is aware that dark resonance adds gravitas to speech. But she’s young, she’s petite, she hasn’t given birth and she doesn’t smoke, so she has a relatively short vocal tract and thin vocal folds. This means that without using any of the vocal habits I described in my last post, Heather’s voice will sound girlish, and will risk being prejudged as immature and unserious.

Heather may try some of those habits and find them wanting. She’s already avoiding twang and nasal resonance, which would make her voice sound even brighter. She could try rounding and protruding her lips and using the furthest-back tongue articulations, the time-honored strategy of boys and tomboys. But here’s the thing: she doesn’t want to sound too masculine. She wants to be feminine, but taken seriously. And maybe even sexy.

Another strategy, lowering the larynx, also clashes with the style she wants. It sounds too formal, too grande dame, too fussy. Not at all trendy or stylish.

Let’s imagine that after trying all these strategies, Heather’s a little tired and resigned. She relaxes her voice and it drops into creak. And it doesn’t sound fussy or tomboyish, but it has dark resonance. Maybe it even sounds a bit fashionably blasé!

And from a completely personal view, I just want to say that I do find creaky voice adds a bit of gravitas, and it can be very sexy. When I hear a woman with creaky voice combined with bright overtones, I get an impression of smallness in bigness. I think of creaky voice as the oversize sweater, boyfriend shirt or even mom jeans of the voice.

So Heather starts using creak whenever she wants to be taken seriously. And because she’s trendy, other young women imitate her. Heather is Creaker Zero of late twentieth century “vocal fry.”

Is that the way it actually happened? I have no idea. But it’s a possible scenario. And the scorn that’s been heaped on “vocal fry” over the past thirty plus years has been a potent example of the double bind that women are placed in so many times. Not enough dark resonance? Girlish. Rounded lips? Transgressing gender. Lowered larynx? Fussy. Creaky voice? You’re destroying your voice!

A lot of the politics of women’s voices has been covered by linguists I respect and admire, so for most of this I’ll just refer you to the responses of Deborah Cameron, Penny Eckert and Lisa Davidson to the 2015 “vocal fry” panic, and radio producer Katie Mingle’s all-purpose response to criticism of women’s voices.

This is one area where Malcolm Gladwell failed in this chapter. Gladwell is the producer of Bell’s audiobook and a friend of Bell, and in the chapter she turns to him for feedback. His biggest strength is the ability to find experts and present their ideas in ways that engage a broader audience, but in this chapter he doesn’t talk to Cameron, Eckert, Davidson or even Mingle. He just sits there and gives his own opinions, even conflating “high pitch” with “uptalk.” In his defense, it is possible that he tried to refer Bell to experts, but we don’t hear about it.

Controlling the brightness of the voice

2 Comments December 3, 2022 Angus Andrea Grieve-Smith

A few weeks ago I posted about “Sexy baby voice,” the topic of a chapter in Lake Bell’s audiobook about the culture and politics of voices. Bell identified three characteristics of “sexy baby voice” in women: high pitch, “vocal fry” (creaky voice) and “slurring.”

In phonetics, “pitch” is generally understood to refer to the fundamental frequency of the speech signal, but on Twitter the phonetician Byron Ahn posted the results of a computer analysis of some of the examples Bell gave for “high pitch” and pointed out that their fundamental frequencies weren’t much higher than the examples she gave for “normal” speech. In my post, I suggested that Bell is probably referring to the frequencies of harmonics in the speech, also called “resonance” or “formants.” It sounds like the most salient feature of “sexy baby voice” is bright resonance.
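
For readers who think better in code, here is a minimal sketch of the difference between pitch and brightness. It is not Ahn’s analysis or any standard phonetics tool: it builds two synthetic tones with the same 200 Hz fundamental but different harmonic weights, estimates the fundamental with a simple normalized autocorrelation, and uses the spectral centroid as a crude stand-in for brightness. The sample rate, the harmonic weights, the search range and all the function names are assumptions made up for this illustration, and a real formant analysis would be considerably more involved.

```python
import math

SAMPLE_RATE = 8000  # samples per second; an assumption for this sketch


def synthesize(f0, harmonic_weights, duration=0.05):
    """Build a tone as a weighted sum of harmonics of f0 (hypothetical helper)."""
    n = int(SAMPLE_RATE * duration)
    return [
        sum(w * math.sin(2 * math.pi * f0 * (k + 1) * t / SAMPLE_RATE)
            for k, w in enumerate(harmonic_weights))
        for t in range(n)
    ]


def estimate_f0(signal, fmin=125, fmax=400):
    """Estimate the fundamental from the best normalized autocorrelation lag.

    The search range is kept under one octave so that only one period of the
    test tones fits inside it (an assumption that sidesteps octave errors).
    """
    best_lag, best_score = None, float("-inf")
    for lag in range(SAMPLE_RATE // fmax, SAMPLE_RATE // fmin + 1):
        head = signal[:len(signal) - lag]
        tail = signal[lag:]
        dot = sum(a * b for a, b in zip(head, tail))
        norm = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        score = dot / norm if norm else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return SAMPLE_RATE / best_lag


def spectral_centroid(signal):
    """Magnitude-weighted mean frequency: a rough proxy for brightness."""
    n = len(signal)
    total = weighted = 0.0
    for k in range(1, n // 2):
        # Naive discrete Fourier transform for frequency bin k.
        re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mag = math.hypot(re, im)
        total += mag
        weighted += mag * (k * SAMPLE_RATE / n)
    return weighted / total


# Same fundamental, different balance of harmonics (weights are made up).
dark = synthesize(200, [1.0, 0.5, 0.25, 0.1, 0.05])
bright = synthesize(200, [0.3, 0.5, 0.7, 1.0, 1.0])

for name, tone in (("dark", dark), ("bright", bright)):
    print(name, round(estimate_f0(tone), 1), round(spectral_centroid(tone), 1))
```

If the sketch behaves as intended, both tones should come out with the same estimated fundamental, while the second has a noticeably higher centroid: same “pitch,” brighter sound.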

As I discussed in my post, bright resonance is generally associated with youth and femininity, because it’s usually caused by small vocal folds in small vocal tracts, and women and children tend to be smaller and have smaller vocal folds. Even younger women tend to have brighter resonance than older women, primarily because of the effect of hormonal changes during childbirth and menopause.

Of course, as Bell demonstrates repeatedly in her chapter, bright resonance can also be controlled, either consciously in the moment or subconsciously through habit and training. I’ve learned about these ways over the years, as a linguistics doctoral student, as a transgender woman and as an amateur singer. I’ll go through all the ways I know to do this.

Diagram of the vocal tract produced by the U.S. Centers for Disease Control and Prevention, and distributed by Wikimedia Commons.

As with the previous post in this series, my knowledge comes from training, not reading, so I don’t know who to credit for figuring all this out about the vocal tract. For now I will credit my primary teachers: the vocal coaches Kristy Bissell and Erin Carney, and the phoneticians Jacques Filliolet, Karen Landahl, Alex Francis and Doug Honorof.

For people who haven’t studied the anatomy of the vocal tract, this will get a little technical. In this blog post I’m going to use all the technical language, but if there’s a particular area that you feel could use more explanation for a general audience, please let me know.

Let’s start with the larynx and move through the vocal tract with the breath. The vocal folds generate sound through their vibration. When closing completely they generate a relatively coherent sound wave, but they can add dark resonance by maintaining gaps of particular sizes to allow low-frequency vibrations, what we call creaky voice or “vocal fry.” Similarly, they can add bright resonance by allowing turbulent air to flow through, causing breathy voice.

Just above the larynx is a tube called the pharynx. We can add bright resonance by constricting the pharynx, a practice that vocal coaches call “twang.” The name confused me for a while, because I associate the word “twang” with the Southern vowel shift, but in this case it refers to the narrowing of the pharynx.

The velum is a flap of muscle that we open to allow air to flow through the nose. When we allow air to flow through the nose and mouth at the same time it produces nasal resonance, which adds brighter resonance.

We use our tongues to produce consonants and vowels, raising a part of the tongue towards the roof of our mouths, so a /d/ sound is formed by touching the front of the mouth, and a /g/ sound by touching further back. For each of these sounds there is a range of positions along the roof of our mouth. When we raise our tongues further forward within the range for that sound, we generate brighter resonance. We can also generate bright resonance by flattening the tongue, allowing it to be raised higher. There is extensive research showing that women and gay men tend to have brighter resonance on their /s/ phonemes, and that people who make brighter /s/ sounds tend to be heard as women or gay men, even if they aren’t.

The lips are the gates that release our voices to the air outside. Rounded and protruded lips can produce darker resonance, and spread lips (in a smile or similar shape) can produce brighter resonance. I remember hearing about a study showing that even before puberty, boys tend to round their lips to sound more masculine.

One thing that makes this confusing is that all these vocal tract configurations have other functions. Creaky voice can be a sign of fatigue. Breathy voice can be a sign of relaxation. Pharyngeal constriction, nasal resonance, place of articulation and lip rounding can each change one word into another word with a completely different meaning, in Arabic, French, English and other languages.

These articulations can also interact with each other and with the fundamental frequency of the voice in different ways. At low frequencies, breathy voice can sound sympathetic or sexy, but at high frequencies it can sound weak and vulnerable. This may be what you want to project, or it may not. Nasal resonance and pharyngeal constriction can sound forced or strident, obnoxious or insensitive.

The bottom line is that these aspects of the voice are all under some degree of conscious control. How much control a speaker has, and how conscious they are, depends on a lot of factors, but the takeaway for Bell’s chapter is that people with smaller vocal tracts can use these techniques to speak with darker resonance than they would without them, and people with larger vocal tracts can use them to speak with brighter resonance than they otherwise would.

Note that I’m using the term “otherwise.” The terms I want to avoid, for this post at least, are “natural,” “authentic,” “real” and “your/my/their voice.” The tension between biological constraints, habit and conscious control is what makes resonance so fraught, politically, culturally and socially, which is why Bell and others have such intense feelings about it. That’s for another post.

Screenshot of the "Compose new Tweet" modal on Twitter, with the "+" button and a tooltip reading "Add another Tweet". The tweet text reads "blah blah blah bl"

Dialogue and monologue in social media

Leave a comment November 27, 2022 Angus Andrea Grieve-Smith

I wrote most of this post in June 2022, before a lot of us decided to try out Mastodon. I didn’t publish it because I despaired of it making a difference. It felt like so many people were set in particular practices, including not reading blog posts! My experience on Mastodon has been so much better than the past several years on Twitter. I think this is connected with how Twitter and Mastodon handle threads.

A few years ago I wrote a critique of Twitter threads, tweetstorms, essays, and similar forms. I realize now that I didn’t actually talk much about what’s wrong with them. I focused on how difficult they are to read, but I didn’t realize how the native Twitter website and app actually make them easier to read. So let me tell you some of the deeper problems with threads.

In 2001 I visited some of the computational linguistics labs at Carnegie Mellon University. Unfortunately I don’t remember the researchers’ names, but they described a set of experiments that has informed my thinking about language ever since. They were looking at the size of the input box in a communication app.

These researchers did experiments where they asked people to communicate with each other using a custom application. They presented different users with input boxes of different sizes: some got only a single line, others got three or four, and maybe some got six or eight lines.

What they found was that when someone was presented with a large blank space, as in an email application or the Google Docs application I’m writing this in, they tended to take their time and write long blocks of text, and edit them until they were satisfied. Only then did they hit send. Then the other user would do the same.

When the Carnegie Mellon researchers presented users with only one line, as in a text message app, their behavior was much different. They wrote short messages and sent them off with minimal editing. The short turnaround time resulted in a dialogue that was much closer to the rhythm of spoken conversation.

This echoed my own findings from a few years before. I was searching for features of French that I heard all over the streets of Paris, but that had not been taught to me in school, in particular what linguists call right dislocation (“Ils sont fous, ces Romains”) and left dislocation (“L’État, c’est moi”).

In 1998 the easiest place to look was USENET newsgroups, and I found that even casual newsgroups like fr.rec.animaux were heavy on the formal, carefully crafted types of messages I remembered from high school French class. I had already read some prior research on this kind of language variation, so I decided to try something with faster dialogue.

In Internet Relay Chat (IRC) I hit the jackpot. On the #france IRC channel, left and right dislocations made up between 21% and 38% of all finite clauses. I noticed other features of conversational French like ne-dropping were common as well. I could even see IRC newbies adapting in real time: they would start off trying to write formal sentences the way they were taught in lycée, and soon give up and start writing the way they talked.

At this point I have to say: I love dialogue. Don’t get me wrong: I can get into a nice well-crafted monologue or monograph. And anyone who knows me knows I enjoy telling a good story or tearing off on a rant about something. But dialogue keeps me honest, and it keeps other people honest too.

Dialogue is not inherently or automatically good. On Twitter as in many other places, it is used to harass and intimidate. But when properly structured and regulated it can be a democratizing force. It’s important to remember how long our media has been dominated by monologues: newspapers, films, television. Even when these formats contain dialogues, they are often fictional dialogues written by a single author or team of authors to send a single message.

One of my favorite things about the internet is that it has always favored dialogue. Before large numbers of people were on the internet there was a large gap between privileged media sources and independent ones. Those of us who disagreed with the monologues being thrust upon us by television and newspapers were often reduced to impotently talking back at those powerful media sources, in an empty room.

USENET, email newsletters, personal websites and blogs were democratizing forces because they allowed anyone who could afford the hosting fees (sometimes with the help of advertisers) to command these monologic platforms. They were the equivalent of Speakers’ Corner in London. They were like pamphlets or letters to the editor or cable access television, but they eliminated most of the barriers to entry. But they were focused on monologues.

In the 1990s and early 2000s we had formats that encouraged dialogue, like mailing lists and bulletin boards, but they had large input boxes. As I saw on fr.rec.animaux in 1998, that encouraged long, edited messages.

We did have forums with smaller input boxes, like IRC or the group chats on AOL Instant Messenger. As I found, those encouraged people to write short messages in dialogue with each other. When I first heard about Twitter with its 140-character limit I immediately recognized it as a dialogic forum.

But what sets Twitter apart from IRC or AOL Instant Messenger? Twitter is a broadcast platform. The fact that every tweet is public by default, searchable and assigned a unique URL, makes it a “microblog” site like some popular sites in China.

If someone said something on IRC or AIM in 1999 it was very hard to share it outside that channel. I was able to compile my corpus by creating a “bot” that logged on to the #france channel every night and logged a copy of all the messages. What Twitter and the sites that copied it, like Weibo, brought was the combination of permanent broadcast, low barrier to entry, and dialogue.

This is why I’m bothered by Twitter threads, by screenshots of text, by the unending demands for an edit button. These are all attempts to overpower the dialogue on Twitter, to remove one of the key elements that make it special.

Without the character limits, Twitter is just a blogging platform. Of course, there’s nothing wrong with blogs! I’ve done a lot of blogging, I’ve done a lot of commenting on blogs and I’ve tweeted a lot of links to blogs. But I want to choose when to follow those links and go read those blog posts or news articles or press releases.

I want a feed full of dialogue or short statements. Threads and screenshots interrupt the dialogue. They aggressively claim the floor, crowding out other tweets. Screenshots interrupt the other tweets with large blocks of text, demanding to be read in their entirety. Threads take up even more of the timeline. The Twitter web app will show as many as three tweets of a thread, interrupting the flow of dialogue.

The experience of threads is much worse on Twitter clients that don’t manipulate the timeline, like TweetDeck (which was bought by Twitter in 2011) and HootSuite. If it’s a long thread, your timeline is screwed, and you have to scroll endlessly to get past it.

One of the things I love the most about Mastodon is the standard practice of making the first toot in a thread public, but publishing all the other toots as unlisted. That broadcasts the toot announcing the thread, and then gives readers the agency to decide whether they want to read the follow-up toots. It’s more or less the equivalent of including a link to a web page or blog post in a toot.

There’s a lot more to say about dialogue and social media, but for now I’m hugely encouraged by the feeling of being on Mastodon, and I’m hoping it leads us in a better direction for dialogue, away from threads and screenshots.