How Google’s Pixel Buds will change the world!

Scene: a quietly bustling bistro in Paris’s 14th Arrondissement.

SERVER: Oui, vous désirez?
PIXELBUDS: Yes, you desire?
TOURIST: Um, yeah, I’ll have the steak frites.
SERVER: Que les frites?
PIXELBUDS: Than fries?
TOURIST: No, at the same time.
SERVER: Alors, vous voulez le steak aussi?
PIXELBUDS: You want the steak too?
TOURIST: Yeah, I just ordered the steak.
SERVER: Okay, du steak, et des frites, en même temps.
PIXELBUDS: Okay, steak, and fries at the same time.
TOURIST: You got it.

(All translations by Google Translate. Photo: Alain Bachelier / Flickr.)

Ten reasons why sign-to-speech is not going to be practical any time soon.

It’s that time again! A bunch of really eager computer scientists have a prototype that will translate sign language to speech! They’ve got a really cool video that you just gotta see! They win an award! (from a panel that includes no signers or linguists). Technology news sites go wild! (without interviewing any linguists, and sometimes without even interviewing any deaf people).

…and we computational sign linguists, who have been through this over and over, every year or two, just *facepalm*.

The latest strain of viral computational sign linguistics hype comes from the University of Washington, where two hearing undergrads have put together a system that … supposedly recognizes isolated hand gestures in citation form. But you can see the potential! *facepalm*.

Twelve years ago, after already having a few of these *facepalm* moments, I wrote up a summary of the challenges facing any computational sign linguistics project and published it as part of a paper on my sign language synthesis prototype. But since most people don’t have a subscription to the journal it appeared in, I’ve put together a quick summary of Ten Reasons why sign-to-speech is not going to be practical any time soon.

  1. Sign languages are languages. They’re different from spoken languages. Yes, that means that if you think of a place where there’s a sign language and a spoken language, they’re going to be different. More different than English and Chinese.
  2. We can’t do this for spoken languages. You know that app where you can speak English into it and out comes fluent Pashto? No? That’s because it doesn’t exist. The Army has wanted an app like that for decades, and they’ve been funding it up the wazoo, and it’s still not here. Sign languages are at least ten times harder.
  3. It’s complicated. Computers aren’t great with natural language at all, but they’re better with written language than spoken language. For that reason, people have broken the speech-to-speech translation task down into three steps: speech-to-text, machine translation, and text-to-speech.
  4. Speech to text is hard. When you call a company and get a message saying “press or say the number after the tone,” do you press or say? I bet you don’t even call if you can get to their website, because speech to text suuucks:

    -Say “yes” or “no” after the tone.
    -I think you said, “Go!” Is that correct?
    -My mistake. Please try again.
    -I think you said, “I love cheese.” Is that correct?

  5. There is no text. A lot of people think that text for a sign language is the same as the spoken language, but if you think about point 1 you’ll realize that that can’t possibly be true. Well, why don’t people write sign languages? I believe it can be done, and lots of people have tried, but for some reason it never seems to catch on. It might just be the classifier predicates.
  6. Sign recognition is hard. There’s a lot that linguists don’t know about sign languages already. Computers can’t even get reliable signs from people wearing gloves, never mind video feeds. This may be better than gloves, but it doesn’t do anything with facial or body gestures.
  7. Machine translation is hard, even going from one written language (i.e. the written version of a spoken language) to another. Different words, different meanings, different word order. You can’t just look up words in a dictionary and string them together. Google Translate is only moderately decent because it’s throwing massive statistical computing power at the input – and that only works for languages with a huge corpus of text available.
  8. Sign to spoken translation is really hard. Remember how in #5 I mentioned that there is no text for sign languages? No text, no huge corpus, no machine translation. I tried making a rule-based translation system, and as soon as I realized how humongous the task of translating classifier predicates was, I backed off. Matt Huenerfauth has been trying, but he knows how big a job it is.
  9. Sign synthesis is hard. Okay, that’s probably the easiest problem of them all. I built a prototype sign synthesis system in 1997, I’ve improved it, and other people have built even better ones since.
  10. What is this for, anyway? Oh yeah, why are we doing this? So that Deaf people can carry a device with a camera around, and every time they want to talk to a hearing person they have to mount it on something, stand in a well-lighted area and sign into it? Or maybe someday have special clothing that can recognize their hand gestures, but nothing for their facial gestures? I’m sure that’s so much better than decent funding for interpreters, or teaching more people to sign, or hiring more fluent signers in key positions where Deaf people need the best customer service.
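To make point 7 concrete, here is a minimal Python sketch of what happens when you do just look up words in a dictionary and string them together. The glossary is a toy I made up for illustration (each entry is one real sense of the French word, picked by hand), and `word_for_word` is a hypothetical helper, not how Google Translate or any real system works.

```python
# Toy glossary: hand-picked, illustrative entries only. Each French word is
# mapped to a single English sense, which is exactly the problem.
GLOSSARY = {
    "je": "I",
    "ne": "not",
    "sais": "know",
    "pas": "step",    # "pas" is also the second half of the negation "ne ... pas"
    "que": "than",    # "que" can also mean "that", "what" or "only"
    "les": "the",
    "frites": "fries",
}


def word_for_word(sentence: str) -> str:
    """Translate by looking up each word on its own, keeping the source word order."""
    words = sentence.lower().strip("?!.").split()
    # Words missing from the glossary are passed through in angle brackets.
    return " ".join(GLOSSARY.get(word, f"<{word}>") for word in words)


print(word_for_word("Que les frites?"))  # than the fries
print(word_for_word("Je ne sais pas."))  # I not know step
```

Neither output is what a French speaker means. “Je ne sais pas” is “I don’t know,” and as far as I can tell the server’s “Que les frites?” is closer to “Only the fries?” Choosing the right sense and the right word order takes context that a lookup table doesn’t have.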

So I’m asking all you computer scientists out there who don’t know anything about sign languages, especially anyone who might be in a position to fund something like this or give out one of these gee-whiz awards: Just stop. Take a minute. Step back from the tech-bling. Unplug your messiah complex. Realize that you might not be the best person to decide whether or not this is a good idea. Ask a linguist. And please, ask a Deaf person!

Note: I originally wrote this post in November 2013, in response to an article about a prototype using Microsoft Kinect. I never posted it. Now I’ve seen at least three more, and I feel like I have to post this. I didn’t have to change much.

Making dreams and stuff

Yesterday Tyler Schnoebelen posted an important warning to anyone who thinks translation is simple, by going through various translations of the quote “the stuff that dreams are made of” from the movie The Maltese Falcon. Unfortunately it’s even worse than he says, because the official French translation, at least, is not very good.

One spring evening when I was in college in Paris I went out to see Casablanca.  I had liked it in high school film class, and noticed all the references to it that pervade Anglo-American pop culture.  When I watched English-language movies in Paris I always checked the subtitles to pick up some good French expressions.  The lobby was full of French people excitedly chattering about how much they loved “Casa” too.

Well, if these French people loved “Casa,” it must’ve been for the visuals, or because they understood the English dialogue, because it sure wasn’t for the French subtitles.  I got to read one famous, poetic line after another rendered in dull clichés.  There were two that I remember most vividly.

There’s the line, “Here’s looking at you, kid,” that Rick says to Ilsa four times in the movie. I’ve never heard anyone say it when they weren’t referencing the movie.  It’s a toast, but it’s clearly a very affectionate toast, and Rick repeats it even when they’re not drinking, instead of something more direct, like “I love you,” as fits the conflicted nature of their relationship.  How did our subtitler render these complex nuances? With the most standard and formulaic French toast, “To your health!”

The other line that really struck me is the last one in the movie. [Um, SPOILER ALERT!] Police Captain Renault, who has demonstrated a lack of morals throughout the movie, has just allowed Ilsa to escape, shielded Rick from prosecution for killing a Nazi officer, and offered to flee to the Congo with him. As the two men walk off into the dark, Rick says, “Louis, I think this is the beginning of a beautiful friendship.”

By hedging with “I think” and explicitly referencing “the beginning,” Rick emphasizes that this is a turning point for them, and that he has been impressed with Renault’s actions. How did this come out in the subtitles?

Maintenant, nous sommes amis!

They didn’t even translate the whole thing! It was just “Now we’re friends!” with that silly little exclamation point on the end. No thinking, no beginning, not even a friendship, just “friends!”

I’m not necessarily faulting the subtitler. I don’t know when it was translated, or how famous the movie was at the time. The people who pay for subtitles are often in a hurry and don’t want to pay for quality, so they get what they pay for.

Casablanca deserves better, of course, and for the seventieth anniversary DVD they got it. “Here’s looking at you, kid,” is translated as “À vous, mon petit !” and then finally, “Bonne chance, mon petit.” The last line is rendered as, “Louis, je crois que ceci est le début d’une merveilleuse amitié !” which is very literal, but a lot better than the first version.

I wouldn’t be at all surprised if the original subtitles for The Maltese Falcon were just as bad as those for Casablanca, and if they haven’t been updated for a seventieth anniversary release, they may still be bad. “L’étoffe dont sont faits les rêves” isn’t horrible, but as Tyler points out, étoffe is a very concrete noun, usually indicating some kind of fabric or stuffing. The noun matière (also feminine) is much closer to the sense of “substance” that I think the Maltese Falcon screenwriters were aiming for, and it is in fact frequently used to translate the original Shakespeare quote, as in this 1882 translation: “Nous sommes de la matière dont on fait les rêves.”

My overall point is that translation is hard. Even the best translators get stumped regularly, and mediocre translators can put out some real howlers. More importantly, translation is not a repetitive and precise task like statistical analysis, where computers can improve on the job that humans do. It requires knowledge, subtlety and art, and if this is the best that people can do, computers aren’t going to get anywhere close.