Ten reasons why sign-to-speech is not going to be practical any time soon.

15 Comments April 12, 2016 Angus Andrea Grieve-Smith

It’s that time again! A bunch of really eager computer scientists have a prototype that will translate sign language to speech! They’ve got a really cool video that you just gotta see! They win an award! (from a panel that includes no signers or linguists). Technology news sites go wild! (without interviewing any linguists, and sometimes without even interviewing any deaf people).

…and we computational sign linguists, who have been through this over and over, every year or two, just *facepalm*.

The latest strain of viral computational sign linguistics hype comes from the University of Washington, where two hearing undergrads have put together a system that supposedly recognizes isolated hand gestures in citation form. But you can see the potential! *facepalm*.

Twelve years ago, after already having a few of these *facepalm* moments, I wrote up a summary of the challenges facing any computational sign linguistics project and published it as part of a paper on my sign language synthesis prototype. But since most people don’t have a subscription to the journal it appeared in, I’ve put together a quick summary of Ten Reasons why sign-to-speech is not going to be practical any time soon.

1. Sign languages are languages. They’re different from spoken languages. Yes, that means that if you think of a place where there’s a sign language and a spoken language, they’re going to be different. More different than English and Chinese.
2. We can’t do this for spoken languages. You know that app where you can speak English into it and out comes fluent Pashto? No? That’s because it doesn’t exist. The Army has wanted an app like that for decades, and they’ve been funding it up the wazoo, and it’s still not here. Sign languages are at least ten times harder.
3. It’s complicated. Computers aren’t great with natural language at all, but they’re better with written language than spoken language. For that reason, people have broken the speech-to-speech translation task down into three steps: speech-to-text, machine translation, and text-to-speech.
4. Speech to text is hard. When you call a company and get a message saying “press or say the number after the tone,” do you press or say? I bet you don’t even call if you can get to their website, because speech to text suuucks:

-Say “yes” or “no” after the tone.
-No.
-I think you said, “Go!” Is that correct?
-No.
-My mistake. Please try again.
-No.
-I think you said, “I love cheese.” Is that correct?
-Operator!

5. There is no text. A lot of people think that text for a sign language is the same as the spoken language, but if you think about point 1 you’ll realize that that can’t possibly be true. Well, why don’t people write sign languages? I believe it can be done, and lots of people have tried, but for some reason it never seems to catch on. It might just be the classifier predicates.
6. Sign recognition is hard. There’s a lot that linguists don’t know about sign languages already. Computers can’t even get reliable signs from people wearing gloves, never mind video feeds. This may be better than gloves, but it doesn’t do anything with facial or body gestures.
7. Machine translation is hard going from one written (i.e. written version of a spoken) language to another. Different words, different meanings, different word order. You can’t just look up words in a dictionary and string them together. Google Translate is only moderately decent because it’s throwing massive statistical computing power at the input – and that only works for languages with a huge corpus of text available.
8. Sign to spoken translation is really hard. Remember how in #5 I mentioned that there is no text for sign languages? No text, no huge corpus, no machine translation. I tried making a rule-based translation system, and as soon as I realized how humongous the task of translating classifier predicates was, I backed off. Matt Huenerfauth has been trying (PDF), but he knows how big a job it is.
9. Sign synthesis is hard. Okay, that’s probably the easiest problem of them all. I built a prototype sign synthesis system in 1997, I’ve improved it, and other people have built even better ones since.
10. What is this for, anyway? Oh yeah, why are we doing this? So that Deaf people can carry a device with a camera around, and every time they want to talk to a hearing person they have to mount it on something, stand in a well-lighted area and sign into it? Or maybe someday have special clothing that can recognize their hand gestures, but nothing for their facial gestures? I’m sure that’s so much better than decent funding for interpreters, or teaching more people to sign, or hiring more fluent signers in key positions where Deaf people need the best customer service.
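
To make the machine translation point concrete, here is a toy sketch of the "look up words in a dictionary and string them together" approach, using two written languages. The three-word English-to-French glossary and the function name are made up for illustration and are not taken from any real translation system.

```python
# Hypothetical three-word English-to-French glossary, tiny on purpose.
GLOSSARY = {"the": "la", "red": "rouge", "car": "voiture"}


def word_by_word(sentence: str) -> str:
    """Swap each word for its glossary entry, keeping the source word order."""
    words = sentence.lower().split()
    # Words missing from the glossary are flagged rather than guessed at.
    return " ".join(GLOSSARY.get(word, f"<{word}?>") for word in words)


print(word_by_word("the red car"))  # prints "la rouge voiture"
```

Every word is "right," but French wants "la voiture rouge," so even this three-word phrase comes out wrong. That is between two closely related written languages with plenty of text available, which is the easy case.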

So I’m asking all you computer scientists out there who don’t know anything about sign languages, especially anyone who might be in a position to fund something like this or give out one of these gee-whiz awards: Just stop. Take a minute. Step back from the tech-bling. Unplug your messiah complex. Realize that you might not be the best person to decide whether or not this is a good idea. Ask a linguist. And please, ask a Deaf person!

Note: I originally wrote this post in November 2013, in response to an article about a prototype using Microsoft Kinect. I never posted it. Now I’ve seen at least three more, and I feel like I have to post this. I didn’t have to change much.

15 thoughts on “Ten reasons why sign-to-speech is not going to be practical any time soon.”

Peter Bleackley

If you wanted to make a parallel corpus of BSL and English, your best starting point would be to contact the BBC. They have signed versions of many of their programmes, and the same programmes will also have subtitles that contain timing information for the spoken English. There is a research engineer at the BBC who has a good understanding of this problem space.

April 13, 2016 at 3:47 am
Matt Brown

There are problems with that kind of approach. Rachel Sutton-Spence pioneered this kind of work by using a “corpus” of transcripts from “See Hear”, which included a lot of captions which were themselves translations of BSL. The approach to translation is going to have a major impact, however, and the decision process behind captioning works under different limitations to almost any other translation/interpreting domain.

April 13, 2016 at 9:42 am
ryan hait-campbell

First off, I completely agree with you regarding the UoW article. It is just outright silly, as I’ve been telling people for a long time now. But I feel the need to correct you on one topic. You say “Just ask a deaf person.” Well, guess what?

I am a deaf person and also started Motionsavvy.com with 3 other deaf individuals. We are tackling this problem and have been slowly developing our software. Our goal is to build software that can pair up with upcoming mobile devices that take advantage of 3D camera technology.

We are funded by Rochester Institute of Technology and Wells Fargo, and our engineering team is composed of some of the best engineers RIT has to offer. Watch us make this a reality.

“I’m sure that’s so much better than decent funding for interpreters, or teaching more people to sign, or hiring more fluent signers in key positions where Deaf people need the best customer service.”

Who pays for all of this? The government or businesses, of course. Interpreters are too expensive; it is only thanks to the ADA that we even get this kind of access in America, but what about internationally? This article is an insult to those who really need this kind of technology developed.

I understand you wrote this out of respect, to warn the community about those kinds of articles, which I completely get, but I also think you should do a little research and see what other groups are working on this.

April 18, 2016 at 8:18 pm
Alexandr Opalka

I love that you posted this because you seem to have a gross misconception of what is actually going on in the space, both at the academic level and the commercial level. You should look into MotionSavvy, which is disproving your statements. Additionally, technologies such as the Leap Motion make high-accuracy recognition a reality. It seems any research you did in this space is so outdated that any knowledge you have on the subject now may be irrelevant.

I will say this: sign language recognition is hard, and humanity has accomplished tremendous things. If somebody wants to fund this, then they should feel free, with proper due diligence of course. Additionally, I do agree the sensors of the type seen above don’t do the trick, which is why MotionSavvy uses a Leap Motion sensor and an Intel RealSense camera, which handle the hand and arm tracking, plus language processing to convert the ASL input signs into grammatically correct English. We were even ranked by Time magazine as one of the top 25 inventions of 2014; seems like we are onto something here. MotionSavvy was also founded by, and until recently (3 years of operating) consisted entirely of, deaf employees and co-founders.

April 18, 2016 at 8:35 pm
grvsmth

Thank you for your comments, Ryan and Alexandra, and for sharing your work. I do appreciate your perspective as deaf people. Your Sign Builder and Crowd Sign tools sound very promising, and I wish you the best of luck.

That said, I remain skeptical that sign-to-speech will ever work reliably, much less by the summer of 2016. I used to “do a little research” and follow these stories, but I stopped because I found the constant hype to be depressing. Even if you’ve solved reason #6 (which I highly doubt), I don’t see any evidence that you’re making progress on the other obstacles. I look forward to the results when you put your talents to work on problems that can be solved in a reasonable amount of time.

April 18, 2016 at 9:14 pm
Dan Parvaz

So, Motionsavvy.com, do you have any actual published results? I mean, detecting individual manual signs is cool… HELLO MY NAME . But you have made it clear that you haven’t solved the major problems that make signed languages unique:

1. Non-manual markers.
2. The vast productive lexicon of signed languages — what are variously called “classifier predicates,” “depictive signs,” “embodied action,” etc.

How do you account for the way space gets used, even when you have a formal sign? a-GIVE-b. Who is giving to whom? Often those referents are established in space beforehand… sometimes, they’re in the room. Sometimes, it’s a mixture of the two. If you’ve solved the problems you say you’re solving, you’ve at least thought about this.

Also, who in the field of ASL linguistics or Deaf studies have you worked with? Have any sign language experts — not just the enthusiastic outsiders at TIME.com — really evaluated your work?

Less hype. More evidence. Please.

April 22, 2016 at 8:07 am
Prof G. H. Turner

What Dan said. Let’s see a considered, substantial, even-tempered response please, Motionsavvy. We’ve all been waiting for this to be done well for DECADES, and will be generous and enthusiastic supporters of anyone who truly cracks these problems. Thanks for the original post, AG-S.

April 26, 2016 at 2:10 am
Kevin Odom

My response is this: I am not impressed, because this will only be a one-way communication deal, for the benefit of the hearies (sign to speech). What about the reciprocity for the Deaf?! (Speech to sign)? Hearing people are just getting lazier and lazier in linguistics. If hearies would put just half of the effort that these scientists are doing with their experiments into actually learning sign language, we’d all be better off! Hearies, challenge yourselves: go learn the sign language of the country that you’re in!!! |m|/

April 27, 2016 at 7:42 pm
Teresa Blankmeyer Burke

Hi Angus! Thanks so much for writing this excellent explanation. I’ve shared it on FB to counter all the emails and messages I’ve been getting about the signing gloves. Of course, the videos I’ve viewed about the gloves aren’t captioned except for a sentence or two of “ASL” (the irony), and the gestures (I cannot bring myself to call what the inventors are doing ASL) are indecipherable, so the videos are inaccessible to deaf people…

April 28, 2016 at 11:56 am
David Swain

I read the above discussion with great fascination. Several years ago I had an idea whereby the deaf could communicate with the blind, who could communicate with the profoundly disabled, who could communicate with anyone. The key is to utilize Morse code. Morse has been proven over centuries. It can be used with any low-tech, inexpensive device that can emit sound or a light pulse, or even with a pencil and a piece of paper. The alphabet is not difficult to learn, and ham radio operators and the military have proven that it can be much more efficient than some of the high-tech attempts at speech-to-text, etc. Setting prejudices aside, think of what Helen Keller could have accomplished with this approach. Christy Brown could write novels as well as paint with his left foot. Stephen Hawking, I can’t even imagine. The problem with sign language is that there are far too few hearing individuals who are fluent in it and, as pointed out, it can be difficult to learn. Morse can be easily learned by almost anyone and would open up the opportunity for communication with hundreds of thousands who are currently cut off. Just something to think about.

April 29, 2016 at 9:39 am
Jenni Robinson

Since there are no naturally speaking translation programs/devices/gadgets out there for spoken languages to date, I am extremely confident that this one-sided conversational wonder will not be accepted in the Deaf community even in the nearest few decades. Aside from the years required to learn the language, there has to be constant input from native signers – not just from some hearing inventors who think this is cute. It can’t be for the “ooohs” and “ahs” that this is endeavored. Even with the work Motionsavvy is doing, I still see decades of more work to produce a fully-functional and multi-lingual platform (i.e. with a two-sided conversation). Kudos to the ingenuity, but nothing will replace actually learning the language.

April 30, 2016 at 3:26 am
grvsmth

David, I’m sure a lot of people have thought about that. The biggest issue I can think of is bandwidth.

April 30, 2016 at 12:34 pm
Emily M. Bender

The UW ASL program (and the Linguistics Department) have written an open letter responding to how the SignAloud project was promoted/hyped. (Note that the letter is endorsed by the SignAloud developers.) You can find it at this link:

https://catalyst.uw.edu/workspace/lforshay/10514/432760

Yes, ASL-to-English translation is hard, and yes the SignAloud folks missed much of that complexity, but the bigger issues here have to do with audism and cultural appropriation. I’m very glad to hear that there are Deaf-led efforts to work on MT for ASL!

May 26, 2016 at 8:02 pm
grvsmth

Thanks for that link, Dr. Bender, and thanks for linking to this post on that page!

May 27, 2016 at 11:13 pm
Mike Armstrong

https://www.theatlantic.com/technology/archive/2017/11/why-sign-language-gloves-dont-help-deaf-people/545441/

Nails this perfectly:-

“And the writers of the UW letter argued that the development of a technology based on a sign language constituted cultural appropriation. College students were gaining accolades and scholarships for technologies based on an element of Deaf culture, while Deaf people themselves are legally and medically underserved.”

and

“Still, as long as actual Deaf users aren’t included in these projects, inventors are likely to continue creating devices that offend the very group they say they want to help.”

November 10, 2017 at 8:48 am
