The Big Idea: IBM Improves Speech Recognition Accuracy


We’ve all been there…

“Hey, Siri, what is the driving distance between Bristow, Virginia and Chicago?”

“Here is the driving distance between Bristol, Tennessee and the Ohio border.”


Yes, sometimes we feel like technology is working against us when it comes to speech recognition. It could be an issue of accents. It could be how little we use our virtual assistants. (They usually learn faster and recognize speech patterns better with more use.) Whatever the case, it can be a touch frustrating when we try to connect with our voice-activated companions and they misunderstand what we are asking.

We’ve all been there, and IBM knows we’ve all been there, and that’s why the tech innovator is working hard to make our virtual assistants a bit more versed in enunciation and pronunciation.

Recently, IBM addressed speech recognition accuracy. “We’re getting better and better with speech recognition,” George Saon, Principal Research Scientist for IBM, said in a recent blog post. “It’s unbelievably good these days. If you talk to Amazon Echo or Siri or any of these other devices they can really understand what you’re saying.”

IBM reported a 5.5% word error rate, an astounding improvement and a new bar set for other virtual assistant developers. But you’re probably wondering how accurate 5.5% is. Essentially, with that error rate, about one out of every eighteen words is understood in error.
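For the curious, here is a rough sketch of how a word error rate is commonly computed: count the word substitutions, deletions, and insertions needed to turn the recognized text into what was actually said, then divide by the number of words actually said. The function name and the sample sentences are my own illustration (borrowing the Siri mix-up from above), not IBM's scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # a reference word was dropped
            insertion = curr[j - 1] + 1  # an extra word was recognized
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only (punctuation removed for simplicity).
said = "what is the driving distance between bristow virginia and chicago"
heard = "what is the driving distance between bristol tennessee and chicago"
wer = word_error_rate(said, heard)
print(f"{wer:.0%}")  # two wrong words out of ten

# A 5.5% error rate works out to about one error every 18 words.
print(round(1 / 0.055, 1))
```

In this made-up example, two of the ten spoken words come back wrong, so the word error rate is 20%, far worse than the 5.5% IBM reported.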

To give you an idea of how dramatic an improvement this new rate is, IBM’s previous best was a 6.9% error rate.

Microsoft claimed in October 2016 that its own speech recognition was down to a 5.9% error rate. The company uses neural language models and has created associative word clouds with them for its virtual assistants to access. On achieving this low error rate, Microsoft said it believed 5.9% was equivalent to human-quality speech recognition. It makes sense that they would call it “human quality recognition,” as this makes their own technology the definitive benchmark. However, IBM believes human speech recognition is closer to a 5.1% error rate. In other words, even a human listening to another human will mishear or misunderstand about 5% of the conversation, or roughly one out of every twenty words.

IBM reached this 5.5% milestone by combining Long Short-Term Memory (LSTM) artificial neural networks with WaveNets. Combining these different algorithms creates an aggregate system far more accurate and “intelligent” than any of them individually. Think of it as a group of individuals brought together to solve the same problem, as opposed to smart minds working alone. This new super-network dedicated to improving speech recognition brought the error rate down to 5.5%.
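To make the "group of individuals" idea a little more concrete, here is a toy sketch of one common way to combine models: each model scores the candidate transcriptions, and the scores are blended with a weighted sum before picking a winner. This is only an illustration under my own assumptions. The function names, the weights, and the log-probability numbers are all made up, and nothing here claims to be how IBM's system actually combines its networks.

```python
def combine_log_probs(scores_per_model, weights):
    """Weighted sum of each model's log-probability for every candidate."""
    combined = {}
    for model_scores, weight in zip(scores_per_model, weights):
        for hypothesis, log_prob in model_scores.items():
            combined[hypothesis] = combined.get(hypothesis, 0.0) + weight * log_prob
    return combined


def best_hypothesis(scores_per_model, weights):
    """Return the candidate transcription with the highest combined score."""
    combined = combine_log_probs(scores_per_model, weights)
    return max(combined, key=combined.get)


# Hypothetical scores from two different models (higher is better).
model_a = {"Bristow, Virginia": -2.0, "Bristol, Tennessee": -1.8}
model_b = {"Bristow, Virginia": -1.0, "Bristol, Tennessee": -2.5}

print(best_hypothesis([model_a], [1.0]))                # model A on its own
print(best_hypothesis([model_a, model_b], [0.5, 0.5]))  # both models together
```

With these invented numbers, model A working alone slightly prefers the wrong town, but once model B's opinion is blended in, the correct one wins.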

This new benchmark is quite a triumph for IBM, but they have made it clear they won’t be content until they get the error rate down to the coveted 5.1%, meaning their recognition software would be on par with human speech recognition. But as innovation goes, it wouldn’t surprise me if IBM were to reach even further and try to achieve better than that.

I just want my virtual assistant to know the difference between a Stratocaster and music from the Stray Cats. That would be a step in the right direction.



A research physicist who has become an entrepreneur and educational leader, and an expert on competency-based education, critical thinking in the classroom, curriculum development, and education management, Dr. Richard Shurtz is the president and chief executive officer of Stratford University. He has published over 30 technical publications, holds 15 patents, and is host of the weekly radio show, Tech Talk. A noted expert on competency-based education, Dr. Shurtz has conducted numerous workshops and seminars for educators in Jamaica, Egypt, India, and China, and has established academic partnerships in China, India, Sri Lanka, Kurdistan, Malaysia, and Canada.