Baidu Announces Major Advances in Speech Recognition Using Neural Networks

Baidu has unveiled new work from its Silicon Valley AI Lab (SVAIL), including the ability to accurately recognize both English and Mandarin with a single learning algorithm. The algorithm replaces entire pipelines of hand-engineered components with neural networks.

SVAIL’s Deep Speech system, announced last year, initially focused on improving English speech recognition accuracy in noisy environments like restaurants, cars and public transportation.

Over the past year, the researchers improved Deep Speech’s performance in English and also trained it to transcribe Mandarin.

The Mandarin version achieves high accuracy in many scenarios and is ready to be deployed on a large scale in real-world applications, such as web searches on mobile devices.

Andrew Ng, Chief Scientist at Baidu, commented: "SVAIL has demonstrated that our end-to-end deep learning approach can be used to recognize very different languages.

"Key to our approach is our use of high-performance computing techniques, which resulted in a 7x speedup compared to last year at this time. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly."

Commenting on Deep Speech’s high-performance computing architecture, Dr. Bill Dally, Chief Scientist, NVIDIA, said: "I am very impressed by the efficiency Deep Speech achieves by using batching to deploy DNNs for speech recognition on GPUs. Deep Speech also achieves remarkable throughput while training RNNs on clusters of 16 GPUs."

"We believe these techniques will continue to scale, and thus conclude that the vision of a single speech system that outperforms humans in most scenarios is imminently achievable," the researchers wrote.

In the paper, SVAIL also reported that Deep Speech is learning to process English spoken in various accents from around the world. Currently, such processing is challenging for popular speech systems used by mobile devices.

Deep Speech has made rapid improvement on a range of English accents, including Indian-accented English as well as accents from countries in Europe where English is not the first language.

"I had a glimpse of Deep Speech’s potential when I previewed it in its infancy last year," said Dr. Ian Lane, Assistant Research Professor of Engineering, Carnegie Mellon University. "Today, after a relatively short time, Deep Speech has made significant progress.

"Using a single end-to-end system, it handles not only English but Mandarin, and is on its way to being released into production. I’m intrigued by Baidu’s Batch Dispatch process and its capacity to shape the way large deep neural networks are deployed on GPUs in the cloud."