Speech recognition has gone from convenient to crucial over the last few years as smart speakers and driving assist modes have taken off — but not everyone’s voice is recognized equally well. Speechmatics claims to have the most inclusive and accurate model out there, beating Amazon, Google and others when it comes to speech outside of the most common American accents.
The company explained that it was guided toward the question of accuracy by a 2019 Stanford study entitled “Racial Disparities in Speech Recognition,” which found exactly those disparities. Speech engines from Amazon, Apple, Google, IBM and Microsoft “exhibited substantial racial disparities, with an average word error rate (WER) of 0.35 for black speakers compared with 0.19 for white speakers.” Not great!
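For the curious, word error rate is typically computed as the word-level edit distance (insertions, deletions and substitutions) between the reference transcript and the engine's output, divided by the length of the reference. A minimal sketch in Python (not any vendor's actual scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # cost of deleting all reference words up to i
    for j in range(len(hyp) + 1):
        d[0][j] = j  # cost of inserting all hypothesis words up to j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

So a WER of 0.35 means roughly one word in three is wrong, e.g. `wer("the cat sat on the mat", "the cat sat on a mat")` comes out to one error over six words.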
The source of this disparity may be partly attributed to a lack of diversity in the datasets used to train these systems. After all, if there are few black speakers in the data, the model will not learn those speech patterns as well. The same may be said for speakers with other accents, dialects, and so on — America (let alone the U.K.) is full of accents and any company claiming to make services for “everyone” should be aware of that.
At any rate, U.K.-based Speechmatics made accuracy in transcribing accented English a priority for its latest model, and it claims to have blown the others out of the water. Based on the same datasets used in the Stanford study (but using the latest versions of the speech software), “Speechmatics recorded an overall accuracy of 82.8% for African American voices compared to Google (68.7%) and Amazon (68.6%),” the company wrote in its press release.
The company credits this success to a relatively new approach to creating a speech recognition model. Traditionally, the machine learning system is provided with labeled data — think an audio file of speech with an accompanying text or metadata file containing what’s being said, usually transcribed and checked by humans. For a cat detection algorithm you’d have images and data saying which ones contain cats, where the cat is in each picture, and so on. This is supervised learning, where a model learns correlations between two forms of prepared data.
Speechmatics used self-supervised learning, a method that’s gained steam in recent years as datasets, learning efficiency, and computational power have grown. In addition to labeled data, it uses raw, unlabeled data and much more of it, building its own “understanding” of speech with far less guidance.
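Speechmatics hasn’t published its exact training objective, but a common self-supervised recipe for speech is masked prediction: hide chunks of the input and make the model reconstruct them, so the “labels” come from the data itself rather than human transcribers. A toy sketch of the two regimes (hypothetical filenames; real systems operate on audio features, not words):

```python
import random

# Supervised: every input comes paired with a human-produced label.
labeled = [("audio_001.wav", "hello world")]

# Self-supervised: inputs only — no transcripts needed.
unlabeled = ["audio_%03d.wav" % i for i in range(2, 100)]

def masked_prediction_targets(tokens, mask_rate=0.3, seed=0):
    """Derive training targets from the input itself: hide some
    tokens and ask the model to reconstruct the hidden ones."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("<MASK>")
            targets[i] = tok  # the "label" is just the hidden input
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = masked_prediction_targets(tokens)
```

The point is that the second pile of data requires no human effort to label, which is what makes a 1.1-million-hour corpus feasible at all.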
In this case the model was based on about 30,000 hours of labeled data to get a sort of base level of understanding, then was fed 1.1 million hours of publicly available audio sourced from YouTube, podcasts, and other content. This type of collection is a bit of a grey area, since no one explicitly consented to have their podcast used to train someone’s commercial speech recognition engine. But it’s being used that way by many, just as “the entire internet” was used to train OpenAI’s GPT-3, probably including thousands of my own articles. (Though it has yet to master my unique voice.)
In addition to improving accuracy for black American speakers, the Speechmatics model claims better transcription for children (about 92% accurate vs about 83% in Google and Deepgram) and small but significant improvements in English with accents from around the world: Indian, Filipino, Southern African and many others — even Scottish.
Speechmatics supports dozens of other languages and is competitive in many of them as well; this isn’t just an English recognition model, but given the language’s use as a lingua franca (a hilariously inapt idiom nowadays), accents are especially important to it.
Speechmatics may be ahead in the metrics it cites, but the AI world moves at an incredibly rapid clip and I would not be surprised to see further leapfrogging over the next year. Google, for instance, is hard at work on making sure its engines work for people with impaired speech. Inclusion is an important part of all AI work these days and it’s good to see companies trying to outdo each other in it.