The world speaks roughly 6,500 languages, yet most technology products understand only a handful, which creates real friction between users and products. Big Tech recognizes speech recognition as an essential function, and companies are working hard to teach their machines to recognize and speak more languages to improve their business impact. Alexa, Siri, and Google Assistant are some of the best-known names in the field.
Expansion of the Machine Learning Language
Because speech technology is available for only a small fraction of the thousands of languages spoken globally, the remaining languages represent untapped room for expansion for Facebook and other social media giants. To address this problem, Facebook had its AI team develop wav2vec Unsupervised (wav2vec-U). With this technology, Facebook engineers aim to build speech recognition systems that can interact with users of any language.
Unsupervised ASR Market Size & Players
The speech and voice recognition market is expected to grow from USD 7.5 billion in 2018 to USD 21.5 billion by 2024, a CAGR of 19.18%. There are already plenty of AI speech transcription platforms on the market: Braina, Windows Speech Recognition, CMU Sphinx, and Dragon Speech Recognition, to name a few. The usual approach is to train such systems on large amounts of high-quality transcribed speech audio. Facebook’s unsupervised AI model breaks from that process, which could help users from lesser-known ethnolinguistic groups enjoy social technologies in their businesses.
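The growth rate quoted above can be sanity-checked with the standard compound annual growth rate formula. This is a minimal sketch; the function name `cagr` is chosen here for illustration, and the inputs are the market figures from the text.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.19 means 19%)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the text: USD 7.5 billion (2018) to USD 21.5 billion (2024).
rate = cagr(7.5, 21.5, years=2024 - 2018)
print(f"{rate:.2%}")
```

The result comes out at roughly 19.2%, in line with the quoted 19.18%.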
Facebook’s Unsupervised Wav2vec-U on Language Maturity
Facebook’s unsupervised ML model, wav2vec-U, is an innovation of high merit. It can build speech recognition systems without any transcribed speech audio. Facebook reports that it rivals the best supervised models of only a few years ago, which were trained on nearly 1,000 hours of transcribed speech. The engineers at Facebook tested wav2vec-U on lower-resource languages such as Swahili, Tatar, and Kyrgyz, which have lacked high-quality speech recognition models because too little transcribed training data exists for them. Wav2vec-U builds on self-supervised learning and on ideas from unsupervised machine translation, with the aim of bringing state-of-the-art speech recognition to more communities and improving communication between them.
Wav2vec-U learns from recorded speech audio and unpaired text, without any transcriptions. Its architecture is a novel one compared to previous Automatic Speech Recognition (ASR) systems. It starts by learning the structure of speech from unlabeled audio. Then, using wav2vec 2.0 representations and a simple k-means clustering method, it segments the voice recording into speech units that loosely correspond to individual sounds.
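The segmentation step described above can be illustrated with a toy example. This is only a sketch under stated assumptions: the two-dimensional vectors stand in for per-frame wav2vec 2.0 representations, the farthest-point initialization is chosen here to keep the example deterministic, and the helper names (`sq_dist`, `kmeans`, `segment`) are hypothetical rather than taken from Facebook's code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(frames, k, iterations=10):
    """Plain k-means; returns one cluster id per frame."""
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centroids = [frames[0]]
    while len(centroids) < k:
        centroids.append(
            max(frames, key=lambda f: min(sq_dist(f, c) for c in centroids))
        )
    assignments = [0] * len(frames)
    for _ in range(iterations):
        # Assignment step: each frame goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(frame, centroids[c]))
            for frame in frames
        ]
        # Update step: each centroid moves to the mean of its frames.
        for c in range(k):
            members = [f for f, a in zip(frames, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


def segment(assignments):
    """Merge runs of identical cluster ids into (start, end, cluster_id) spans."""
    segments, start = [], 0
    for i in range(1, len(assignments) + 1):
        if i == len(assignments) or assignments[i] != assignments[start]:
            segments.append((start, i, assignments[start]))
            start = i
    return segments


# Toy stand-ins for frame features: three distinct "sounds", the first recurring.
frames = [
    [0.0, 0.1], [0.1, 0.0],              # sound A
    [5.0, 5.2], [5.1, 4.9], [4.9, 5.0],  # sound B
    [9.8, 0.2], [10.0, 0.0],             # sound C
    [0.1, 0.1],                          # sound A again
]
segments = segment(kmeans(frames, k=3))
print(segments)
```

Each span marks a stretch of consecutive frames that fell into the same cluster, which is the sense in which the recording is cut into units that loosely correspond to individual sounds.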
The Skeleton of the Unsupervised Wav2vec-U
To recognize the words in a recording, scientists at Facebook train a generative adversarial network (GAN) comprising a generator and a discriminator. The generator takes each audio segment and predicts the phoneme (i.e., unit of sound) that corresponds to it. It is trained by trying to fool the discriminator, which evaluates whether the predicted phoneme sequences look realistic. At first, both the generator and the discriminator produce poor output, but they improve substantially with time.
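The tug-of-war between the two networks can be made concrete with the loss functions each one minimizes. The sketch below uses the standard binary cross-entropy GAN losses as an assumption; the objective actually used for wav2vec-U may differ, and the probabilities are hypothetical values invented for illustration.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for one probability and a 0/1 target."""
    eps = 1e-12  # guards against log(0)
    return -(target * math.log(prob + eps)
             + (1 - target) * math.log(1 - prob + eps))


def discriminator_loss(real_probs, fake_probs):
    """The discriminator wants real sequences scored 1 and generated ones 0."""
    losses = [bce(p, 1) for p in real_probs] + [bce(p, 0) for p in fake_probs]
    return sum(losses) / len(losses)


def generator_loss(fake_probs):
    """The generator wants its sequences to be scored as real (target 1)."""
    return sum(bce(p, 1) for p in fake_probs) / len(fake_probs)


# Hypothetical discriminator outputs: probability that a phoneme sequence is real.
real = [0.90, 0.95]        # sequences derived from real text
early_fake = [0.05, 0.10]  # early training: generated sequences are easy to spot
later_fake = [0.45, 0.55]  # later: the discriminator can barely tell them apart

print(generator_loss(early_fake), generator_loss(later_fake))
print(discriminator_loss(real, early_fake), discriminator_loss(real, later_fake))
```

As the generated sequences become harder to distinguish from real ones, the generator's loss falls and the discriminator's loss rises, which is what "fooling the discriminator" means in practice.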
Unsupervised Wav2vec-U in Public Domain
Facebook says it has tested how well wav2vec-U works in practice. It was first evaluated on a benchmark, the TIMIT Acoustic-Phonetic Continuous Speech Corpus. Trained on just “9.6 hours of speech and 3,000 sentences of text data,” the model reduced the error rate by 63% compared with other best-in-class unsupervised methods. According to Michael Auli, research scientist manager at Facebook AI, training a wav2vec-U model takes half a day on a single GPU, which makes it practical for a wider audience to build speech technology for many more of the world’s languages. What’s more, Facebook has made the model’s code available on GitHub for public use.
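For readers unfamiliar with how such error rates are measured, the sketch below computes a phoneme error rate as the edit distance between a reference and a predicted phoneme sequence, divided by the reference length. The sequences and the baseline numbers are hypothetical toy values, not TIMIT results, and a figure like the 63% above is presumably a relative reduction of the kind computed by `relative_reduction`.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the length of the reference sequence."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which the improved error rate undercuts the baseline."""
    return (baseline - improved) / baseline


# Hypothetical phoneme sequences: one substitution in four phonemes.
reference = ["h", "eh", "l", "ow"]
hypothesis = ["h", "ah", "l", "ow"]
print(phoneme_error_rate(reference, hypothesis))

# Toy numbers only: halving an error rate is a 50% relative reduction.
print(relative_reduction(0.5, 0.25))
```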
Prior Work: Hidden Markov Model
In a paper published by Springer, researchers discuss an ASR-based unsupervised adaptation method for Hidden Markov Model (HMM) speech synthesis and evaluate its quality. “The adaptation technique automatically controls the number of phone mismatches. The evaluation involves eight different HMM voices, including supervised and unsupervised speaker adaptation. The effects of segmentation and linguistic labeling errors in adaptation data are also investigated. The results show that unsupervised adaptation can contribute to speeding up the creation of new HMM voices with comparable quality to supervised adaptation.”
Unsupervised wav2vec-U: The Future
Facebook AI has made huge progress, first with the introduction of wav2vec, then with wav2vec 2.0, and now with wav2vec-U. The belief is that this progress will lead to highly effective speech recognition technology for many more of the world’s languages and dialects. Facebook’s purpose in releasing the code is to let others build speech recognition systems from unlabeled speech audio recordings and unpaired text.
As ASR technology evolves, it forges new paths for businesses and communities to flourish. A machine learning system that learns by observation, rather than from annotated examples, reduces the need for large amounts of labeled data. “Developing these sorts of more intelligent systems is an ambitious, long-term scientific vision, and we believe wav2vec-U will help us advance toward that important and exciting goal,” says Facebook.