AI for translation and transcription? Whisper is all you need.
December 15, 2023
5 min read

Automatic Speech Recognition (ASR) is a core component of accessibility tools like Siri, Alexa, and Google Assistant, and an important step toward a hands-free experience of computing devices. Whisper is a significant advancement in the ASR industry. It is the latest model developed by OpenAI, the AI research organization that also built GPT-3. Whisper is a multilingual, multitask model that performs at the human level on English ASR. Because Whisper is open-sourced, the future for translation and transcription, along with tasks like voice activity detection and speaker detection, looks promising.

A Commercial Perspective

As significant an academic milestone as Whisper is, the authors believe it can also be quite useful in a commercial setting. Whisper's performance on several benchmark datasets shows that it achieves human-level results on English audio. Nor is Whisper limited to English: with proper fine-tuning, it is expected to perform strongly in roughly ten other widely spoken languages (Spanish, French, Italian, Russian, etc.). Strength as a language model is not the only factor that makes Whisper a tempting alternative to your current or prospective ASR model. The speed and size of the Whisper family of models (it ships in several size variants) suggest they can serve as a strong core for near-real-time systems. Out of the box, Whisper's robustness and accuracy make it a potential plug-and-play option when the task at hand is straightforward (transcription, translation) and in English; otherwise, a little fine-tuning should do the trick.
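As a rough sketch of that plug-and-play workflow, here is what transcription or translation looks like with the open-source `openai-whisper` package. The file name `meeting.wav` is a placeholder, and the import is deferred so the sketch can be read without the package installed:

```python
def transcribe_file(path, task="transcribe", language=None, model_name="base"):
    """Transcribe (or translate to English) an audio file with Whisper.

    task: "transcribe" keeps the source language; "translate" outputs English.
    language: ISO code such as "en"; pinning it avoids misdetection on short clips.
    """
    import whisper  # deferred so this sketch stays importable without the package

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(path, task=task, language=language)
    return result["text"]

# Example usage (requires the openai-whisper package and an audio file):
# print(transcribe_file("meeting.wav", task="transcribe", language="en"))
```

Swapping `model_name` between variants like `"tiny"` and `"large"` is how one trades accuracy for the near-real-time speed mentioned above.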

Technical Deep Dive

Introduction to the Problem Space

For any AI to actually be ‘intelligent’, it requires training, and the quality of the training data largely determines the quality of the model's predictions. For tasks like Automatic Speech Recognition, supervised learning has proven ideal: models trained this way exhibit better robustness and a better general understanding of language. But the relatively small size of such datasets (~1,000 hours) has pushed academia toward unsupervised learning, where researchers have managed to gather many times more data (~1,000,000 hours). Models trained on this kind of data, however, were often found to be limited in terms of generalization, and fine-tuning them for dataset-specific applications came with its own recommended set of protocols.

Source: The Stanford AI Lab

Why Whisper?

Whisper addresses this issue. One important contribution of Whisper is demonstrating the potential of weakly supervised data for speech recognition. In such data, the labels are not necessarily gold standard: they are generated either by people who are not Subject Matter Experts (SMEs) or by pre-trained models. Whisper is trained on a large (~680,000 hours) and diverse dataset, about a third of which is not in English. The authors assembled it by collecting audio-transcript pairs from the internet and filtering out pairs whose transcripts appeared to have been produced by existing speech recognition models (to eliminate any influence those models could have on Whisper). The resulting data was further processed to make sure the language spoken in the audio matches the language of the transcript. This tedious process helps explain why OpenAI has decided not to make the dataset public. By employing weakly supervised data for training, the authors traded label quality for quantity, and the results speak for themselves.

The zero-shot performance of Whisper on diverse datasets is better than that of most SOTA (state-of-the-art) models, making fewer than half the errors those models make. On top of this, Whisper is designed to handle a variety of speech-processing tasks out of the box (more on this later).

Model Architecture

Whisper's architecture is the same encoder-decoder transformer proposed in the paper that introduced the transformer, ‘Attention Is All You Need’. The input to the transformer is an audio file, but before the audio reaches the transformer it is processed in two steps. First, the audio is converted into a log-Mel spectrogram. The Mel scale produces perceptually relevant frequency representations: on the ordinary Hertz scale, two pairs of frequencies at the same distance, say 1000 Hz and 1200 Hz versus 40 Hz and 240 Hz, do not sound equally far apart to humans, whereas any two pairs of frequencies the same distance apart in Mels do. Once the spectrogram is ready, its dimensions must match the transformer's width; a stem of two convolutional layers takes care of this while also extracting information from the spectrogram. Second, the outputs of the convolutional stem are positionally encoded, which lets the model keep track of the order of the sequential information in the input audio.
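The Mel-scale intuition can be made concrete. A common Hz-to-Mel formula (the HTK-style one used by many audio libraries; Whisper's own filterbank may differ in detail) shows that the same 200 Hz gap spans far more Mels at low frequencies than at high ones:

```python
import math

def hz_to_mel(f):
    """Convert a frequency in Hz to Mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

# Both gaps are 200 Hz wide, but perceptually (in Mels) the low one is larger.
low_gap = hz_to_mel(240) - hz_to_mel(40)      # 40 Hz -> 240 Hz
high_gap = hz_to_mel(1200) - hz_to_mel(1000)  # 1000 Hz -> 1200 Hz
print(round(low_gap), round(high_gap))  # roughly 270 vs 125 Mels
```

This is exactly why equal steps on the Mel axis of the spectrogram correspond to roughly equal perceived pitch differences.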

The processed input is passed to the transformer's stack of encoders. Each encoder block comprises self-attention and a multi-layer perceptron (MLP). The intuition behind a ‘stack’ of encoders is that multiple blocks are needed to capture the complexity of language. What the encoder learns is passed to the decoder through cross-attention, and this information helps the decoder predict the next output token. At any given step, the decoder's input is the cross-attention with the encoder plus its own previous output (a start token, if it is the beginning of the sequence), which makes it an auto-regressive model.
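To make the auto-regressive loop concrete, here is a toy greedy decoder. This is not Whisper's real decoder; a lookup table stands in for the transformer's next-token prediction, and conditioning on the full prefix plus encoder cross-attention is elided. It only illustrates the idea that decoding starts from a start token and appends one prediction at a time until an end token appears:

```python
START, EOT = "<sot>", "<eot>"

# Stand-in for the decoder: maps the last emitted token to the next one.
NEXT_TOKEN = {
    START: "hello",
    "hello": "world",
    "world": EOT,
}

def greedy_decode(next_token_fn, max_len=10):
    """Emit tokens one at a time, feeding each output back in as input."""
    tokens = [START]
    while tokens[-1] != EOT and len(tokens) < max_len:
        tokens.append(next_token_fn(tokens[-1]))
    return tokens

print(greedy_decode(NEXT_TOKEN.get))
# -> ['<sot>', 'hello', 'world', '<eot>']
```

The `max_len` cap mirrors how real decoders bound generation length in case the end token is never produced.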

Source: Robust Speech Recognition via Large-Scale Weak Supervision

What makes Whisper important is its ability to multitask. Most ASR systems require complex pipelines to incorporate different tasks alongside speech recognition. Whisper, on the other hand, can perform a variety of speech-processing tasks such as recognition, translation, and language identification without explicit pipelines. All these tasks are represented together in a sequence of tokens predicted by the decoder, making Whisper a potential go-to model for most speech-processing applications. The beginning of a prediction is indicated by the SOT [START OF TRANSCRIPT] token; the model then predicts the language being spoken [LANGUAGE TAG], followed by the specified task token: translation [TRANSLATE] or transcription [TRANSCRIBE]. A timestamp token is then generated, depending on whether the user has requested timing information. Finally, once the task is complete, an EOT [END OF TRANSCRIPT] token is generated. If the model fails to detect any speech in the provided audio, it outputs a [NO SPEECH] token.
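The control-token prefix described above can be sketched as a small builder. The spellings follow the `<|...|>` convention of Whisper's tokenizer (e.g. `<|startoftranscript|>`, `<|transcribe|>`), though the exact strings here should be treated as illustrative rather than authoritative:

```python
def build_prompt(language="en", task="transcribe", timestamps=False):
    """Assemble the multitask control-token prefix the decoder is conditioned on."""
    assert task in ("transcribe", "translate")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")  # suppresses timestamp prediction
    return tokens

print(build_prompt("es", "translate"))
# -> ['<|startoftranscript|>', '<|es|>', '<|translate|>', '<|notimestamps|>']
```

Because the task is just another token in the sequence, switching from transcription to translation means changing one token rather than swapping out a pipeline component.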


Since the training data was generated in a weakly supervised fashion, the models (across size variants) sometimes output text that is not actually in the audio file; the authors refer to this as hallucination. Whisper also, as one might expect, underperforms on languages with less training data, and the authors report hiccups in performance in certain languages when handling dialects and accents.

Our Thoughts

During our time with Whisper, we found it very beginner-friendly. The model is well-documented and easy to work with, much like our pre-assembled AI blueprints that could address your AI needs. We found Whisper to be quite good at dealing with audio recorded in noisy environments, audio containing technical jargon, sped-up audio, clips from rap songs, and a range of accents. On the flip side, there have been instances where certain variants of the model performed below par. We have also observed cases where multilingual variants of Whisper, asked to ‘transcribe’ English speech, instead rendered the output in the language of the speaker's accent. We believe a little fine-tuning should fix this. Overall, we are quite satisfied with Whisper's performance, and we feel that careful fine-tuning to the task at hand can address any shortcomings.