whisper-large-v3

Maintainer: openai - Last updated 5/28/2024


Model overview

The whisper-large-v3 model is a general-purpose speech recognition model developed by OpenAI. It is the latest version of the Whisper model, building on the previous Whisper large checkpoints. Architecturally it differs from those earlier large models in only two minor ways: the spectrogram input uses 128 Mel frequency bins instead of 80, and a new language token was added for Cantonese.

The original Whisper models were trained on a massive 680,000 hours of audio data: 65% English audio, 18% non-English audio paired with English transcripts, and 17% non-English audio with non-English transcripts, covering 98 languages. The large-v3 checkpoint was further trained on an expanded mixture of roughly 1 million hours of weakly labelled audio and 4 million hours of audio pseudo-labelled by Whisper large-v2. This breadth of data allows the model to perform well on a diverse range of speech recognition and translation tasks without fine-tuning on specific datasets.

Similar models include Whisper medium, Whisper tiny, and a community-maintained port of whisper-large-v3 by Nate Raw. There is also a heavily optimized, much faster variant of the Whisper large model maintained by Vaibhav Srivastav.

Model inputs and outputs

The whisper-large-v3 model takes audio samples as input and generates text transcripts as output. The audio can be in any of the 98 languages covered by the training data. The model can also be used for speech translation, in which case it generates English text from non-English audio; a minimal code sketch follows the input/output list below.

Inputs

  • Audio samples in any of the 98 languages the model was trained on

Outputs

  • Text transcripts of the audio in the same language
  • English translations of non-English audio, when run in speech-translation mode
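
To make this concrete, here is a minimal sketch of calling the model through the Hugging Face Transformers automatic-speech-recognition pipeline; the filename is a placeholder, and the dtype/device settings are optional.

```python
# A minimal sketch of transcribing an audio file with whisper-large-v3 through the
# Hugging Face Transformers pipeline. "sample.mp3" is a placeholder filename.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

result = asr("sample.mp3")
print(result["text"])
```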

Capabilities

The whisper-large-v3 model demonstrates strong performance on a variety of speech recognition and translation tasks, with 10-20% lower error rates compared to the previous Whisper large model. It is robust to accents, background noise, and technical language, and can perform zero-shot translation from multiple languages into English.

However, the model's performance is uneven across languages, with lower accuracy on low-resource and low-discoverability languages where less training data was available. It also has a tendency to generate repetitive or hallucinated text that is not actually present in the audio input.

What can I use it for?

The primary intended use of the Whisper models is for AI researchers studying model capabilities, robustness, and limitations. However, the models can also be quite useful as a speech recognition solution for developers, especially for English transcription tasks.

The Whisper models could be used to build applications that improve accessibility, such as closed captioning or voice-to-text transcription. While the models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build near-real-time applications on top of them.

Things to try

One interesting aspect of the Whisper models is their ability to perform speech translation, generating English transcripts from audio in other languages. Developers could experiment with using the model for tasks like near-real-time interpretation into English or English subtitling of foreign-language content, as in the sketch below.
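
The sketch below illustrates the speech-translation mode via the Transformers pipeline; the filename and the language of the recording are assumptions for the example.

```python
# A minimal sketch of Whisper's speech-translation task: with task="translate",
# the model outputs English text regardless of the spoken language.
# "interview_fr.wav" is a placeholder filename for a French-language recording.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

result = asr("interview_fr.wav", generate_kwargs={"task": "translate"})
print(result["text"])  # English rendering of the French audio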

Another avenue to explore is fine-tuning the pre-trained Whisper model on specific datasets or domains. The blog post Fine-Tune Whisper with Transformers provides a guide on how to fine-tune the model with as little as 5 hours of labeled data, which can improve performance on particular languages or use cases.
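
A condensed sketch of the kind of setup that guide walks through is shown below; the target language, dataset, and hyperparameters are illustrative assumptions, not values taken from the blog post.

```python
# An illustrative sketch of preparing whisper-large-v3 for fine-tuning with
# Hugging Face Transformers; language, output path, and hyperparameters are examples only.
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v3", language="hindi", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-hi",  # illustrative output directory
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    max_steps=4000,
    fp16=True,
    predict_with_generate=True,
)

# A Seq2SeqTrainer would then be built from the model, training_args, a dataset of
# log-Mel input features with tokenized labels, and a padding data collator.
```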



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Models


whisper-large-v2

Maintainer: openai - Last updated 5/28/2024

The whisper-large-v2 model is a pre-trained Transformer-based encoder-decoder model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labeled data by OpenAI, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. Compared to the original Whisper large model, the whisper-large-v2 model has been trained for 2.5x more epochs with added regularization for improved performance.

Model inputs and outputs

Inputs

  • Audio samples: The model takes audio samples as input and performs either speech recognition or speech translation.

Outputs

  • Text transcription: The model outputs text transcriptions of the input audio. For speech recognition, the transcription is in the same language as the audio. For speech translation, the transcription is in a different language than the audio.
  • Timestamps (optional): The model can optionally output timestamps for the transcribed text.

Capabilities

The whisper-large-v2 model exhibits improved robustness to accents, background noise, and technical language compared to many existing ASR systems. It also demonstrates strong zero-shot translation capabilities, allowing it to translate speech from multiple languages into English with high accuracy.

What can I use it for?

The whisper-large-v2 model can be a useful tool for developers building speech recognition and translation applications. Its strong generalization capabilities suggest it may be particularly valuable for tasks like improving accessibility through real-time captioning, language translation, and other speech-to-text use cases. However, the model's performance can vary across languages, accents, and demographics, so users should carefully evaluate its performance in their specific domain before deployment.

Things to try

One interesting aspect of the whisper-large-v2 model is its ability to perform long-form transcription of audio samples longer than 30 seconds. By using a chunking algorithm, the model can transcribe audio of arbitrary length, making it a useful tool for transcribing podcasts, lectures, and other long-form audio content. Users can also experiment with fine-tuning the model on their own data to further improve its performance for specific use cases.
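
The chunked long-form transcription described above can be tried with the Transformers pipeline's chunk_length_s argument; the sketch below assumes a placeholder podcast file.

```python
# A minimal sketch of long-form transcription with chunked inference; audio longer
# than Whisper's 30-second window is split into chunks and stitched back together.
# "podcast_episode.mp3" is a placeholder filename.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
    batch_size=8,  # decode several chunks in parallel
)

print(asr("podcast_episode.mp3")["text"])
```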



whisper-large

Maintainer: openai - Last updated 5/28/2024

The whisper-large model is a pre-trained AI model for automatic speech recognition (ASR) and speech translation, developed by OpenAI. Trained on 680k hours of labelled data, the Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. The whisper-large-v2 model is a newer version that surpasses the performance of the original whisper-large model, with no architecture changes. The whisper-medium model is a slightly smaller version with 769M parameters, while the whisper-tiny model is the smallest at 39M parameters. All of these Whisper models are available on the Hugging Face Hub.

Model inputs and outputs

Inputs

  • Audio samples, which the model converts to log-Mel spectrograms

Outputs

  • Textual transcriptions of the input audio, either in the same language as the audio (for speech recognition) or in a different language (for speech translation)
  • Optional timestamps for the transcriptions

Capabilities

The Whisper models demonstrate strong performance on a variety of speech recognition and translation tasks, exhibiting improved robustness to accents, background noise, and technical language. They can also perform zero-shot translation from multiple languages into English. However, the models may occasionally produce text that is not actually spoken in the audio input, a phenomenon known as "hallucination". Their performance also varies across languages, with lower accuracy on low-resource and less common languages.

What can I use it for?

The Whisper models are primarily intended for use by AI researchers studying model robustness, generalization, capabilities, biases, and constraints. However, the models can also be useful for developers looking to build speech recognition or translation applications, especially for English speech. The models' speed and accuracy make them well-suited for applications that require transcription or translation of large volumes of audio data, such as accessibility tools, media production, and language learning. Developers can build applications on top of the models to enable near-real-time speech recognition and translation.

Things to try

One interesting aspect of the Whisper models is their ability to perform long-form transcription of audio samples longer than 30 seconds. This is achieved through a chunking algorithm that allows the model to process audio of arbitrary length. Another unique feature is the model's ability to automatically detect the language of the input audio and perform the appropriate speech recognition or translation task. Developers can leverage this by providing the model with "context tokens" that inform it of the desired task and language. Finally, the pre-trained Whisper models can be fine-tuned on smaller datasets to further improve their performance on specific languages or domains. The Fine-Tune Whisper with Transformers blog post provides a step-by-step guide on how to do this.
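
The task and language context tokens mentioned above can be set through the pipeline's generate_kwargs; in the sketch below, the filename and the language of the recording are assumptions.

```python
# A minimal sketch of steering whisper-large with task/language context tokens.
# Leaving them unset lets the model auto-detect the spoken language.
# "meeting_de.wav" is a placeholder filename for a German-language recording.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large")

auto = asr("meeting_de.wav")  # language auto-detected, transcribed as spoken
forced = asr(
    "meeting_de.wav",
    generate_kwargs={"language": "german", "task": "transcribe"},
)
translated = asr(
    "meeting_de.wav",
    generate_kwargs={"task": "translate"},  # English output
)
print(auto["text"], forced["text"], translated["text"], sep="\n")
```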



whisper-large-v3-turbo

Maintainer: openai - Last updated 11/2/2024

The whisper-large-v3-turbo model is a finetuned version of the pruned Whisper large-v3 model. It is the exact same model, except that the number of decoding layers has been reduced from 32 to 4, making the model significantly faster while only experiencing a minor quality degradation. The Whisper model was proposed by Alec Radford et al. from OpenAI and demonstrates strong generalization across many datasets and domains in a zero-shot setting.

Model inputs and outputs

The whisper-large-v3-turbo model is designed for automatic speech recognition (ASR) and speech translation. It takes audio samples as input and outputs text transcriptions.

Inputs

  • Audio samples: The model accepts arbitrary length audio inputs, which it can process efficiently using a chunked inference algorithm.

Outputs

  • Text transcriptions: The model outputs text transcriptions of the input audio, either in the same language as the audio (for ASR) or in a different language (for speech translation).
  • Timestamps: The model can optionally provide timestamps for each transcribed sentence or word.

Capabilities

The whisper-large-v3-turbo model exhibits improved robustness to accents, background noise, and technical language compared to many existing ASR systems. It also demonstrates strong zero-shot translation capabilities, allowing it to transcribe audio in one language and output the text in a different language.

What can I use it for?

The whisper-large-v3-turbo model is primarily intended for AI researchers studying the capabilities, biases, and limitations of large language models. However, it can also be a useful ASR solution for developers, especially for English speech recognition tasks. The speed and accuracy of the model suggest that others may be able to build applications on top of it that allow for near-real-time speech recognition and translation.

Things to try

One key capability to explore with the whisper-large-v3-turbo model is its ability to handle long-form audio. By using the chunked inference algorithm provided in the Transformers library, the model can efficiently transcribe audio files of arbitrary length. Developers could experiment with using this feature to build applications that provide accurate transcriptions of podcasts, interviews, or other long-form audio content. Another interesting aspect to investigate is the model's performance on non-English languages and its zero-shot translation capabilities. Users could try transcribing audio in different languages and evaluating the quality of the translations to English, as well as exploring ways to fine-tune the model for specific language pairs or domains.
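
The sentence- and word-level timestamps mentioned above are exposed through the pipeline's return_timestamps argument; the filename in the sketch below is a placeholder.

```python
# A minimal sketch of requesting timestamps from whisper-large-v3-turbo.
# "lecture.wav" is a placeholder filename.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    chunk_length_s=30,
)

segments = asr("lecture.wav", return_timestamps=True)   # segment-level timestamps
words = asr("lecture.wav", return_timestamps="word")    # word-level timestamps

for chunk in segments["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```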



whisper-medium.en

Maintainer: openai - Last updated 9/6/2024

The whisper-medium.en model is an English-only version of the Whisper pre-trained model for automatic speech recognition (ASR). Developed by OpenAI, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. The model was trained on 680k hours of labelled speech data using large-scale weak supervision. Similar models in the Whisper family include the whisper-tiny.en, whisper-small, and whisper-large checkpoints, which vary in size and performance. The whisper-medium.en model sits in the middle of this range, with 769 million parameters.

Model inputs and outputs

Inputs

  • Audio waveform as a numpy array
  • Sampling rate of the input audio

Outputs

  • Text transcription of the input audio, in English
  • Optionally, timestamps for the start and end of each transcribed text chunk

Capabilities

The whisper-medium.en model exhibits improved robustness to accents, background noise, and technical language compared to many existing ASR systems, and its accuracy on English speech recognition is near state-of-the-art level. (Zero-shot translation into English and multilingual recognition are features of the multilingual Whisper checkpoints, not the English-only .en models.) However, the model's weakly supervised training on large-scale noisy data means it may generate text that is not actually spoken in the audio input (hallucination), and its sequence-to-sequence architecture makes it prone to generating repetitive text.

What can I use it for?

The whisper-medium.en model is primarily intended for use by AI researchers studying the robustness, generalization, capabilities, biases, and limitations of large language models. However, it may also be useful as an ASR solution for developers, especially for English speech recognition. The model's transcription capabilities could potentially be used to improve accessibility tools. While the model cannot be used for real-time transcription out of the box, its speed and size suggest that others may be able to build applications on top of it that enable near-real-time speech recognition. There are also potential concerns around dual use, as the model's capabilities could enable more actors to build surveillance technologies or scale up existing efforts. The model may also have some ability to recognize specific individuals, which raises safety and privacy concerns.

Things to try

Because whisper-medium.en is an English-only checkpoint, it does not perform speech translation; that capability requires the multilingual checkpoints such as whisper-medium. One area to explore instead is the model's robustness to different types of audio input, such as recordings with background noise, strong accents, or technical terminology, and how its performance varies across different English accents and demographics. Finally, you could look into fine-tuning the pre-trained whisper-medium.en model on a specific dataset or task, as described in the Fine-Tune Whisper with Transformers blog post. This could help improve the model's predictive capabilities for certain use cases.
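
The waveform-plus-sampling-rate input format described above looks roughly like the sketch below; the public test dataset is used purely to obtain a sample clip and is an assumption of the example.

```python
# A minimal sketch of passing a raw waveform and its sampling rate to
# whisper-medium.en; the dummy LibriSpeech dataset only supplies an example clip.
from datasets import load_dataset
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-medium.en")

ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
audio = ds[0]["audio"]  # dict with "array" (numpy waveform) and "sampling_rate"

result = asr({"array": audio["array"], "sampling_rate": audio["sampling_rate"]})
print(result["text"])
```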
