## Model overview

The `distil-large-v2` model is a distilled version of the Whisper `large-v2` model. It is **6 times faster**, 49% smaller, and performs **within 1% WER** on out-of-distribution evaluation sets compared to the larger Whisper model. This makes it a more efficient alternative for speech recognition tasks. The [Distil-Whisper repository](https://github.com/huggingface/distil-whisper/) provides the training code used to create this model.

## Model inputs and outputs

The `distil-large-v2` model is a speech recognition model that takes audio as input and outputs text transcriptions. It processes audio in windows of up to 30 seconds: clips within that limit are transcribed in a single pass (short-form), while longer recordings are handled through chunking (long-form).

### Inputs
- Audio data (e.g. WAV or MP3 files)

### Outputs
- Text transcription of the input audio
- Optional: Timestamps for the transcribed text
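
As a minimal sketch of how the optional timestamps might be consumed, the helper below formats a timestamped result into readable lines. It assumes the result layout returned by the Hugging Face Transformers speech recognition pipeline when called with `return_timestamps=True` (a `"text"` string plus a `"chunks"` list of `{"timestamp": (start, end), "text": ...}` entries). The function name `format_chunks` and the sample data are hypothetical.

```python
def format_chunks(result):
    """Turn a timestamped ASR result into one readable line per segment."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The final segment may have no end time, so handle None explicitly.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s -> {end_label}] {chunk['text'].strip()}")
    return lines


# Made-up sample in the assumed pipeline output layout.
sample_result = {
    "text": " Hello world. Goodbye.",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Hello world."},
        {"timestamp": (1.5, None), "text": " Goodbye."},
    ],
}

for line in format_chunks(sample_result):
    print(line)
```

A result without timestamps has no `"chunks"` key, in which case the helper returns an empty list and only `result["text"]` is available.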

## Capabilities

The `distil-large-v2` model demonstrates strong performance on speech recognition tasks, performing within 1% WER of the larger Whisper `large-v2` model. It is particularly adept at handling accents, background noise, and technical language. Note that Distil-Whisper checkpoints are trained for English speech recognition, so the model should not be relied on for translation or for transcribing other languages.

## What can I use it for?

The `distil-large-v2` model is well-suited for applications that require efficient and accurate speech recognition, such as automated transcription, accessibility tools, and language learning applications. Its speed and size also suggest that it could be used as a building block for more complex speech-to-text systems.

## Things to try

One interesting aspect of the `distil-large-v2` model is its ability to perform long-form transcription through the use of a chunking algorithm. This allows the model to transcribe audio samples of arbitrary length, which could be useful for transcribing podcasts, lectures, or other long-form audio content.
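
The idea behind chunking can be sketched in a few lines: split the recording into overlapping windows, transcribe each window, and merge the results. The sketch below only computes the window boundaries. The function name `chunk_spans` and the 15-second chunk and 3-second overlap defaults are illustrative assumptions, not the exact values or logic used by any particular library.

```python
def chunk_spans(total_s, chunk_s=15.0, overlap_s=3.0):
    """Return (start, end) times in seconds that cover [0, total_s] with overlap."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    step = chunk_s - overlap_s  # how far each new window advances
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        start += step
    return spans


# A 40-second recording split into 15-second windows that overlap by 2.5 seconds.
print(chunk_spans(40, chunk_s=15.0, overlap_s=2.5))
# [(0.0, 15.0), (12.5, 27.5), (25.0, 40)]
```

The overlap gives neighbouring windows some shared audio, so words cut off at a boundary in one window appear whole in the next and can be reconciled when the transcripts are merged.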

## Model overview

The `distil-medium.en` model is a distilled version of the Whisper `medium.en` model, proposed in the paper [Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430). It is roughly 6 times faster, 49% smaller, and performs within 1% word error rate (WER) on out-of-distribution evaluation sets compared to the original Whisper `medium.en` model. This makes it an efficient alternative for English speech recognition tasks.

The model is part of the [Distil-Whisper](https://aimodels.fyi/creators/huggingFace/distil-whisper) family, which contains several distilled variants of the Whisper model. Another example is the [distil-large-v2](https://aimodels.fyi/models/huggingFace/distil-large-v2-distil-whisper) model, which performs within 1% WER of the original Whisper `large-v2` model.

## Model inputs and outputs

### Inputs
- **Audio data**: The model takes audio data as input, in the form of log-Mel spectrograms.

### Outputs
- **Transcription text**: The model outputs an English text transcript of the input audio.
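
To give a feel for the size of a log-Mel spectrogram input, the arithmetic below works out its shape for one input window. None of the constants are stated above: the 16 kHz sample rate, 30-second window, 160-sample hop (10 ms), and 80 Mel bins are assumptions based on Whisper's commonly documented front-end settings for this model size.

```python
SAMPLE_RATE_HZ = 16_000   # assumed audio sample rate
WINDOW_SECONDS = 30       # assumed length of one input window
HOP_SAMPLES = 160         # assumed hop between spectrogram frames (10 ms)
N_MEL_BINS = 80           # assumed number of Mel frequency bins


def samples_per_window():
    """Number of raw audio samples in one input window."""
    return SAMPLE_RATE_HZ * WINDOW_SECONDS


def spectrogram_shape():
    """Shape (Mel bins, frames) of the spectrogram for one window."""
    frames = samples_per_window() // HOP_SAMPLES
    return (N_MEL_BINS, frames)


print(samples_per_window())  # 480000
print(spectrogram_shape())   # (80, 3000)
```

Under these assumptions, every window reaches the encoder with the same shape, which is why shorter clips are padded and longer recordings have to be split.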

## Capabilities

The `distil-medium.en` model demonstrates strong performance on English speech recognition tasks, achieving a short-form WER of 11.1% and a long-form WER of 12.4% on out-of-distribution evaluation sets. It is significantly more efficient than the original Whisper medium.en model, running 6.8 times faster with 49% fewer parameters.
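
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The function below is a minimal sketch of that definition. The name `word_error_rate` is hypothetical, and the sketch skips the text normalisation (casing, punctuation) that published evaluations typically apply before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One of six reference words is missing from the hypothesis.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 11.1% therefore means roughly one word-level error for every nine reference words.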

## What can I use it for?

The `distil-medium.en` model is well-suited for a variety of English speech recognition applications, such as transcribing audio recordings, live captioning, and voice-to-text conversion. Its efficiency makes it a practical choice for real-world deployment, particularly in scenarios where latency and model size are important considerations.

## Things to try

You can use the `distil-medium.en` model with the Hugging Face Transformers library to perform short-form transcription of audio samples. The model can also be used for long-form transcription by leveraging the chunking capabilities of the `pipeline` class, allowing it to handle audio files of arbitrary length.
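
A minimal sketch of the long-form setup might look like the following. It assumes the checkpoint is published on the Hugging Face Hub as `distil-whisper/distil-medium.en`, that `transformers` (with a backend such as PyTorch) and `ffmpeg` are installed, and that a 15-second chunk length and batch size of 16 are reasonable starting values. The helper names are hypothetical.

```python
MODEL_ID = "distil-whisper/distil-medium.en"  # assumed Hub identifier


def long_form_pipeline_kwargs(chunk_length_s=15, batch_size=16):
    """Arguments for transformers.pipeline that enable chunked transcription."""
    return {
        "task": "automatic-speech-recognition",
        "model": MODEL_ID,
        "chunk_length_s": chunk_length_s,  # split long audio into windows
        "batch_size": batch_size,          # transcribe several windows at once
    }


def transcribe_long(audio_path):
    """Transcribe an audio file of arbitrary length and return the text."""
    # Imported lazily so the kwargs helper works without transformers installed.
    from transformers import pipeline

    asr = pipeline(**long_form_pipeline_kwargs())
    return asr(audio_path)["text"]
```

Calling `transcribe_long("lecture.mp3")` (a placeholder file name) downloads the checkpoint on first use. Leaving out `chunk_length_s` gives the short-form behaviour for clips that fit in a single window.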

Additionally, the Distil-Whisper repository provides training code that you can use to distill the Whisper model on other languages, expanding the model's capabilities beyond English. If you're interested in distilling Whisper for your language, be sure to check out the [training code](https://github.com/huggingface/distil-whisper/tree/main/training).

## Model overview

The `distil-small.en` model is a distilled version of the Whisper [small.en](https://huggingface.co/openai/whisper-small.en) model, built with the method proposed in the paper [Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430). It is the smallest Distil-Whisper checkpoint, with just 166M parameters, which makes it the natural choice for memory-constrained applications. The paper reports that Distil-Whisper is 6 times faster, 49% smaller, and within 1% WER on out-of-distribution evaluation sets compared to Whisper; those headline figures describe the approach in general and are not a measurement of `distil-small.en` against `small.en` specifically. For most other applications, the [distil-medium.en](https://aimodels.fyi/models/huggingFace/distil-mediumen-distil-whisper) or [distil-large-v2](https://aimodels.fyi/models/huggingFace/distil-large-v2-distil-whisper) checkpoints are recommended, since they are both faster and achieve better WER results.

## Model inputs and outputs

The `distil-small.en` model is an automatic speech recognition (ASR) model that takes audio as input and generates a text transcript as output. It uses an encoder-decoder architecture, where the encoder maps the audio input to a sequence of hidden representations, and the decoder auto-regressively generates the output text.

### Inputs
- Audio data in the form of a raw waveform or log-Mel spectrogram

### Outputs
- A text transcript of the input audio

## Capabilities

The `distil-small.en` model is capable of transcribing English speech with high accuracy, even on out-of-distribution datasets. It demonstrates robust performance in the presence of accents, background noise, and technical language. The distilled model maintains performance close to the larger Whisper [small.en](https://aimodels.fyi/models/huggingFace/whisper-smallen-openai) model, while being significantly faster and smaller.

## What can I use it for?

The `distil-small.en` model is well-suited for deployment in memory-constrained environments, such as on-device applications, where the small model size is a key requirement. It can be used to add high-quality speech transcription capabilities to a wide range of applications, from accessibility tools to voice interfaces.

## Things to try

One interesting thing to try with the `distil-small.en` model is to use it as an assistant model for [speculative decoding](https://huggingface.co/blog/whisper-speculative-decoding) with the larger Whisper models. In this setup, `distil-small.en` drafts candidate tokens and the larger Whisper model verifies them, so you obtain exactly the same outputs as Whisper alone while being 2 times faster. The combination can therefore serve as a drop-in replacement for existing Whisper pipelines.
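
The toy sketch below illustrates why greedy speculative decoding reproduces the main model's output exactly: every token that is kept is the main model's own choice, and the assistant's drafts only decide how many tokens are confirmed per round. The "models" here are stand-in Python functions over a fixed word list, and all names are hypothetical. In Transformers, the linked blog post passes the assistant to generation via the `assistant_model` argument instead of using a hand-written loop like this.

```python
TARGET = ["the", "cat", "sat", "on", "the", "mat", "<eos>"]


def main_next(tokens):
    """Stand-in for the large model: always predicts the correct next word."""
    return TARGET[min(len(tokens), len(TARGET) - 1)]


def draft_next(tokens):
    """Stand-in for the assistant: wrong at every third position."""
    return "???" if len(tokens) % 3 == 2 else main_next(tokens)


def greedy_decode(next_token, prompt, max_new_tokens, eos):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
        if tokens[-1] == eos:
            break
    return tokens


def speculative_decode(main, draft, prompt, max_new_tokens, eos, k=4):
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The assistant drafts up to k tokens ahead.
        drafted = []
        for _ in range(min(k, limit - len(tokens))):
            drafted.append(draft(tokens + drafted))
        # 2. The main model checks the draft position by position. A real
        #    implementation scores all positions in one forward pass, which
        #    is where the speed-up comes from; this loop is for clarity only.
        for proposed in drafted:
            expected = main(tokens)
            tokens.append(expected)  # always keep the main model's token
            if expected == eos:
                return tokens
            if expected != proposed:
                break  # first mismatch: discard the rest of the draft
    return tokens


baseline = greedy_decode(main_next, [], 10, "<eos>")
speculative = speculative_decode(main_next, draft_next, [], 10, "<eos>")
print(speculative == baseline)  # True
```

The better the assistant agrees with the main model, the more drafted tokens are accepted per round, which is why a distilled model of the same family makes a good assistant.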