whisper-large-v3, incredibly fast, powered by Hugging Face Transformers! 🤗

## Model overview

`incredibly-fast-whisper` is an opinionated CLI tool built on the [OpenAI Whisper](https://huggingface.co/openai/whisper-large-v3) large-v3 model and designed for very fast audio transcription. Using Hugging Face Transformers, Optimum, and Flash Attention 2, it can transcribe 150 minutes of audio in less than 98 seconds, far faster than a standard Whisper large-v3 setup. The tool is part of a community-driven project started by [vaibhavs10](https://aimodels.fyi/creators/replicate/vaibhavs10) to showcase advanced Transformers optimizations.
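As a quick sanity check on the figures above, the implied speed-up over real time can be worked out directly. The snippet uses only the two numbers quoted in this section; the variable names are illustrative.

```python
# Figures quoted above: 150 minutes of audio transcribed in under 98 seconds.
audio_seconds = 150 * 60      # 9000 seconds of audio
wall_clock_seconds = 98       # upper bound on processing time

# Real-time factor: seconds of audio processed per second of compute.
speedup = audio_seconds / wall_clock_seconds
print(f"At least {speedup:.1f}x faster than real time")
```

In other words, the quoted benchmark corresponds to processing roughly 92 seconds of audio per second of compute.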

Related models include [whisperx](https://aimodels.fyi/models/replicate/whisperx-victor-upmeet) and [whisper-diarization](https://aimodels.fyi/models/replicate/whisper-diarization-thomasmol), which are also built on Whisper and offer their own features and optimizations for speech-to-text transcription. [metavoice](https://aimodels.fyi/models/replicate/metavoice-camenduru) is a related speech model that works in the opposite direction (text-to-speech).

## Model inputs and outputs

### Inputs
- **Audio file**: The primary input for the `incredibly-fast-whisper` model is an audio file, which can be provided as a local file path or a URL.
- **Task**: Either transcription (the default) or translation. Whisper's translation task translates the speech into English rather than into an arbitrary target language.
- **Language**: The language of the input audio. Leave it as "None" to let the model auto-detect the language.
- **Batch size**: The number of parallel batches to compute. Reduce it if you run into out-of-memory (OOM) errors.
- **Timestamp format**: The model can output timestamps at either the chunk or word level.
- **Diarization**: The model can use Pyannote.audio to perform speaker diarization, but this requires providing a Hugging Face API token.
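
The inputs above map fairly directly onto a Transformers `automatic-speech-recognition` pipeline call. The following is a minimal sketch, not the tool's actual source: `build_call_kwargs` and `transcribe` are hypothetical helper names, and the 30-second chunk length and the default batch size of 24 are assumptions chosen for illustration. Diarization is left out.

```python
from typing import Any, Dict, Optional


def build_call_kwargs(
    task: str = "transcribe",
    language: Optional[str] = None,   # None lets Whisper auto-detect
    batch_size: int = 24,             # assumed default; lower it on OOM
    timestamp: str = "chunk",
) -> Dict[str, Any]:
    """Translate the documented inputs into pipeline call arguments."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    if timestamp not in ("chunk", "word"):
        raise ValueError(f"unsupported timestamp format: {timestamp!r}")

    generate_kwargs: Dict[str, Any] = {"task": task}
    if language is not None:
        generate_kwargs["language"] = language

    return {
        "chunk_length_s": 30,  # assumed chunk length for long-form audio
        "batch_size": batch_size,
        "return_timestamps": "word" if timestamp == "word" else True,
        "generate_kwargs": generate_kwargs,
    }


def transcribe(audio: str, **options: Any) -> Dict[str, Any]:
    """Run the pipeline on a local path or URL (needs torch and transformers)."""
    import torch
    from transformers import pipeline

    use_gpu = torch.cuda.is_available()
    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16 if use_gpu else torch.float32,
        device="cuda:0" if use_gpu else "cpu",
    )
    return pipe(audio, **build_call_kwargs(**options))
```

For example, `transcribe("meeting.wav", task="translate", timestamp="word")` would request an English translation with word-level timestamps; `meeting.wav` is a placeholder file name.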

### Outputs
The primary output of the `incredibly-fast-whisper` model is a transcription of the input audio, which can be saved to a JSON file.
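
A minimal sketch of saving a result to JSON follows. It assumes the result has the shape returned by the Transformers speech-recognition pipeline with timestamps enabled (a `text` string plus a list of `chunks`); the helper name `save_transcript`, the output file name, and the sample data are made up for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_transcript(result: Dict[str, Any], path: str) -> None:
    """Write the transcription result to a UTF-8 JSON file."""
    with Path(path).open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-English transcripts readable.
        json.dump(result, fh, ensure_ascii=False, indent=2)


# Hand-written sample in the assumed output shape (not real model output).
sample = {
    "text": " Hello world.",
    "chunks": [{"timestamp": (0.0, 1.2), "text": " Hello world."}],
}

out_path = Path(tempfile.gettempdir()) / "transcript_example.json"
save_transcript(sample, str(out_path))
```

Note that JSON has no tuple type, so each `timestamp` pair is stored as a two-element list.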

## Capabilities

`incredibly-fast-whisper` gets its speed from optimizations to how the model is executed, notably [Flash Attention 2](https://github.com/Dao-AILab/flash-attention) and [BetterTransformer](https://github.com/huggingface/optimum/tree/main/optimum/bettertransformer). These let it transcribe much faster than a standard Whisper large-v3 setup. Because the underlying model is the same, transcription accuracy should stay close to that of the base model.
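
In Transformers, the attention backend is chosen when the model is loaded. The sketch below shows one way to prefer Flash Attention 2 and fall back otherwise. It is an illustration under stated assumptions rather than the tool's own code: `pick_attention_kwargs` is a hypothetical helper, the presence of the `flash_attn` package is used as a rough availability check, and PyTorch's built-in `sdpa` attention stands in for the BetterTransformer fallback.

```python
import importlib.util
from typing import Dict, Optional


def pick_attention_kwargs(flash_available: Optional[bool] = None) -> Dict[str, str]:
    """Return `model_kwargs` that select an attention implementation."""
    if flash_available is None:
        # Rough check: is the flash-attn package importable at all?
        flash_available = importlib.util.find_spec("flash_attn") is not None
    implementation = "flash_attention_2" if flash_available else "sdpa"
    return {"attn_implementation": implementation}


# Possible usage with a Transformers pipeline (Flash Attention 2 needs a CUDA GPU):
#   pipe = pipeline("automatic-speech-recognition",
#                   model="openai/whisper-large-v3",
#                   torch_dtype=torch.float16, device="cuda:0",
#                   model_kwargs=pick_attention_kwargs())
```

Passing `flash_available` explicitly makes it easy to time the same audio with and without Flash Attention 2.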

## What can I use it for?

`incredibly-fast-whisper` is well-suited for applications that need fast turnaround on transcription, such as podcast production, meeting transcription, or near-real-time captioning of audio that is processed in short segments. Its speed matters most when dealing with large amounts of audio data.

## Things to try

One interesting feature of `incredibly-fast-whisper` is its support for the `distil-whisper/large-v2` checkpoint, a distilled (smaller and faster) version of Whisper. Users can experiment with this checkpoint to find the right balance between speed and accuracy for their specific use case.
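
One way to compare checkpoints is to time the same audio file against each of them. The harness below is a sketch: `time_checkpoints` is a hypothetical helper, and the `transcribe` callable is something you would supply yourself (for example, a function that loads a Transformers pipeline for the given checkpoint and runs it). The checkpoint names are the ones mentioned in this document, and `fake_transcribe` is only a stand-in so the example runs without downloading a model.

```python
import time
from typing import Callable, Dict, Iterable


def time_checkpoints(
    checkpoints: Iterable[str],
    transcribe: Callable[[str, str], str],
    audio: str,
) -> Dict[str, float]:
    """Return wall-clock seconds taken per checkpoint for one audio file."""
    timings: Dict[str, float] = {}
    for name in checkpoints:
        start = time.perf_counter()
        transcribe(name, audio)  # caller-supplied: (checkpoint, audio) -> text
        timings[name] = time.perf_counter() - start
    return timings


def fake_transcribe(checkpoint: str, audio: str) -> str:
    """Stand-in that returns a dummy transcript without loading any model."""
    return f"[{checkpoint}] transcript of {audio}"


results = time_checkpoints(
    ["openai/whisper-large-v3", "distil-whisper/large-v2"],
    fake_transcribe,
    "sample.wav",  # hypothetical file name
)
for name, seconds in results.items():
    print(f"{name}: {seconds:.4f}s")
```

If the supplied `transcribe` function also loads the model, the timings include load time, so it may be fairer to load each checkpoint once and time only the transcription call. Accuracy still has to be judged separately, for instance by comparing the transcripts against a reference.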

The Flash Attention 2 and BetterTransformer optimizations are also worth experimenting with. Running the same audio under different configurations of these optimizations shows how each one affects transcription speed and quality on your hardware.