Accelerated transcription, word-level timestamps and diarization with whisperX large-v3

## Model overview

`whisperx` is a speech transcription model developed by researchers at Upmeet. It builds upon OpenAI's Whisper model, adding features like accelerated transcription, word-level timestamps, and speaker diarization. Unlike the original Whisper, `whisperx` supports batching for faster processing of long-form audio. It also offers several model variants optimized for different hardware setups, including the [victor-upmeet/whisperx-a40-large](https://replicate.com/victor-upmeet/whisperx-a40-large) and [victor-upmeet/whisperx-a100-80gb](https://replicate.com/victor-upmeet/whisperx-a100-80gb) models.

## Model inputs and outputs

`whisperx` takes an audio file as input and generates a transcript with word-level timestamps and optional speaker diarization. It can handle a variety of audio formats and supports language detection and automatic transcription of multiple languages.

### Inputs
- **Audio File**: The audio file to be transcribed
- **Language**: The ISO code of the language spoken in the audio (optional, can be automatically detected)
- **VAD Onset/Offset**: Parameters for voice activity detection
- **Diarization**: Whether to assign speaker ID labels
- **Alignment**: Whether to align the transcript to get accurate word-level timestamps
- **Speaker Limits**: Minimum and maximum number of speakers for diarization

### Outputs
- **Detected Language**: The ISO code of the detected language
- **Segments**: The transcribed text, with word-level timestamps and optional speaker IDs

## Capabilities

`whisperx` provides fast and accurate speech transcription, with the ability to generate word-level timestamps and identify multiple speakers. It outperforms the original Whisper model in terms of transcription speed and timestamp accuracy, making it well-suited for use cases such as video captioning, podcast transcription, and meeting notes generation.

## What can I use it for?

`whisperx` can be used in a variety of applications that require accurate speech-to-text conversion, such as:

- **Video Captioning**: Generate captions for videos with precise timing and speaker identification.
- **Podcast Transcription**: Automatically transcribe podcasts and audio recordings with timestamps and diarization.
- **Meeting Notes**: Transcribe meetings and discussions, with the ability to attribute statements to individual speakers.
- **Voice Interfaces**: Integrate `whisperx` into voice-based applications and services for improved accuracy and responsiveness.

## Things to try

Consider experimenting with different model variants of `whisperx` to find the best fit for your hardware and use case. The [victor-upmeet/whisperx](https://replicate.com/victor-upmeet/whisperx) model is a good starting point, but the [victor-upmeet/whisperx-a40-large](https://replicate.com/victor-upmeet/whisperx-a40-large) and [victor-upmeet/whisperx-a100-80gb](https://replicate.com/victor-upmeet/whisperx-a100-80gb) models may be more suitable if you encounter memory issues when dealing with long audio files or when performing alignment and diarization.