⚡️ Fast audio transcription | whisper large-v3 | speaker diarization | word & sentence level timestamps | prompt | hotwords

## Model overview

`whisper-diarization` is a fast audio transcription model that combines OpenAI's Whisper Large v3 with speaker diarization from the pyannote.audio library. It produces transcripts with word-level timestamps and labels which speaker said what. Related models include [whisperx](https://aimodels.fyi/models/replicate/whisperx-victor-upmeet), which also targets speech-to-text, and [voicecraft](https://aimodels.fyi/models/replicate/voicecraft-cjwbw), which focuses on speech editing and text-to-speech rather than transcription. `whisper-diarization` emphasizes speed and ease of use.

## Model inputs and outputs

`whisper-diarization` accepts audio in one of three forms: a direct file URL, a Base64-encoded audio file, or a local audio file path. You can also pass a prompt containing relevant vocabulary to improve transcription accuracy. The model returns a list of speaker segments with start and end times, the number of speakers it detected, and the detected language.

### Inputs
- `file_string`: Base64-encoded audio file
- `file_url`: Direct URL to an audio file
- `file`: Local audio file path
- `prompt`: Vocabulary to improve transcription accuracy
- `group_segments`: Option to group short segments from the same speaker
- `num_speakers`: Specify the number of speakers (leave empty to autodetect)
- `language`: Language of the spoken words (leave empty to autodetect)
- `offset_seconds`: Offset in seconds for chunked inputs
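
The parameters above can be assembled into a single input dictionary before calling the model. The following is a minimal sketch, not the model's official client code: `build_input` is a hypothetical helper, the default values (`group_segments=True`, `offset_seconds=0`) are assumptions, and it only builds the payload without sending it anywhere.

```python
import base64
from typing import Optional


def build_input(
    file_url: Optional[str] = None,
    audio_bytes: Optional[bytes] = None,
    prompt: Optional[str] = None,
    num_speakers: Optional[int] = None,
    language: Optional[str] = None,
    group_segments: bool = True,   # assumed default
    offset_seconds: int = 0,       # assumed default
) -> dict:
    """Assemble an input dictionary using the parameter names listed above."""
    # Require exactly one audio source so the request is unambiguous.
    if (file_url is None) == (audio_bytes is None):
        raise ValueError("provide exactly one of file_url or audio_bytes")

    payload = {"group_segments": group_segments, "offset_seconds": offset_seconds}
    if file_url is not None:
        payload["file_url"] = file_url
    else:
        # file_string carries the audio content as Base64 text.
        payload["file_string"] = base64.b64encode(audio_bytes).decode("ascii")

    # Optional fields are omitted when unset so the model can autodetect them.
    if prompt:
        payload["prompt"] = prompt
    if num_speakers is not None:
        payload["num_speakers"] = num_speakers
    if language:
        payload["language"] = language
    return payload
```

Leaving `num_speakers` and `language` out of the payload, rather than sending empty values, mirrors the "leave empty to autodetect" behavior described in the list.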

### Outputs
- `segments`: List of speaker segments with start/end times, average log probability, and word-level probabilities
- `num_speakers`: Number of detected speakers
- `language`: Detected language of the spoken words
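
To show how this output might be consumed, here is a small sketch that totals speaking time per speaker. The field names inside each segment (`speaker`, `start`, `end`, `text`) and the `SPEAKER_00` label format are assumptions for illustration; check the actual response for the exact keys.

```python
from collections import defaultdict


def speaking_time(output: dict) -> dict:
    """Sum the duration of every segment, grouped by speaker label."""
    totals = defaultdict(float)
    for seg in output["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


# Hand-written sample in the assumed output shape (not real model output).
sample_output = {
    "segments": [
        {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Hello there."},
        {"speaker": "SPEAKER_01", "start": 2.5, "end": 4.0, "text": "Hi."},
        {"speaker": "SPEAKER_00", "start": 4.0, "end": 5.0, "text": "Shall we start?"},
    ],
    "num_speakers": 2,
    "language": "en",
}

print(speaking_time(sample_output))
```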

## Capabilities

`whisper-diarization` is built for fast, accurate audio transcription. Because it relies on Whisper Large v3, which is a multilingual model, it can transcribe many languages and autodetect the spoken language, and it tends to remain usable on noisy recordings. Speaker identification combined with word-level timestamps makes it well suited to tasks such as transcribing meeting recordings or producing podcasts.

## What can I use it for?

`whisper-diarization` suits any application that needs accurate speech-to-text conversion with speaker attribution. Some potential use cases include:

- **Meeting and interview transcription**: Quickly generate transcripts with speaker attribution for remote or in-person meetings, interviews, and conferences.
- **Podcast and audio production**: Streamline the podcast production workflow by automatically generating transcripts and identifying different speakers.
- **Accessibility and subtitling**: Provide accurate, time-stamped captions for videos and audio content to improve accessibility.
- **Market research and customer service**: Analyze audio recordings of customer calls or focus groups to extract insights and improve product or service offerings.
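
As an illustration of the subtitling use case, the sketch below converts speaker segments into SubRip (SRT) caption text. The segment keys (`speaker`, `start`, `end`, `text`) are assumed names, and both functions are hypothetical helpers rather than part of the model.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render each segment as a numbered SRT cue prefixed with its speaker."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"[{seg['speaker']}] {seg['text'].strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)
```

For long segments, captions usually read better when split into shorter cues; the word-level timestamps in the output could drive that split.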

## Things to try

Because `whisper-diarization` handles multiple speakers and returns word-level timestamps, it is a good fit for work that depends on speaker segmentation, such as conversation analysis or audio captioning. Try toggling `group_segments`, which groups short segments from the same speaker, and compare setting `num_speakers` explicitly against leaving it empty for autodetection, to see how each choice affects results on different kinds of audio.
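
To build intuition for what grouping segments by speaker means, the sketch below merges consecutive segments that share a speaker and are separated by only a short pause. This is an illustration of the general idea, not the model's actual grouping logic: the `max_gap` threshold and the segment keys (`speaker`, `start`, `end`, `text`) are assumptions.

```python
def merge_same_speaker(segments: list, max_gap: float = 2.0) -> list:
    """Merge adjacent segments from one speaker when the pause is <= max_gap seconds."""
    merged = []
    for seg in segments:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["speaker"] == seg["speaker"]
            and seg["start"] - prev["end"] <= max_gap
        ):
            # Extend the previous segment instead of starting a new one.
            prev["end"] = seg["end"]
            prev["text"] = f"{prev['text']} {seg['text']}".strip()
        else:
            merged.append(dict(seg))  # copy so the input list is not mutated
    return merged
```

Running this on ungrouped output and comparing it with the model's grouped output is one way to see how much the grouping changes segment counts and lengths.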

Another area to explore is the use of the `prompt` parameter to improve transcription accuracy. By providing relevant vocabulary, acronyms, or proper names, you can potentially boost the model's performance on domain-specific content, such as technical jargon or industry-specific terminology.
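
One way to prepare such a prompt is to collect domain terms, drop duplicates, and join them into a single string. The helper below is a hypothetical sketch: the comma-separated format and the `max_chars` budget are assumptions, not documented requirements of the `prompt` parameter.

```python
def build_prompt(terms: list, max_chars: int = 500) -> str:
    """Join unique, non-empty terms with commas, staying within max_chars."""
    seen = set()
    kept = []
    length = 0
    for term in terms:
        term = term.strip()
        key = term.lower()
        if not term or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        extra = len(term) + (2 if kept else 0)  # account for the ", " separator
        if length + extra > max_chars:
            break
        seen.add(key)
        kept.append(term)
        length += extra
    return ", ".join(kept)


print(build_prompt(["Pyannote", "pyannote", " Whisper ", "", "diarization"]))
```

Putting the most important terms first means they survive if the budget cuts the list short.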