speaker-transcription

Maintainer: meronym

Total Score: 20

Last updated: 6/21/2024

  • Model Link: View on Replicate
  • API Spec: View on Replicate
  • Github Link: View on Github
  • Paper Link: No paper link provided


Model overview

The speaker-transcription model is a powerful AI system that combines speaker diarization and speech transcription capabilities. It was developed by Meronym, a creator on the Replicate platform. This model builds upon two main components: the pyannote.audio speaker diarization pipeline and OpenAI's whisper model for general-purpose English speech transcription.

Compared with similar models like whisper-diarization and whisperx, the speaker-transcription model aims to provide more accurate speaker segmentation and identification together with high-quality transcription. It is particularly useful for tasks that require both speaker information and verbatim transcripts, such as interview analysis, podcast processing, or meeting recordings.

Model inputs and outputs

The speaker-transcription model takes an audio file as input and can optionally accept a prompt string to guide the transcription. The model outputs a JSON file containing the transcribed segments, with each segment associated with a speaker label and timestamps.

Inputs

  • Audio: An audio file in a supported format, such as MP3, AAC, FLAC, OGG, OPUS, or WAV.
  • Prompt (optional): A text prompt that can be used to provide additional context for the transcription.

Outputs

  • JSON file: A JSON file with the following structure:
    • segments: A list of transcribed segments, each with a speaker label, start and stop timestamps, and the segment transcript.
    • speakers: Information about the detected speakers, including the total count, labels for each speaker, and embeddings (a vector representation of each speaker's voice).
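
As a rough sketch of how this input/output contract might be exercised, the snippet below calls the model through the Replicate Python client and walks the returned JSON. The unversioned model reference, the output being a URL to the JSON file, and the field names (segments, speaker, start, stop, transcript, speakers) are assumptions inferred from the description above, not documented values; check the API spec linked from the model page before relying on them.

```python
import json
import urllib.request

import replicate  # assumes the official Replicate Python client and a REPLICATE_API_TOKEN env var

# Hypothetical call; in practice, pin the exact model version shown on the Replicate page.
output = replicate.run(
    "meronym/speaker-transcription",
    input={
        "audio": open("interview.mp3", "rb"),            # hypothetical local file
        "prompt": "Podcast interview about speech AI.",  # optional context prompt
    },
)

# Assuming the output is a URL pointing at the JSON file described above;
# if the client already returns a parsed dict, skip the download step.
with urllib.request.urlopen(str(output)) as resp:
    result = json.load(resp)

# Field names below are inferred from the output description - verify them against the API spec.
for seg in result["segments"]:
    print(f'{seg["speaker"]} [{seg["start"]} - {seg["stop"]}]: {seg["transcript"]}')

print("Speakers detected:", result["speakers"]["count"])
```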

Capabilities

The speaker-transcription model excels at accurately identifying and labeling different speakers within an audio recording, while also providing high-quality transcripts of the spoken content. This makes it a valuable tool for a variety of applications, such as interview analysis, podcast processing, or meeting recordings.

What can I use it for?

The speaker-transcription model can be used for data augmentation and segmentation tasks, where the speaker information and timestamps can be used to improve the accuracy and effectiveness of transcription and captioning models. Additionally, the speaker embeddings generated by the model can be used for speaker recognition, allowing you to match voice profiles against a database of known speakers.
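
For the speaker-recognition idea, a minimal sketch of matching returned embeddings against a small database of enrolled voice profiles might look like the following. The embedding dimensionality (192, borrowed from the related speaker-diarization model) and the 0.75 similarity threshold are illustrative assumptions, not values documented for this model.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical database of enrolled voice profiles (random vectors as stand-ins).
known_speakers = {
    "alice": np.random.rand(192),
    "bob": np.random.rand(192),
}

def identify(embedding, threshold=0.75):
    """Return the best-matching enrolled speaker, or None if no match is close enough."""
    scores = {name: cosine_similarity(embedding, ref) for name, ref in known_speakers.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```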

Things to try

One interesting aspect of the speaker-transcription model is the ability to use a prompt to guide the transcription. By providing additional context about the topic or subject matter, you can potentially improve the accuracy and relevance of the transcripts. Try experimenting with different prompts to see how they affect the output.

Another useful feature is the generation of speaker embeddings, which can be used for speaker recognition and identification tasks. Consider exploring ways to leverage these embeddings, such as building a speaker verification system or clustering speakers in large audio datasets.
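
One way to cluster speaker embeddings across a large audio dataset is agglomerative clustering on cosine distance. The sketch below uses SciPy with average linkage and a distance threshold of 0.5; both the threshold and the random stand-in data are illustrative choices rather than values recommended by the model.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical stack of speaker embeddings collected from many recordings,
# one row per detected speaker (random data used here as a stand-in).
embeddings = np.random.rand(50, 192)

# Average-linkage agglomerative clustering on cosine distance.
distances = pdist(embeddings, metric="cosine")
tree = linkage(distances, method="average")
cluster_ids = fcluster(tree, t=0.5, criterion="distance")  # threshold is illustrative

print("Distinct voices found:", len(set(cluster_ids)))
```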



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


speaker-diarization

Maintainer: meronym

Total Score: 20

The speaker-diarization model from Replicate creator meronym is a tool that segments an audio recording based on who is speaking. It is built using the open-source pyannote.audio library, which provides a set of trainable end-to-end neural building blocks for speaker diarization. This model is similar to other speaker diarization models available, such as speaker-diarization and whisper-diarization, which also leverage the pyannote.audio library. However, the speaker-diarization model from meronym specifically uses a pre-trained pipeline that combines speaker segmentation, embedding, and clustering to identify individual speakers within the audio.

Model inputs and outputs

Inputs

  • audio: The audio file to be processed, in a supported format such as MP3, AAC, FLAC, OGG, OPUS, or WAV.

Outputs

The model outputs a JSON file with the following structure:

  • segments: A list of diarization segments, each with a speaker label, start time, and end time.
  • speakers: An object containing the number of detected speakers, their labels, and 192-dimensional speaker embedding vectors.

Capabilities

The speaker-diarization model is capable of automatically identifying individual speakers within an audio recording, even in cases where there is overlapping speech. It can handle a variety of audio formats and sample rates, and provides both segmentation information and speaker embeddings as output.

What can I use it for?

This model can be useful for a variety of applications, such as:

  • Data Augmentation: The speaker diarization output can be used to enhance transcription and captioning tasks by providing speaker-level segmentation.
  • Speaker Recognition: The speaker embeddings generated by the model can be used to match against a database of known speakers, enabling speaker identification and verification.
  • Meeting and Interview Analysis: The speaker diarization output can be used to analyze meeting recordings or interviews, providing insights into speaker participation, turn-taking, and interaction patterns.

Things to try

One interesting aspect of the speaker-diarization model is its ability to handle overlapping speech. You could experiment with audio files that contain multiple speakers talking simultaneously, and observe how the model segments and labels the different speakers. Additionally, you could explore the use of the speaker embeddings for tasks like speaker clustering or identification, and see how the model's performance compares to other approaches.
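
As one concrete way to use the segment output for meeting analysis, the sketch below totals speaking time per speaker. The field names (speaker, start, end) follow the structure described above but are assumptions; adjust them to match the actual JSON.

```python
from collections import defaultdict

def talk_time(segments):
    """Total speaking time (seconds) per speaker label.

    Assumes each segment is a dict with "speaker", "start", and "end" keys in
    seconds; adjust the keys to match the actual JSON output.
    """
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)

# Example with made-up segments:
print(talk_time([
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 12.4},
    {"speaker": "SPEAKER_01", "start": 12.4, "end": 20.1},
    {"speaker": "SPEAKER_00", "start": 20.1, "end": 25.0},
]))
```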



whisper-diarization

Maintainer: thomasmol

Total Score: 290

whisper-diarization is a fast audio transcription model that combines the powerful Whisper Large v3 model with speaker diarization from the Pyannote audio library. This model provides accurate transcription with word-level timestamps and the ability to identify different speakers in the audio. Similar models like whisperx and voicecraft also offer advanced speech-to-text capabilities, but whisper-diarization stands out with its speed and ease of use.

Model inputs and outputs

whisper-diarization takes in audio data in various formats, including a direct file URL, a Base64 encoded audio file, or a local audio file path. Users can also provide a prompt containing relevant vocabulary to improve transcription accuracy. The model outputs a list of speaker segments with start and end times, the detected number of speakers, and the language of the spoken words.

Inputs

  • file_string: Base64 encoded audio file
  • file_url: Direct URL to an audio file
  • file: Local audio file path
  • prompt: Vocabulary to improve transcription accuracy
  • group_segments: Option to group short segments from the same speaker
  • num_speakers: Specify the number of speakers (leave empty to autodetect)
  • language: Language of the spoken words (leave empty to autodetect)
  • offset_seconds: Offset in seconds for chunked inputs

Outputs

  • segments: List of speaker segments with start/end times, average log probability, and word-level probabilities
  • num_speakers: Number of detected speakers
  • language: Detected language of the spoken words

Capabilities

whisper-diarization excels at fast and accurate audio transcription, even in noisy or multilingual environments. The model's ability to identify different speakers and provide word-level timestamps makes it a powerful tool for a variety of applications, from meeting recordings to podcast production.

What can I use it for?

whisper-diarization can be used in many industries and applications that require accurate speech-to-text conversion and speaker identification. Some potential use cases include:

  • Meeting and interview transcription: Quickly generate transcripts with speaker attribution for remote or in-person meetings, interviews, and conferences.
  • Podcast and audio production: Streamline the podcast production workflow by automatically generating transcripts and identifying different speakers.
  • Accessibility and subtitling: Provide accurate, time-stamped captions for videos and audio content to improve accessibility.
  • Market research and customer service: Analyze audio recordings of customer calls or focus groups to extract insights and improve product or service offerings.

Things to try

One interesting aspect of whisper-diarization is its ability to handle multiple speakers and provide word-level timestamps. This can be particularly useful for applications that require speaker segmentation, such as conversation analysis or audio captioning. You could experiment with the group_segments and num_speakers parameters to see how they affect the model's performance on different types of audio content.

Another area to explore is the use of the prompt parameter to improve transcription accuracy. By providing relevant vocabulary, acronyms, or proper names, you can potentially boost the model's performance on domain-specific content, such as technical jargon or industry-specific terminology.
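
Because the model accepts audio as a Base64 string via file_string, preparing that input might look like the sketch below. The file name, prompt vocabulary, and parameter values are made up for illustration; pass the resulting inputs dict to whichever client you use to call the model.

```python
import base64

# Hypothetical local recording; any supported audio format should work.
with open("standup-meeting.m4a", "rb") as f:
    file_string = base64.b64encode(f.read()).decode("utf-8")

inputs = {
    "file_string": file_string,
    "prompt": "Acme Corp, sprint review, Kubernetes",  # illustrative domain vocabulary
    "num_speakers": 3,        # omit to autodetect
    "group_segments": True,   # merge short segments from the same speaker
    "language": "en",         # omit to autodetect
}
# Pass `inputs` to whichever client you use to call the model.
```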



whisper

Maintainer: openai

Total Score: 13.8K

Whisper is a general-purpose speech recognition model developed by OpenAI. It is capable of converting speech in audio to text, with the ability to translate the text to English if desired. Whisper is based on a large Transformer model trained on a diverse dataset of multilingual and multitask speech recognition data. This allows the model to handle a wide range of accents, background noises, and languages. Similar models like whisper-large-v3, incredibly-fast-whisper, and whisper-diarization offer various optimizations and additional features built on top of the core Whisper model.

Model inputs and outputs

Whisper takes an audio file as input and outputs a text transcription. The model can also translate the transcription to English if desired. The input audio can be in various formats, and the model supports a range of parameters to fine-tune the transcription, such as temperature, patience, and language.

Inputs

  • Audio: The audio file to be transcribed
  • Model: The specific version of the Whisper model to use; currently only large-v3 is supported
  • Language: The language spoken in the audio, or None to perform language detection
  • Translate: A boolean flag to translate the transcription to English
  • Transcription: The format for the transcription output, such as "plain text"
  • Initial Prompt: An optional initial text prompt to provide to the model
  • Suppress Tokens: A list of token IDs to suppress during sampling
  • Logprob Threshold: The minimum average log probability threshold for a successful transcription
  • No Speech Threshold: The threshold for considering a segment as silence
  • Condition on Previous Text: Whether to provide the previous output as a prompt for the next window
  • Compression Ratio Threshold: The maximum compression ratio threshold for a successful transcription
  • Temperature Increment on Fallback: The temperature increase when the decoding fails to meet the specified thresholds

Outputs

  • Transcription: The text transcription of the input audio
  • Language: The detected language of the audio (if the language input is None)
  • Tokens: The token IDs corresponding to the transcription
  • Timestamp: The start and end timestamps for each word in the transcription
  • Confidence: The confidence score for each word in the transcription

Capabilities

Whisper is a powerful speech recognition model that can handle a wide range of accents, background noises, and languages. The model is capable of accurately transcribing audio and optionally translating the transcription to English. This makes Whisper useful for a variety of applications, such as real-time captioning, meeting transcription, and audio-to-text conversion.

What can I use it for?

Whisper can be used in various applications that require speech-to-text conversion, such as:

  • Captioning and subtitling: Automatically generate captions or subtitles for videos, improving accessibility for viewers.
  • Meeting transcription: Transcribe audio recordings of meetings, interviews, or conferences for easy review and sharing.
  • Podcast transcription: Convert audio podcasts to text, making the content more searchable and accessible.
  • Language translation: Transcribe audio in one language and translate the text to another, enabling cross-language communication.
  • Voice interfaces: Integrate Whisper into voice-controlled applications, such as virtual assistants or smart home devices.

Things to try

One interesting aspect of Whisper is its ability to handle a wide range of languages and accents. You can experiment with the model's performance on audio samples in different languages or with various background noises to see how it handles different real-world scenarios. Additionally, you can explore the impact of the different input parameters, such as temperature, patience, and language detection, on the transcription quality and accuracy.
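
For the captioning and subtitling use case mentioned above, a minimal sketch of turning timestamped output into SRT captions is shown below. The segment shape (dicts with start, end, and text) is an assumption for illustration; the actual Whisper output schema should be checked against the API spec before reusing this.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 83.45 -> '00:01:23,450'."""
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Build an SRT document from segments shaped like
    {"start": float, "end": float, "text": str} (assumed shape)."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cue = f'{i}\n{srt_timestamp(seg["start"])} --> {srt_timestamp(seg["end"])}\n{seg["text"].strip()}\n'
        cues.append(cue)
    return "\n".join(cues)

# Example with made-up segments:
print(segments_to_srt([
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": "Today we talk about speech recognition."},
]))
```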



whisperx

Maintainer: daanelson

Total Score: 39

whisperx is a Cog implementation of the WhisperX library, which adds batch processing on top of the popular Whisper speech recognition model. This allows for very fast audio transcription compared to the original Whisper model. whisperx is developed and maintained by daanelson. Similar models include whisperx-victor-upmeet, which provides accelerated transcription, word-level timestamps, and diarization with the Whisper large-v3 model, and whisper-diarization-thomasmol, which offers fast audio transcription, speaker diarization, and word-level timestamps.

Model inputs and outputs

whisperx takes an audio file as input, along with optional parameters to control the batch size, whether to output only the transcribed text or include segment metadata, and whether to print out memory usage information for debugging purposes.

Inputs

  • audio: The audio file to be transcribed
  • batch_size: The number of audio segments to process in parallel for faster transcription
  • only_text: A boolean flag to return only the transcribed text, without segment metadata
  • align_output: A boolean flag to generate word-level timestamps (currently only works for English)
  • debug: A boolean flag to print out memory usage information

Outputs

  • The transcribed text, optionally with segment-level metadata

Capabilities

whisperx builds on the strong speech recognition capabilities of the Whisper model, providing accelerated transcription through batch processing. This can be particularly useful for transcribing long audio files or processing multiple audio files in parallel.

What can I use it for?

whisperx can be used for a variety of applications that require fast and accurate speech-to-text transcription, such as podcast production, video captioning, or meeting minutes generation. The ability to process audio in batches and the option to output only the transcribed text can make the model well-suited for high-volume or real-time transcription scenarios.

Things to try

One interesting aspect of whisperx is the ability to generate word-level timestamps, which can be useful for applications like video editing or language learning. You can experiment with the align_output parameter to see how this feature performs on your audio files. Another thing to try is leveraging the batch processing capabilities of whisperx to transcribe multiple audio files in parallel, which can significantly reduce the overall processing time for large-scale transcription tasks.
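
A rough sketch of the parallel-transcription idea is shown below: the unversioned "daanelson/whisperx" reference, the file names, and the batch_size value are illustrative assumptions, and in practice you would pin the exact model version from the Replicate page.

```python
from concurrent.futures import ThreadPoolExecutor

import replicate  # assumes the Replicate Python client and a REPLICATE_API_TOKEN env var

def transcribe_one(path: str):
    # Hypothetical model reference; pin the exact version hash from the model page.
    with open(path, "rb") as f:
        return replicate.run("daanelson/whisperx", input={"audio": f, "batch_size": 16})

audio_files = ["ep01.mp3", "ep02.mp3", "ep03.mp3"]  # hypothetical file names

# Each call already batches segments server-side via batch_size; this pool simply
# keeps several requests in flight so a large backlog of files finishes sooner.
with ThreadPoolExecutor(max_workers=4) as pool:
    transcripts = dict(zip(audio_files, pool.map(transcribe_one, audio_files)))
```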
