Meronym

Models by this creator


speaker-transcription

meronym

Total Score: 20

The speaker-transcription model is a powerful AI system that combines speaker diarization and speech transcription capabilities. It was developed by Meronym, a creator on the Replicate platform, and builds on two main components: the pyannote.audio speaker diarization pipeline and OpenAI's Whisper model for general-purpose English speech transcription. The speaker-transcription model improves on similar models like whisper-diarization and whisperx by providing more accurate speaker segmentation and identification as well as high-quality transcription. It is particularly useful for tasks that require both speaker information and verbatim transcripts, such as interview analysis, podcast processing, or meeting recordings.

Model inputs and outputs

The speaker-transcription model takes an audio file as input and can optionally accept a prompt string to guide the transcription. It outputs a JSON file containing the transcribed segments, with each segment associated with a speaker label and timestamps.

Inputs

- **Audio**: An audio file in a supported format, such as MP3, AAC, FLAC, OGG, OPUS, or WAV.
- **Prompt (optional)**: A text prompt that provides additional context for the transcription.

Outputs

- **JSON file**: A JSON file with the following structure:
  - segments: A list of transcribed segments, each with a speaker label, start and stop timestamps, and the segment transcript.
  - speakers: Information about the detected speakers, including the total count, labels for each speaker, and embeddings (a vector representation of each speaker's voice).

Capabilities

The speaker-transcription model excels at accurately identifying and labeling the different speakers in an audio recording while also producing high-quality transcripts of the spoken content, which makes it a valuable tool for the interview, podcast, and meeting use cases above.

What can I use it for?

The speaker-transcription model can be used for data augmentation and segmentation tasks, where the speaker information and timestamps help improve the accuracy and effectiveness of transcription and captioning models. Additionally, the speaker embeddings it generates can be used for speaker recognition, allowing you to match voice profiles against a database of known speakers (an embedding-matching sketch appears below).

Things to try

One interesting aspect of the speaker-transcription model is the ability to use a prompt to guide the transcription. By providing additional context about the topic or subject matter, you can potentially improve the accuracy and relevance of the transcripts, so try experimenting with different prompts and compare the output. Another useful feature is the generation of speaker embeddings, which can support speaker recognition and identification tasks. Consider leveraging these embeddings to build a speaker verification system or to cluster speakers in large audio datasets.
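
To make the input/output description concrete, here is a minimal sketch of calling the model through Replicate's Python client and printing the transcribed segments. The model identifier is the one shown on this page, but the output handling and the segment field names (speaker, start, stop, transcript) are assumptions inferred from the structure described above, not a verified API contract:

```python
# Minimal sketch: run meronym/speaker-transcription on Replicate and
# print one line per transcribed segment. Assumes the model returns a
# URL to the JSON file described above; field names are taken from this
# page and should be checked against an actual response.
import json
import urllib.request

import replicate  # pip install replicate; requires REPLICATE_API_TOKEN

output = replicate.run(
    "meronym/speaker-transcription",  # pin a specific version in practice
    input={
        "audio": open("interview.mp3", "rb"),
        "prompt": "A two-person interview about machine learning.",  # optional
    },
)

# Hypothetical handling: fetch and parse the JSON file the model points to.
result = json.loads(urllib.request.urlopen(str(output)).read())

for seg in result["segments"]:
    print(f'{seg["speaker"]} [{seg["start"]}-{seg["stop"]}]: {seg["transcript"]}')
```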

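And to illustrate the speaker-recognition use case, here is a hedged sketch of matching an embedding from the model's speakers output against a small set of known voice profiles with cosine similarity. The embedding size (192), the threshold, and the known_speakers database are illustrative assumptions:

```python
# Sketch: match an unknown speaker embedding against known voice profiles
# using cosine similarity. The 192-dim size and the profile database are
# made-up stand-ins, not part of the model's documented API.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(embedding, known_speakers, threshold=0.7):
    """Return (name, score) for the best match, or (None, threshold)."""
    best_name, best_score = None, threshold
    for name, ref in known_speakers.items():
        score = cosine_similarity(np.asarray(embedding), np.asarray(ref))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Illustrative usage with random vectors standing in for real embeddings:
rng = np.random.default_rng(0)
db = {"alice": rng.normal(size=192), "bob": rng.normal(size=192)}
print(identify(rng.normal(size=192), db))
```

In practice you would enroll each known speaker by averaging embeddings taken from a few clean recordings of their voice before matching against them.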

Updated 5/29/2024


speaker-diarization

meronym

Total Score: 16

The speaker-diarization model from Replicate creator meronym segments an audio recording based on who is speaking. It is built with the open-source pyannote.audio library, which provides a set of trainable end-to-end neural building blocks for speaker diarization. It is similar to other speaker diarization models, such as speaker-diarization and whisper-diarization, which also leverage pyannote.audio; however, meronym's model uses a pre-trained pipeline that combines speaker segmentation, embedding, and clustering to identify the individual speakers within the audio.

Model inputs and outputs

Inputs

- **audio**: The audio file to be processed, in a supported format such as MP3, AAC, FLAC, OGG, OPUS, or WAV.

Outputs

The model outputs a JSON file with the following structure:

- segments: A list of diarization segments, each with a speaker label, start time, and end time.
- speakers: An object containing the number of detected speakers, their labels, and 192-dimensional speaker embedding vectors.

Capabilities

The speaker-diarization model automatically identifies the individual speakers within an audio recording, even where their speech overlaps. It handles a variety of audio formats and sample rates, and it provides both segmentation information and speaker embeddings as output.

What can I use it for?

This model can be useful for a variety of applications, such as:

- **Data Augmentation**: The speaker diarization output can enhance transcription and captioning tasks by providing speaker-level segmentation.
- **Speaker Recognition**: The speaker embeddings generated by the model can be matched against a database of known speakers, enabling speaker identification and verification.
- **Meeting and Interview Analysis**: The speaker diarization output can be used to analyze meeting recordings or interviews, providing insight into speaker participation, turn-taking, and interaction patterns (the first sketch below computes such statistics).

Things to try

One interesting aspect of the speaker-diarization model is its ability to handle overlapping speech. You could experiment with audio files in which multiple speakers talk simultaneously and observe how the model segments and labels them. Additionally, you could explore using the speaker embeddings for tasks like speaker clustering or identification (the second sketch below shows one approach) and see how the model's performance compares to other methods.
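
As a concrete example of the meeting-analysis use case, here is a small sketch that computes per-speaker talk time and turn counts from the segments list in the model's output. The field names (speaker, start, stop) are assumptions inferred from this page's description of the JSON structure:

```python
# Sketch: summarize speaker participation from a diarization result.
# Segment field names follow this page's description of the output JSON
# and should be checked against an actual model response.
from collections import defaultdict

def participation(segments):
    """Return {speaker: {"talk_time": seconds, "turns": count}}."""
    stats = defaultdict(lambda: {"talk_time": 0.0, "turns": 0})
    for seg in segments:
        s = stats[seg["speaker"]]
        s["talk_time"] += float(seg["stop"]) - float(seg["start"])
        s["turns"] += 1
    return dict(stats)

# Illustrative usage with a hand-written result:
segments = [
    {"speaker": "A", "start": 0.0, "stop": 12.5},
    {"speaker": "B", "start": 12.5, "stop": 20.0},
    {"speaker": "A", "start": 20.0, "stop": 31.0},
]
for speaker, s in participation(segments).items():
    print(speaker, f'{s["talk_time"]:.1f}s over {s["turns"]} turns')
```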

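Following up on the clustering suggestion, the sketch below groups 192-dimensional speaker embeddings pooled from several recordings with scikit-learn's agglomerative clustering, so a recurring voice can share one label across files. The cosine-distance threshold of 0.7 is a made-up starting point you would tune on real data:

```python
# Sketch: cluster speaker embeddings pooled from many recordings so the
# same voice receives the same cluster id across files. The threshold is
# illustrative, not a recommended value.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_speakers(embeddings: np.ndarray):
    """embeddings: (n_speakers, 192) array pooled from all recordings."""
    model = AgglomerativeClustering(
        n_clusters=None,            # let the distance threshold decide
        distance_threshold=0.7,
        metric="cosine",            # use affinity= on scikit-learn < 1.2
        linkage="average",
    )
    return model.fit_predict(embeddings)

# Illustrative usage with random vectors standing in for real embeddings:
rng = np.random.default_rng(1)
emb = rng.normal(size=(6, 192))
print(cluster_speakers(emb))  # one cluster label per speaker embedding
```

Using a distance threshold instead of a fixed cluster count avoids having to know the total number of distinct speakers in advance, which is usually the case across a large audio dataset.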

Updated 5/29/2024