Using this open-source pipeline in production?  
Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).

Speaker diarization 3.1
=======================

This pipeline is the same as [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0) except it removes the [problematic](https://github.com/pyannote/pyannote-audio/issues/1537) use of `onnxruntime`.  
Both speaker segmentation and embedding now run in pure PyTorch. This should ease deployment and possibly speed up inference.  
It requires pyannote.audio version 3.1 or higher.

It ingests mono audio sampled at 16kHz and outputs speaker diarization as an [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance:

*   stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
*   audio files sampled at a different rate are resampled to 16kHz automatically upon loading.
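
The channel averaging mentioned above comes down to simple arithmetic. The following is a plain-Python sketch of that arithmetic only, not pyannote.audio's own loading code (which works on tensors); `downmix_to_mono` is a hypothetical helper name used for illustration.

```python
def downmix_to_mono(channels):
    """Average equal-length channels sample by sample (illustrative helper)."""
    if not channels:
        raise ValueError("expected at least one channel")
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("all channels must have the same number of samples")
    # zip(*channels) groups the samples that share the same time index
    return [sum(samples) / len(channels) for samples in zip(*channels)]


# Made-up stereo signal with three samples per channel
left = [0.5, -0.5, 1.0]
right = [0.5, 0.5, 0.0]
mono = downmix_to_mono([left, right])
print(mono)
```

Since the pipeline performs this step on loading, there is normally no need to downmix files yourself.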

Requirements
------------

1.  Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.1` with `pip install pyannote.audio`
2.  Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
3.  Accept [`pyannote/speaker-diarization-3.1`](https://hf.co/pyannote/speaker-diarization-3.1) user conditions
4.  Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).
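
Once the token from step 4 exists, one way to keep it out of source code is to read it from an environment variable. This is a minimal sketch under assumptions: the variable name `HF_TOKEN` and the helpers `read_hf_token` and `load_pipeline` are choices made for this example, not something the pipeline requires.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the access token stored in an environment variable (assumed name)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token")
    return token


def load_pipeline():
    """Instantiate the pipeline with the token read from the environment."""
    # Imported lazily so the helper above can be used without pyannote.audio installed
    from pyannote.audio import Pipeline

    return Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=read_hf_token(),
    )
```

Calling `load_pipeline()` then takes the place of hard-coding the token in the script.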

Usage
-----

    # instantiate the pipeline
    from pyannote.audio import Pipeline
    pipeline = Pipeline.from_pretrained(
      "pyannote/speaker-diarization-3.1",
      use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
    
    # run the pipeline on an audio file
    diarization = pipeline("audio.wav")
    
    # dump the diarization output to disk using RTTM format
    with open("audio.rttm", "w") as rttm:
        diarization.write_rttm(rttm)
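
RTTM is a plain-text format with one speaker turn per line, so the file written above can be read back without any special library. The sketch below assumes the standard RTTM field order (start time in the fourth field, duration in the fifth, speaker label in the eighth); `parse_rttm` and `speaking_time` are hypothetical helpers, and the example lines are made up for illustration.

```python
from collections import defaultdict


def parse_rttm(lines):
    """Turn RTTM lines into (start, end, speaker) tuples, times in seconds."""
    turns = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-speaker records
        start, duration, speaker = float(fields[3]), float(fields[4]), fields[7]
        turns.append((start, start + duration, speaker))
    return turns


def speaking_time(turns):
    """Sum the duration of all turns for each speaker."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Illustrative RTTM content, not real pipeline output
example = [
    "SPEAKER audio 1 0.000 2.500 <NA> <NA> SPEAKER_00 <NA> <NA>",
    "SPEAKER audio 1 2.000 1.500 <NA> <NA> SPEAKER_01 <NA> <NA>",
    "SPEAKER audio 1 4.000 1.000 <NA> <NA> SPEAKER_00 <NA> <NA>",
]
turns = parse_rttm(example)
totals = speaking_time(turns)
print(totals)
```

To process a real file, pass the open file object instead of the list: `parse_rttm(open("audio.rttm"))`.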
    

### Processing on GPU

`pyannote.audio` pipelines run on CPU by default. You can send them to GPU with the following lines:

    import torch
    pipeline.to(torch.device("cuda"))
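
The lines above fail on a machine without CUDA. A small guard, sketched here with a hypothetical helper `pick_device_name`, falls back to CPU when no GPU (or no PyTorch install) is found.

```python
def pick_device_name():
    """Return "cuda" when a CUDA device is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to move; report the CPU default
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device_name = pick_device_name()
print(device_name)
```

The result can then be passed on as `pipeline.to(torch.device(device_name))`.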
    

### Processing from memory

Pre-loading audio files in memory may result in faster processing:

    import torchaudio
    waveform, sample_rate = torchaudio.load("audio.wav")
    diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
    

### Monitoring progress

Hooks are available to monitor the progress of the pipeline:

    from pyannote.audio.pipelines.utils.hook import ProgressHook
    with ProgressHook() as hook:
        diarization = pipeline("audio.wav", hook=hook)
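
A custom hook can stand in for `ProgressHook` when progress should go to a log rather than a progress bar. The sketch below rests on an assumption: that the pipeline calls the hook with a step name, a step artifact, and optional `file`, `total` and `completed` keyword arguments, mirroring how `ProgressHook` is called. `StepLogger` is a hypothetical class and the step names in the demo calls are illustrative.

```python
class StepLogger:
    """Hypothetical hook that records the progress messages it receives."""

    def __init__(self):
        self.events = []

    def __call__(self, step_name, step_artifact, file=None, total=None, completed=None):
        # Assumed call signature; adjust if your pyannote.audio version differs
        if total is not None and completed is not None:
            self.events.append(f"{step_name}: {completed}/{total}")
        else:
            self.events.append(f"{step_name}: done")


# Simulate two calls the way a pipeline might issue them
logger = StepLogger()
logger("segmentation", None, total=4, completed=2)
logger("segmentation", object())
print(logger.events)
```

If the assumption holds, the logger is passed like any other hook: `pipeline("audio.wav", hook=logger)`.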
    

### Controlling the number of speakers

In case the number of speakers is known in advance, one can use the `num_speakers` option:

    diarization = pipeline("audio.wav", num_speakers=2)
    

One can also provide lower and/or upper bounds on the number of speakers using `min_speakers` and `max_speakers` options:

    diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
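
When these options come from user input or a config file, it can help to validate them before calling the pipeline. The helper below is a sketch: `speaker_options` is a hypothetical name, and rejecting `num_speakers` combined with bounds is a design choice of this example, not a documented requirement of the pipeline.

```python
def speaker_options(num_speakers=None, min_speakers=None, max_speakers=None):
    """Build the keyword arguments for the pipeline call after basic checks."""
    if num_speakers is not None:
        if min_speakers is not None or max_speakers is not None:
            raise ValueError("give either num_speakers or min/max bounds, not both")
        if num_speakers < 1:
            raise ValueError("num_speakers must be at least 1")
        return {"num_speakers": num_speakers}

    options = {}
    if min_speakers is not None:
        options["min_speakers"] = min_speakers
    if max_speakers is not None:
        options["max_speakers"] = max_speakers
    if any(value < 1 for value in options.values()):
        raise ValueError("speaker bounds must be at least 1")
    if len(options) == 2 and min_speakers > max_speakers:
        raise ValueError("min_speakers cannot exceed max_speakers")
    return options


print(speaker_options(min_speakers=2, max_speakers=5))
```

The returned dictionary is unpacked into the call: `pipeline("audio.wav", **speaker_options(min_speakers=2, max_speakers=5))`.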
    

Benchmark
---------

This pipeline has been benchmarked on a large collection of datasets.

Processing is fully automatic:

*   no manual voice activity detection (as is sometimes the case in the literature)
*   no manual number of speakers (though it is possible to provide it to the pipeline)
*   no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named _"Full"_ in [this paper](https://doi.org/10.1016/j.csl.2021.101254)):

*   no forgiveness collar
*   evaluation of overlapped speech

| Benchmark | DER% | FA% | Miss% | Conf% | Expected output | File-level evaluation |
| --- | --- | --- | --- | --- | --- | --- |
| [AISHELL-4](http://www.openslr.org/111/) | 12.2 | 3.8 | 4.4 | 4.0 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.eval) |
| [AliMeeting (_channel 1_)](https://www.openslr.org/119/) | 24.4 | 4.4 | 10.0 | 10.0 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.eval) |
| [AMI (_headset mix,_](https://groups.inf.ed.ac.uk/ami/corpus/) [_only\_words_)](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 18.8 | 3.6 | 9.5 | 5.7 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.eval) |
| [AMI (_array1, channel 1,_](https://groups.inf.ed.ac.uk/ami/corpus/) [_only\_words)_](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 22.4 | 3.8 | 11.2 | 7.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.eval) |
| [AVA-AVD](https://arxiv.org/abs/2111.14448) | 50.0 | 10.8 | 15.7 | 23.4 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.eval) |
| [DIHARD 3 (_Full_)](https://arxiv.org/abs/2012.01477) | 21.7 | 6.2 | 8.1 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.eval) |
| [MSDWild](https://x-lance.github.io/MSDWILD/) | 25.3 | 5.8 | 8.0 | 11.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.eval) |
| [REPERE (_phase 2_)](https://islrn.org/resources/360-758-359-485-0/) | 7.8 | 1.8 | 2.6 | 3.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.eval) |
| [VoxConverse (_v0.3_)](https://github.com/joonson/voxconverse) | 11.3 | 4.1 | 3.4 | 3.8 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.eval) |

DER%: diarization error rate. FA%: false alarm rate. Miss%: missed detection rate. Conf%: speaker confusion rate.
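
The diarization error rate is the sum of its three components: false alarm, missed detection and speaker confusion. The short check below reproduces the AISHELL-4 and VoxConverse figures from the reported components; `diarization_error_rate` is a hypothetical helper written for this illustration.

```python
def diarization_error_rate(false_alarm, missed_detection, confusion):
    """DER as the sum of its three error components, all in percent."""
    return false_alarm + missed_detection + confusion


# Component values taken from the benchmark results above
aishell4 = diarization_error_rate(3.8, 4.4, 4.0)
voxconverse = diarization_error_rate(4.1, 3.4, 3.8)
print(f"AISHELL-4: {aishell4:.1f}%  VoxConverse: {voxconverse:.1f}%")
```

On some rows the components sum to a value 0.1 away from the reported DER (for example 3.8 + 11.2 + 7.5 = 22.5 against 22.4 for AMI array1), presumably because each figure is rounded to one decimal separately.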

Citations
---------

    @inproceedings{Plaquet23,
      author={Alexis Plaquet and Herv{\'e} Bredin},
      title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
      year=2023,
      booktitle={Proc. INTERSPEECH 2023},
    }
    

    @inproceedings{Bredin23,
      author={Herv{\'e} Bredin},
      title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
      year=2023,
      booktitle={Proc. INTERSPEECH 2023},
    }

## Model overview

The `speaker-diarization-3.1` model is a pipeline developed by the `pyannote` team that performs speaker diarization on audio data. It is an updated version of the `speaker-diarization-3.0` model, removing the problematic use of `onnxruntime` and running the speaker segmentation and embedding entirely in PyTorch. This should ease deployment and potentially speed up inference.

The model takes in mono audio sampled at 16kHz and outputs speaker diarization as an `Annotation` instance. It can handle stereo or multi-channel audio by automatically downmixing to mono, and it can resample audio files to 16kHz upon loading.

For users integrating the pipeline into their applications, the practical difference from `speaker-diarization-3.0` is that `onnxruntime` is no longer involved, at the cost of requiring pyannote.audio version 3.1 or higher.

## Model inputs and outputs

### Inputs
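- **Mono audio sampled at 16kHz**: The pipeline expects single-channel audio sampled at 16kHz, given either as an audio file or as an in-memory dictionary with `waveform` and `sample_rate` keys. Stereo or multi-channel audio is downmixed to mono by averaging the channels, and audio sampled at a different rate is resampled to 16kHz upon loading.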

### Outputs
- **Speaker diarization**: The pipeline outputs a `pyannote.core.Annotation` instance containing the speaker diarization for the input audio.

## Capabilities

The `speaker-diarization-3.1` pipeline segments a recording into speaker turns and labels each turn with a speaker, including regions of overlapping speech and recordings where the number of speakers is not known in advance. It has been benchmarked on [AISHELL-4](http://www.openslr.org/111/), [AliMeeting](https://www.openslr.org/119/), [AMI](https://groups.inf.ed.ac.uk/ami/corpus/), [AVA-AVD](https://arxiv.org/abs/2111.14448), [DIHARD 3](https://arxiv.org/abs/2012.01477), [MSDWild](https://x-lance.github.io/MSDWILD/), [REPERE](https://islrn.org/resources/360-758-359-485-0/), and [VoxConverse](https://github.com/joonson/voxconverse), with no forgiveness collar and with overlapped speech evaluated. Reported DER ranges from 7.8% on REPERE to 50.0% on AVA-AVD, so the accuracy to expect depends strongly on the type of recording.

## What can I use it for?

The `speaker-diarization-3.1` model can be valuable for a variety of audio-based applications that require identifying and separating different speakers. Some potential use cases include:

- **Meeting transcription and analysis**: Automatically segmenting and labeling speakers in audio recordings of meetings, conferences, or interviews to facilitate post-processing and analysis.
- **Audio forensics and investigation**: Separating and identifying speakers in audio evidence to aid in investigations and legal proceedings.
- **Podcast and audio content production**: Streamlining the editing and post-production process for podcasts, audio books, and other multimedia content by automating speaker segmentation.
- **Conversational AI and voice assistants**: Helping voice-based systems attribute speech to the right speaker in multi-party conversations. The usage shown here processes complete recordings rather than live streams.

## Things to try

The pipeline lets you constrain the number of speakers it looks for. If you know the recording has a fixed number of speakers, set `num_speakers` to that value; if you only know a plausible range, set `min_speakers` and/or `max_speakers`. These options do not retrain or fine-tune anything; they only restrict the number of speakers in the output, which may improve accuracy when the constraint is correct.

The pipeline also accepts hooks for monitoring its progress, which is useful for long recordings or batch processing. Passing a `ProgressHook` shows how far along the pipeline is while it runs.