Using this open-source pipeline in production?  
Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).

Speaker diarization
===================

Relies on pyannote.audio 2.1.1: see [installation instructions](https://github.com/pyannote/pyannote-audio#installation).

TL;DR
-----

    # 1. visit hf.co/pyannote/speaker-diarization and accept user conditions
    # 2. visit hf.co/pyannote/segmentation and accept user conditions
    # 3. visit hf.co/settings/tokens to create an access token
    # 4. instantiate pretrained speaker diarization pipeline
    from pyannote.audio import Pipeline
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1",
                                        use_auth_token="ACCESS_TOKEN_GOES_HERE")
    
    
    # apply the pipeline to an audio file
    diarization = pipeline("audio.wav")
    
    # dump the diarization output to disk using RTTM format
    with open("audio.rttm", "w") as rttm:
        diarization.write_rttm(rttm)
    

Advanced usage
--------------

In case the number of speakers is known in advance, one can use the `num_speakers` option:

    diarization = pipeline("audio.wav", num_speakers=2)
    

One can also provide lower and/or upper bounds on the number of speakers using `min_speakers` and `max_speakers` options:

    diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
    

Benchmark
---------

### Real-time factor

Real-time factor is around 2.5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).

In other words, it takes approximately 1.5 minutes to process a one hour conversation.
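
The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch: the helper name `processing_time_seconds` is made up for illustration, and the 2.5% figure applies to the hardware listed above.

```python
def processing_time_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Estimate processing time as audio duration times real-time factor."""
    return audio_seconds * real_time_factor


one_hour = 60 * 60  # duration of the recording, in seconds
estimate = processing_time_seconds(one_hour, 0.025)  # 2.5% real-time factor
print(f"{estimate / 60:.1f} minutes")  # prints "1.5 minutes"
```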

### Accuracy

This pipeline is benchmarked on a growing collection of datasets.

Processing is fully automatic:

*   no manual voice activity detection (as is sometimes the case in the literature)
*   no manual number of speakers (though it is possible to provide it to the pipeline)
*   no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named _"Full"_ in [this paper](https://doi.org/10.1016/j.csl.2021.101254)):

*   no forgiveness collar
*   evaluation of overlapped speech
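
As a reminder of how the reported figures relate to each other, diarization error rate is the sum of false alarm, missed detection, and speaker confusion, divided by the total duration of reference speech. The sketch below uses a hypothetical helper, `diarization_error_rate`, and checks it against the AISHELL-4 rates reported in this benchmark (already expressed as percentages, hence a total of 100). It does not compute the error durations themselves, which requires aligning reference and hypothesis speakers.

```python
def diarization_error_rate(false_alarm: float, missed_detection: float,
                           confusion: float, total: float) -> float:
    """DER = (false alarm + missed detection + confusion) / total reference speech."""
    if total <= 0:
        raise ValueError("total reference speech duration must be positive")
    return (false_alarm + missed_detection + confusion) / total


# AISHELL-4 rates from this benchmark, given as percentages of reference speech
der = diarization_error_rate(5.17, 3.27, 5.65, total=100.0)
print(f"DER = {100 * der:.2f}%")  # prints "DER = 14.09%"
```

With no forgiveness collar, errors right at speaker turn boundaries count in full, and with overlapped speech evaluated, a missed second speaker counts as missed detection.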

| Benchmark | DER% | FA% | Miss% | Conf% | Expected output | File-level evaluation |
| --- | --- | --- | --- | --- | --- | --- |
| [AISHELL-4](http://www.openslr.org/111/) | 14.09 | 5.17 | 3.27 | 5.65 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AISHELL.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AISHELL.test.eval) |
| [Albayzin (_RTVE 2022_)](http://catedrartve.unizar.es/albayzindatabases.html) | 25.60 | 5.58 | 6.84 | 13.18 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/Albayzin2022.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/Albayzin2022.test.eval) |
| [AliMeeting (_channel 1_)](https://www.openslr.org/119/) | 27.42 | 4.84 | 14.00 | 8.58 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AliMeeting.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AliMeeting.test.eval) |
| [AMI (_headset mix,_](https://groups.inf.ed.ac.uk/ami/corpus/) [_only\_words_)](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 18.91 | 4.48 | 9.51 | 4.91 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AMI.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AMI.test.eval) |
| [AMI (_array1, channel 1,_](https://groups.inf.ed.ac.uk/ami/corpus/) [_only\_words)_](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 27.12 | 4.11 | 17.78 | 5.23 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AMI-SDM.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/AMI-SDM.test.eval) |
| [CALLHOME](https://catalog.ldc.upenn.edu/LDC2001S97) [(_part2_)](https://github.com/BUTSpeechFIT/CALLHOME_sublists/issues/1) | 32.37 | 6.30 | 13.72 | 12.35 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/CALLHOME.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/CALLHOME.test.eval) |
| [DIHARD 3 (_Full_)](https://arxiv.org/abs/2012.01477) | 26.94 | 10.50 | 8.41 | 8.03 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/DIHARD.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/DIHARD.test.eval) |
| [Ego4D _v1 (validation)_](https://arxiv.org/abs/2110.07058) | 63.99 | 3.91 | 44.42 | 15.67 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/Ego4D.development.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/Ego4D.development.eval) |
| [REPERE (_phase 2_)](https://islrn.org/resources/360-758-359-485-0/) | 8.17 | 2.23 | 2.49 | 3.45 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/REPERE.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/REPERE.test.eval) |
| [This American Life](https://arxiv.org/abs/2005.08072) | 20.82 | 2.03 | 11.89 | 6.90 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/ThisAmericanLife.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/2.1.1/reproducible_research/2.1.1/ThisAmericanLife.test.eval) |
| [VoxConverse (_v0.3_)](https://github.com/joonson/voxconverse) | 11.24 | 4.42 | 2.88 | 3.94 | [RTTM](https://huggingface.co/pyannote/speaker-diarization/blob/main/reproducible_research/2.1.1/VoxConverse.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization/blob/main/reproducible_research/2.1.1/VoxConverse.test.eval) |

DER% is the diarization error rate, FA% the false alarm rate, Miss% the missed detection rate, and Conf% the speaker confusion rate.

Technical report
----------------

This [report](/pyannote/speaker-diarization/blob/main/technical_report_2.1.pdf) describes the main principles behind version `2.1` of the pyannote.audio speaker diarization pipeline.  
It also provides recipes explaining how to adapt the pipeline to your own set of annotated data. In particular, these recipes are applied to the above benchmark and consistently lead to significant performance improvement over the above out-of-the-box performance.

Citations
---------

    @inproceedings{Bredin2021,
      Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
      Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
      Booktitle = {Proc. Interspeech 2021},
      Address = {Brno, Czech Republic},
      Month = {August},
      Year = {2021},
    }
    

    @inproceedings{Bredin2020,
      Title = {{pyannote.audio: neural building blocks for speaker diarization}},
      Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
      Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
      Address = {Barcelona, Spain},
      Month = {May},
      Year = {2020},
    }

## Model overview

The `speaker-diarization` model is an open-source pipeline created by the [pyannote](https://aimodels.fyi/creators/huggingFace/pyannote) team, whose authors also offer consulting services. It performs speaker diarization, which is the process of partitioning an audio recording into homogeneous segments according to speaker identity. This is useful for applications like meeting transcription, where it's important to know which speaker said what.

The model relies on the [pyannote.audio](https://github.com/pyannote/pyannote-audio) library, which provides a set of neural network-based building blocks for speaker diarization. The pipeline comes pre-trained and can be used off-the-shelf without the need for further fine-tuning.

## Model inputs and outputs

### Inputs
- **Audio file**: The audio file to be processed for speaker diarization.

### Outputs
- **Diarization**: The output of the speaker diarization process, which includes information about the start and end times of each speaker's turn, as well as the speaker labels. The output can be saved in the RTTM (Rich Transcription Time Marked) format.
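
To give an idea of what a saved RTTM file contains, here is a minimal parsing sketch using only the standard library. It assumes the usual `SPEAKER` line layout (file identifier, channel, start time, duration, and speaker label in the fourth, fifth, and eighth fields); the helper names `parse_rttm` and `speaking_time` and the sample content are made up for illustration.

```python
from collections import defaultdict


def parse_rttm(text):
    """Return a list of (start, end, speaker) tuples from RTTM-formatted text."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank or unrelated lines; speaker turns start with "SPEAKER"
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        turns.append((start, start + duration, fields[7]))
    return turns


def speaking_time(turns):
    """Sum the duration of all turns for each speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Hypothetical RTTM content, made up for this example
example = """\
SPEAKER audio 1 0.500 2.000 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 2.500 1.500 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER audio 1 4.000 1.000 <NA> <NA> SPEAKER_00 <NA> <NA>
"""

turns = parse_rttm(example)
print(speaking_time(turns))  # {'SPEAKER_00': 3.0, 'SPEAKER_01': 1.5}
```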

## Capabilities

The `speaker-diarization` model is a fully automatic pipeline that doesn't require any manual intervention, such as manual voice activity detection or manual specification of the number of speakers. It is benchmarked on a growing collection of datasets under the least forgiving evaluation setup (no forgiveness collar, overlapped speech evaluated), with diarization error rates ranging from 8.17% on REPERE to 63.99% on Ego4D.

## What can I use it for?

The `speaker-diarization` model can be used in various applications that involve audio processing, such as meeting transcription, audio indexing, and speaker attribution in podcasts or interviews. By automatically separating the audio into speaker turns, the model can greatly simplify the process of transcribing and analyzing audio recordings.
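
For the meeting transcription use case, one common way to decide who said what is to give each transcribed word the speaker whose turn overlaps it the most. The sketch below illustrates that idea only: the word timings, the turn list, and the helper names `overlap` and `assign_speakers` are hypothetical, and the words would come from a separate speech recognition system that is not part of this pipeline.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Duration shared by two time intervals (0.0 if they do not intersect)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each (start, end, text) word with the most-overlapping speaker.

    `turns` is a list of (start, end, speaker) tuples. Words that overlap
    no turn are labeled None.
    """
    labeled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(w_start, w_end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, text))
    return labeled


# Hypothetical diarization turns and word timings
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.0, "SPEAKER_01")]
words = [(0.2, 0.6, "hello"), (1.8, 2.6, "there"), (5.0, 5.4, "bye")]
print(assign_speakers(words, turns))
# [('SPEAKER_00', 'hello'), ('SPEAKER_01', 'there'), (None, 'bye')]
```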

## Things to try

One interesting aspect of the `speaker-diarization` model is its ability to handle a variable number of speakers. If the number of speakers is known in advance, you can provide this information to the model using the `num_speakers` option. Alternatively, you can specify a range for the number of speakers using the `min_speakers` and `max_speakers` options.

Another feature to explore is the pipeline's processing speed. It is benchmarked at a real-time factor of around 2.5% on one Nvidia Tesla V100 SXM2 GPU and one Intel Cascade Lake 6248 CPU, meaning it can process a one-hour conversation in approximately 1.5 minutes. This makes it practical for processing large volumes of recordings, though figures will differ on other hardware.


Speaker segmentation
====================

[Paper](http://arxiv.org/abs/2104.04045) | [Demo](https://huggingface.co/spaces/pyannote/pretrained-pipelines) | [Blog post](https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all)

[![Example](/pyannote/segmentation/media/main/example.png)](/pyannote/segmentation/blob/main/example.png)

Usage
-----

Relies on pyannote.audio 2.1.1: see [installation instructions](https://github.com/pyannote/pyannote-audio).

    # 1. visit hf.co/pyannote/segmentation and accept user conditions
    # 2. visit hf.co/settings/tokens to create an access token
    # 3. instantiate pretrained model
    from pyannote.audio import Model
    model = Model.from_pretrained("pyannote/segmentation", 
                                  use_auth_token="ACCESS_TOKEN_GOES_HERE")
    

### Voice activity detection

    from pyannote.audio.pipelines import VoiceActivityDetection
    pipeline = VoiceActivityDetection(segmentation=model)
    HYPER_PARAMETERS = {
      # onset/offset activation thresholds
      "onset": 0.5, "offset": 0.5,
      # remove speech regions shorter than that many seconds.
      "min_duration_on": 0.0,
      # fill non-speech regions shorter than that many seconds.
      "min_duration_off": 0.0
    }
    pipeline.instantiate(HYPER_PARAMETERS)
    vad = pipeline("audio.wav")
    # `vad` is a pyannote.core.Annotation instance containing speech regions
    

### Overlapped speech detection

    from pyannote.audio.pipelines import OverlappedSpeechDetection
    pipeline = OverlappedSpeechDetection(segmentation=model)
    # reuses HYPER_PARAMETERS from the voice activity detection example
    pipeline.instantiate(HYPER_PARAMETERS)
    osd = pipeline("audio.wav")
    # `osd` is a pyannote.core.Annotation instance containing overlapped speech regions
    

### Resegmentation

    from pyannote.audio.pipelines import Resegmentation
    pipeline = Resegmentation(segmentation=model,
                              diarization="baseline")
    # reuses HYPER_PARAMETERS from the voice activity detection example
    pipeline.instantiate(HYPER_PARAMETERS)
    # `baseline` is the existing diarization to refine; it should be
    # provided as a pyannote.core.Annotation instance
    resegmented_baseline = pipeline({"audio": "audio.wav", "baseline": baseline})
    

### Raw scores

    from pyannote.audio import Inference
    inference = Inference(model)
    segmentation = inference("audio.wav")
    # `segmentation` is a pyannote.core.SlidingWindowFeature
    # instance containing raw segmentation scores like the 
    # one pictured above (output)
    

Citation
--------

    @inproceedings{Bredin2021,
      Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
      Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
      Booktitle = {Proc. Interspeech 2021},
      Address = {Brno, Czech Republic},
      Month = {August},
      Year = {2021},
    }
    

    @inproceedings{Bredin2020,
      Title = {{pyannote.audio: neural building blocks for speaker diarization}},
      Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
      Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
      Address = {Barcelona, Spain},
      Month = {May},
      Year = {2020},
    }
    

Reproducible research
---------------------

In order to reproduce the results of the paper ["End-to-end speaker segmentation for overlap-aware resegmentation"](https://arxiv.org/abs/2104.04045), use `pyannote/segmentation@Interspeech2021` with the following hyper-parameters:

| Voice activity detection | `onset` | `offset` | `min_duration_on` | `min_duration_off` |
| --- | --- | --- | --- | --- |
| AMI Mix-Headset | 0.684 | 0.577 | 0.181 | 0.037 |
| DIHARD3 | 0.767 | 0.377 | 0.136 | 0.067 |
| VoxConverse | 0.767 | 0.713 | 0.182 | 0.501 |

| Overlapped speech detection | `onset` | `offset` | `min_duration_on` | `min_duration_off` |
| --- | --- | --- | --- | --- |
| AMI Mix-Headset | 0.448 | 0.362 | 0.116 | 0.187 |
| DIHARD3 | 0.430 | 0.320 | 0.091 | 0.144 |
| VoxConverse | 0.587 | 0.426 | 0.337 | 0.112 |

| Resegmentation of VBx | `onset` | `offset` | `min_duration_on` | `min_duration_off` |
| --- | --- | --- | --- | --- |
| AMI Mix-Headset | 0.542 | 0.527 | 0.044 | 0.705 |
| DIHARD3 | 0.592 | 0.489 | 0.163 | 0.182 |
| VoxConverse | 0.537 | 0.724 | 0.410 | 0.563 |

Expected outputs (and VBx baseline) are also provided in the `/reproducible_research` sub-directories.
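
To make the role of the four hyper-parameters more concrete, here is a pure-Python sketch of hysteresis thresholding on a list of per-frame scores. It illustrates the general technique and is not the pyannote.audio implementation: the function name `binarize`, the fixed frame duration, the sample scores, and the order of the two post-processing steps (gaps filled first, short regions removed second) are all assumptions made for this example.

```python
def binarize(scores, frame_duration, onset=0.5, offset=0.5,
             min_duration_on=0.0, min_duration_off=0.0):
    """Turn per-frame scores into a list of (start, end) active regions."""
    regions = []
    active = False
    start = 0.0
    for index, score in enumerate(scores):
        time = index * frame_duration
        if not active and score > onset:
            # Switch on once the score rises above the onset threshold
            active, start = True, time
        elif active and score < offset:
            # Switch off once the score falls below the offset threshold
            regions.append((start, time))
            active = False
    if active:
        regions.append((start, len(scores) * frame_duration))

    # Fill inactive gaps shorter than min_duration_off
    merged = []
    for region in regions:
        if merged and region[0] - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], region[1])
        else:
            merged.append(region)

    # Remove active regions shorter than min_duration_on
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


# Hypothetical scores, one every 0.5 seconds
scores = [0.1, 0.8, 0.9, 0.4, 0.2, 0.7, 0.1, 0.1, 0.9, 0.9]
print(binarize(scores, 0.5, onset=0.6, offset=0.3))
# [(0.5, 2.0), (2.5, 3.0), (4.0, 5.0)]
print(binarize(scores, 0.5, onset=0.6, offset=0.3, min_duration_off=0.6))
# [(0.5, 3.0), (4.0, 5.0)]
```

Note how the score of 0.4 in the fourth frame does not end the first region: it is below `onset` but still above `offset`.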

## Model overview

The `segmentation` model from pyannote is an open-source model for speaker segmentation. It can be used for tasks like voice activity detection, overlapped speech detection, and resegmentation. The model was trained using the techniques described in the [End-to-end speaker segmentation for overlap-aware resegmentation](http://arxiv.org/abs/2104.04045) paper. Related models from pyannote include the [speaker-diarization](https://aimodels.fyi/models/huggingFace/speaker-diarization-pyannote) pipeline, which performs full speaker diarization.

## Model inputs and outputs

The `segmentation` model takes audio samples as input and outputs speaker segmentation information. This can include the start and end times of speech regions, as well as indications of overlapping speech.

### Inputs
- **Audio file**: The usage examples pass the path to an audio file (such as `"audio.wav"`) to the pipelines built on the model and to `Inference`.

### Outputs
- **Speech regions**: The `VoiceActivityDetection` pipeline built on the model outputs a `pyannote.core.Annotation` instance containing the detected speech regions.
- **Overlapped speech regions**: The `OverlappedSpeechDetection` pipeline outputs a `pyannote.core.Annotation` instance containing regions of overlapping speech.
- **Resegmented diarization**: The `Resegmentation` pipeline takes an existing diarization, provided as a `pyannote.core.Annotation` instance, and returns a refined version of it.
- **Raw segmentation scores**: `Inference` returns the raw segmentation scores as a `pyannote.core.SlidingWindowFeature` instance, which can be useful for further analysis.

## Capabilities

The `segmentation` model from pyannote supports several speaker-related tasks beyond basic voice activity detection. It can identify overlapping speech, which is useful for more accurate diarization, and can also be used to resegment existing diarization output.

## What can I use it for?

The `segmentation` model could be used in a variety of applications that require speaker-level information from audio, such as:

- Automatic transcription and captioning tools
- Audio-based analytics and customer service applications
- Podcast and meeting processing pipelines
- Enhancing existing speaker diarization systems

The pyannote authors also offer [consulting services](https://herve.niderb.fr/consulting.html) to help users make the most of their open-source models in production.

## Things to try

One interesting aspect of the `segmentation` model is its ability to output raw segmentation scores, which can be useful for further analysis and experimentation. For example, you could try visualizing the segmentation scores over time to better understand the model's decision-making process.

Additionally, the model's overlap detection capabilities could be leveraged to improve downstream tasks like speaker diarization or meeting summarization. By being aware of regions with overlapping speech, the model can help create more accurate speaker profiles and transcripts.
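
For intuition about what an overlapped speech region is, the sketch below derives such regions from a list of speaker turns with a simple sweep over turn boundaries. The segmentation model detects overlap directly from audio, so this is only an illustration; the helper name `overlapped_regions` and the turns are hypothetical.

```python
def overlapped_regions(turns):
    """Return (start, end) regions where two or more turns are active."""
    events = []
    for start, end, _speaker in turns:
        events.append((start, 1))   # a speaker starts talking
        events.append((end, -1))    # a speaker stops talking
    # At identical timestamps, process ends (-1) before starts (+1) so that
    # back-to-back turns are not reported as overlapping.
    events.sort(key=lambda event: (event[0], event[1]))

    regions = []
    active = 0
    overlap_start = None
    for time, delta in events:
        previous = active
        active += delta
        if previous < 2 <= active:
            # Second simultaneous speaker: an overlap region begins
            overlap_start = time
        elif active < 2 <= previous:
            # Back to a single speaker: the overlap region ends
            if time > overlap_start:
                regions.append((overlap_start, time))
            overlap_start = None
    return regions


# Hypothetical speaker turns as (start, end, speaker)
turns = [(0.0, 3.0, "A"), (2.0, 5.0, "B"), (4.5, 6.0, "C"), (6.0, 7.0, "A")]
print(overlapped_regions(turns))  # [(2.0, 3.0), (4.5, 5.0)]
```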


Speaker diarization 3.1
=======================

This pipeline is the same as [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0) except it removes the [problematic](https://github.com/pyannote/pyannote-audio/issues/1537) use of `onnxruntime`.  
Both speaker segmentation and embedding now run in pure PyTorch. This should ease deployment and possibly speed up inference.  
It requires pyannote.audio version 3.1 or higher.

It ingests mono audio sampled at 16kHz and outputs speaker diarization as an [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance:

*   stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
*   audio files sampled at a different rate are resampled to 16kHz automatically upon loading.
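
The pipeline performs both steps on its own when loading audio. Purely to illustrate what "averaging the channels" means, here is a small sketch on plain Python lists; the helper name `downmix_to_mono` and the sample values are made up, and this is not the code the pipeline runs.

```python
def downmix_to_mono(channels):
    """Average equally long channels sample by sample."""
    if not channels:
        raise ValueError("at least one channel is required")
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("all channels must have the same number of samples")
    # zip(*channels) yields one tuple per time step, with one sample per channel
    return [sum(frame) / len(channels) for frame in zip(*channels)]


# Hypothetical stereo signal with four samples per channel
left = [0.5, -0.5, 1.0, 0.0]
right = [0.5, 0.5, 0.0, -1.0]
print(downmix_to_mono([left, right]))  # [0.5, 0.0, 0.5, -0.5]
```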

Requirements
------------

1.  Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.1` with `pip install pyannote.audio`
2.  Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
3.  Accept [`pyannote/speaker-diarization-3.1`](https://hf.co/pyannote/speaker-diarization-3.1) user conditions
4.  Create access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).

Usage
-----

    # instantiate the pipeline
    from pyannote.audio import Pipeline
    pipeline = Pipeline.from_pretrained(
      "pyannote/speaker-diarization-3.1",
      use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
    
    # run the pipeline on an audio file
    diarization = pipeline("audio.wav")
    
    # dump the diarization output to disk using RTTM format
    with open("audio.rttm", "w") as rttm:
        diarization.write_rttm(rttm)
    

### Processing on GPU

`pyannote.audio` pipelines run on CPU by default. You can send them to GPU with the following lines:

    import torch
    pipeline.to(torch.device("cuda"))
    

### Processing from memory

Pre-loading audio files in memory may result in faster processing:

    import torchaudio
    waveform, sample_rate = torchaudio.load("audio.wav")
    diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
    

### Monitoring progress

Hooks are available to monitor the progress of the pipeline:

    from pyannote.audio.pipelines.utils.hook import ProgressHook
    with ProgressHook() as hook:
        diarization = pipeline("audio.wav", hook=hook)
    

### Controlling the number of speakers

In case the number of speakers is known in advance, one can use the `num_speakers` option:

    diarization = pipeline("audio.wav", num_speakers=2)
    

One can also provide lower and/or upper bounds on the number of speakers using `min_speakers` and `max_speakers` options:

    diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
    

Benchmark
---------

This pipeline has been benchmarked on a large collection of datasets.

Processing is fully automatic:

*   no manual voice activity detection (as is sometimes the case in the literature)
*   no manual number of speakers (though it is possible to provide it to the pipeline)
*   no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named _"Full"_ in [this paper](https://doi.org/10.1016/j.csl.2021.101254)):

*   no forgiveness collar
*   evaluation of overlapped speech

| Benchmark | DER% | FA% | Miss% | Conf% | Expected output | File-level evaluation |
| --- | --- | --- | --- | --- | --- | --- |
| [AISHELL-4](http://www.openslr.org/111/) | 12.2 | 3.8 | 4.4 | 4.0 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.eval) |
| [AliMeeting (_channel 1_)](https://www.openslr.org/119/) | 24.4 | 4.4 | 10.0 | 10.0 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.eval) |
| [AMI (_headset mix,_](https://groups.inf.ed.ac.uk/ami/corpus/) [_only\_words_)](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 18.8 | 3.6 | 9.5 | 5.7 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.eval) |
| [AMI (_array1, channel 1,_](https://groups.inf.ed.ac.uk/ami/corpus/) [_only\_words)_](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 22.4 | 3.8 | 11.2 | 7.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.eval) |
| [AVA-AVD](https://arxiv.org/abs/2111.14448) | 50.0 | 10.8 | 15.7 | 23.4 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.eval) |
| [DIHARD 3 (_Full_)](https://arxiv.org/abs/2012.01477) | 21.7 | 6.2 | 8.1 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.eval) |
| [MSDWild](https://x-lance.github.io/MSDWILD/) | 25.3 | 5.8 | 8.0 | 11.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.eval) |
| [REPERE (_phase 2_)](https://islrn.org/resources/360-758-359-485-0/) | 7.8 | 1.8 | 2.6 | 3.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.eval) |
| [VoxConverse (_v0.3_)](https://github.com/joonson/voxconverse) | 11.3 | 4.1 | 3.4 | 3.8 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.eval) |

DER% is the diarization error rate, FA% the false alarm rate, Miss% the missed detection rate, and Conf% the speaker confusion rate.

Citations
---------

    @inproceedings{Plaquet23,
      author={Alexis Plaquet and Herv{\'e} Bredin},
      title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
      year=2023,
      booktitle={Proc. INTERSPEECH 2023},
    }
    

    @inproceedings{Bredin23,
      author={Herv{\'e} Bredin},
      title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
      year=2023,
      booktitle={Proc. INTERSPEECH 2023},
    }

## Model overview

The `speaker-diarization-3.1` model is a pipeline developed by the `pyannote` team that performs speaker diarization on audio data. It is an updated version of the `speaker-diarization-3.0` model, removing the problematic use of `onnxruntime` and running the speaker segmentation and embedding entirely in PyTorch. This should ease deployment and potentially speed up inference.

The model takes in mono audio sampled at 16kHz and outputs speaker diarization as an `Annotation` instance. It can handle stereo or multi-channel audio by automatically downmixing to mono, and it can resample audio files to 16kHz upon loading.

Compared to the previous `speaker-diarization-3.0` model, this updated version should provide a smoother and more efficient experience for users integrating the model into their applications.

## Model inputs and outputs

### Inputs
- **Mono audio sampled at 16kHz**: The pipeline accepts a single-channel audio file sampled at 16kHz. It can automatically handle stereo or multi-channel audio by downmixing to mono.

### Outputs
- **Speaker diarization**: The pipeline outputs a `pyannote.core.Annotation` instance containing the speaker diarization for the input audio.

## Capabilities

The `speaker-diarization-3.1` model segments an audio recording and labels its different speakers. It handles overlapping speech and varying numbers of speakers. The model has been benchmarked on a wide range of datasets, including [AISHELL-4](http://www.openslr.org/111/), [AliMeeting](https://www.openslr.org/119/), [AMI](https://groups.inf.ed.ac.uk/ami/corpus/), [AVA-AVD](https://arxiv.org/abs/2111.14448), [DIHARD 3](https://arxiv.org/abs/2012.01477), [MSDWild](https://x-lance.github.io/MSDWILD/), [REPERE](https://islrn.org/resources/360-758-359-485-0/), and [VoxConverse](https://github.com/joonson/voxconverse), with diarization error rates ranging from 7.8% on REPERE to 50.0% on AVA-AVD.

## What can I use it for?

The `speaker-diarization-3.1` model can be valuable for a variety of audio-based applications that require identifying and separating different speakers. Some potential use cases include:

- **Meeting transcription and analysis**: Automatically segmenting and labeling speakers in audio recordings of meetings, conferences, or interviews to facilitate post-processing and analysis.
- **Audio forensics and investigation**: Separating and identifying speakers in audio evidence to aid in investigations and legal proceedings.
- **Podcast and audio content production**: Streamlining the editing and post-production process for podcasts, audio books, and other multimedia content by automating speaker segmentation.
- **Conversational AI and voice assistants**: Improving the ability of voice-based systems to track and respond to multiple speakers in real-time conversations.

## Things to try

One interesting aspect of the `speaker-diarization-3.1` model is the ability to constrain the number of speakers expected in the audio. The `num_speakers`, `min_speakers`, and `max_speakers` options let you adapt the pipeline's behavior to your use case without any retraining. For example, if you know the audio you're processing has a fixed number of speakers, you can set `num_speakers` to that value to potentially improve accuracy.

Additionally, the pipeline accepts hooks for monitoring its progress, which can be useful for long-running or batch processing tasks. Passing a `ProgressHook` lets you follow how far along processing is for the current file.


Speaker diarization 3.0
=======================

This pipeline has been trained by Séverin Baroudi with [pyannote.audio](https://github.com/pyannote/pyannote-audio) `3.0.0` using a combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.

It ingests mono audio sampled at 16kHz and outputs speaker diarization as an [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance:

*   stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
*   audio files sampled at a different rate are resampled to 16kHz automatically upon loading.

Requirements
------------

1.  Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.0` with `pip install pyannote.audio`
2.  Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
3.  Accept [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0) user conditions
4.  Create access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).

Usage
-----

    # instantiate the pipeline
    from pyannote.audio import Pipeline
    pipeline = Pipeline.from_pretrained(
      "pyannote/speaker-diarization-3.0",
      use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
    
    # run the pipeline on an audio file
    diarization = pipeline("audio.wav")
    
    # dump the diarization output to disk using RTTM format
    with open("audio.rttm", "w") as rttm:
        diarization.write_rttm(rttm)
    

### Processing on GPU

`pyannote.audio` pipelines run on CPU by default. You can send them to GPU with the following lines:

    import torch
    pipeline.to(torch.device("cuda"))
    

Real-time factor is around 2.5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).

In other words, it takes approximately 1.5 minutes to process a one hour conversation.
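
The arithmetic behind that figure is a single multiplication: processing time equals audio duration times real-time factor. A small sketch, with a hypothetical helper name:

```python
def processing_time_seconds(audio_duration_seconds, real_time_factor):
    """Estimated processing time for a recording of the given duration."""
    return audio_duration_seconds * real_time_factor


one_hour = 60 * 60  # seconds
minutes = processing_time_seconds(one_hour, real_time_factor=0.025) / 60
print(f"about {minutes:.1f} minutes")
```

The same formula gives a rough estimate for other durations, as long as the hardware is comparable to the setup quoted above.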

### [](#processing-from-memory)Processing from memory

Pre-loading audio files in memory may result in faster processing:

    import torchaudio
    # load the whole file as a (channel, time) tensor
    waveform, sample_rate = torchaudio.load("audio.wav")
    diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
    

### [](#monitoring-progress)Monitoring progress

Hooks are available to monitor the progress of the pipeline:

    from pyannote.audio.pipelines.utils.hook import ProgressHook
    with ProgressHook() as hook:
        diarization = pipeline("audio.wav", hook=hook)
    

### [](#controlling-the-number-of-speakers)Controlling the number of speakers

In case the number of speakers is known in advance, one can use the `num_speakers` option:

    diarization = pipeline("audio.wav", num_speakers=2)
    

One can also provide lower and/or upper bounds on the number of speakers using `min_speakers` and `max_speakers` options:

    diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
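
One way to read these three options together is as a single inclusive range, where `num_speakers` pins both ends. The helper below is a hypothetical illustration of that reading, not the pipeline's actual code. The defaults used when a bound is missing (1 and infinity) are assumptions.

```python
def resolve_speaker_bounds(num_speakers=None, min_speakers=None, max_speakers=None):
    """Collapse the three options into one inclusive (low, high) range."""
    if num_speakers is not None:
        # a known speaker count pins both ends of the range
        return num_speakers, num_speakers
    low = 1 if min_speakers is None else min_speakers
    high = float("inf") if max_speakers is None else max_speakers
    if low > high:
        raise ValueError("min_speakers must not exceed max_speakers")
    return low, high


print(resolve_speaker_bounds(num_speakers=2))
print(resolve_speaker_bounds(min_speakers=2, max_speakers=5))
```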
    

[](#benchmark)Benchmark
-----------------------

This pipeline has been benchmarked on a large collection of datasets.

Processing is fully automatic:

*   no manual voice activity detection (as is sometimes the case in the literature)
*   no manual number of speakers (though it is possible to provide it to the pipeline)
*   no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named _"Full"_ in [this paper](https://doi.org/10.1016/j.csl.2021.101254)):

*   no forgiveness collar
*   evaluation of overlapped speech
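
For reference, the diarization error rate adds up three kinds of errors (false alarm, missed detection and speaker confusion) and divides their total duration by the duration of reference speech. The sketch below uses a hypothetical function name and made-up durations, chosen so that the three components mirror the AISHELL-4 row of the benchmark.

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total_speech):
    """All arguments are durations in seconds; returns DER as a fraction."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed_detection + confusion) / total_speech


# made-up durations: 3.8 s + 4.4 s + 4.1 s of errors over 100 s of speech
der = diarization_error_rate(3.8, 4.4, 4.1, total_speech=100.0)
print(f"DER = {100 * der:.1f}%")
```

This is why, in each benchmark row, the FA%, Miss% and Conf% columns add up to DER% (up to rounding).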

| Benchmark | DER% | FA% | Miss% | Conf% | Expected output | File-level evaluation |
| --- | --- | --- | --- | --- | --- | --- |
| [AISHELL-4](http://www.openslr.org/111/) | 12.3 | 3.8 | 4.4 | 4.1 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.eval) |
| [AliMeeting (_channel 1_)](https://www.openslr.org/119/) | 24.3 | 4.4 | 10.0 | 9.9 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.eval) |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (_headset mix_, [_only\_words_](https://github.com/BUTSpeechFIT/AMI-diarization-setup)) | 19.0 | 3.6 | 9.5 | 5.9 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.eval) |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (_array1, channel 1_, [_only\_words_](https://github.com/BUTSpeechFIT/AMI-diarization-setup)) | 22.2 | 3.8 | 11.2 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.eval) |
| [AVA-AVD](https://arxiv.org/abs/2111.14448) | 49.1 | 10.8 | 15.7 | 22.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.eval) |
| [DIHARD 3 (_Full_)](https://arxiv.org/abs/2012.01477) | 21.7 | 6.2 | 8.1 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.eval) |
| [MSDWild](https://x-lance.github.io/MSDWILD/) | 24.6 | 5.8 | 8.0 | 10.7 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.eval) |
| [REPERE (_phase 2_)](https://islrn.org/resources/360-758-359-485-0/) | 7.8 | 1.8 | 2.6 | 3.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.eval) |
| [VoxConverse (_v0.3_)](https://github.com/joonson/voxconverse) | 11.3 | 4.1 | 3.4 | 3.8 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.eval) |

DER%: diarization error rate. FA%: false alarm rate. Miss%: missed detection rate. Conf%: speaker confusion rate.

[](#citations)Citations
-----------------------

    @inproceedings{Plaquet23,
      author={Alexis Plaquet and Hervé Bredin},
      title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
      year=2023,
      booktitle={Proc. INTERSPEECH 2023},
    }
    

    @inproceedings{Bredin23,
      author={Hervé Bredin},
      title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
      year=2023,
      booktitle={Proc. INTERSPEECH 2023},
    }

## Model overview

The `speaker-diarization-3.0` model is an open-source pipeline for speaker diarization, trained by Séverin Baroudi using the [pyannote.audio](https://github.com/pyannote/pyannote-audio) library version 3.0.0. It takes in mono audio sampled at 16kHz and outputs speaker diarization as an `Annotation` instance, which can be used to identify who is speaking when in the audio. The pipeline was trained on a combination of several popular speech datasets, including AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.

It addresses the same task as the [speaker-diarization](https://aimodels.fyi/models/huggingFace/speaker-diarization-pyannote) pipeline, which relies on an earlier version of the pyannote.audio library.

## Model inputs and outputs

### Inputs
- Mono audio sampled at 16kHz

### Outputs
- An `Annotation` instance containing the speaker diarization information, which can be used to identify when each speaker is talking.

## Capabilities

The `speaker-diarization-3.0` model identifies speakers and when they are talking in a given audio recording. It can handle stereo or multi-channel audio by automatically downmixing to mono, and can also resample audio files to 16kHz if needed. In the benchmark above, it reaches a diarization error rate (DER) of 12.3% on AISHELL-4, with DER ranging from 7.8% (REPERE) to 49.1% (AVA-AVD) across datasets.

## What can I use it for?

The `speaker-diarization-3.0` model can be useful for a variety of applications that require identifying speakers in audio, such as:

- Transcription and captioning for meetings or interviews
- Speaker tracking in security or surveillance applications
- Audience analysis for podcasts or other audio content
- Improving speech recognition systems by leveraging speaker information

The maintainers of the model also offer [consulting services](https://herve.niderb.fr/consulting.html) for organizations looking to use this pipeline in production.

## Things to try

One interesting aspect of the `speaker-diarization-3.0` model is its ability to process audio on GPU, which can significantly improve the inference speed. The model achieves a real-time factor of around 2.5% when running on a single Nvidia Tesla V100 SXM2 GPU, meaning it can process a one-hour conversation in about 1.5 minutes.

Developers can also experiment with running the model directly from memory, which may provide further performance improvements. The pipeline also offers hooks to monitor the progress of the diarization process, which can be useful for debugging and understanding the model's behavior.


[](#-powerset-speaker-segmentation) "Powerset" speaker segmentation
=======================================================================

This model ingests 10 seconds of mono audio sampled at 16kHz and outputs speaker diarization as a (num\_frames, num\_classes) matrix where the 7 classes are _non-speech_, _speaker #1_, _speaker #2_, _speaker #3_, _speakers #1 and #2_, _speakers #1 and #3_, and _speakers #2 and #3_.
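
The seven classes are exactly the subsets of three speakers that contain at most two of them: one empty set, three singletons and three pairs. The sketch below enumerates them with the standard library. The helper names are hypothetical, and the ordering simply follows the list above, so it should not be taken as a statement about the model's internal class order.

```python
from itertools import combinations


def powerset_classes(max_speakers_per_chunk, max_speakers_per_frame):
    """List every speaker subset with at most `max_speakers_per_frame` members."""
    speakers = range(1, max_speakers_per_chunk + 1)
    classes = []
    for size in range(max_speakers_per_frame + 1):
        classes.extend(combinations(speakers, size))
    return classes


def to_multihot(speaker_set, max_speakers_per_chunk):
    """Turn one powerset class into a 0/1 vector with one entry per speaker."""
    return [int(s in speaker_set) for s in range(1, max_speakers_per_chunk + 1)]


classes = powerset_classes(max_speakers_per_chunk=3, max_speakers_per_frame=2)
for speaker_set in classes:
    print(speaker_set, to_multihot(speaker_set, 3))
```

The empty tuple plays the role of _non-speech_, and the multi-hot vectors correspond to the multi-label view of the same information.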

[![Example output](/pyannote/segmentation-3.0/media/main/example.png)](/pyannote/segmentation-3.0/blob/main/example.png)

    import torch
    
    # waveform (first row)
    batch_size = 1  # illustrative value
    duration, sample_rate, num_channels = 10, 16000, 1
    waveform = torch.randn(batch_size, num_channels, duration * sample_rate)
    
    # powerset multi-class encoding (second row)
    # `model` is the pretrained model instantiated in the Usage section below
    powerset_encoding = model(waveform)
    
    # multi-label encoding (third row)
    from pyannote.audio.utils.powerset import Powerset
    max_speakers_per_chunk, max_speakers_per_frame = 3, 2
    to_multilabel = Powerset(
        max_speakers_per_chunk, 
        max_speakers_per_frame).to_multilabel
    multilabel_encoding = to_multilabel(powerset_encoding)
    

The various concepts behind this model are described in detail in this [paper](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html).

It has been trained by Séverin Baroudi with [pyannote.audio](https://github.com/pyannote/pyannote-audio) `3.0.0` using the combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.

This [companion repository](https://github.com/FrenchKrab/IS2023-powerset-diarization/) by [Alexis Plaquet](https://frenchkrab.github.io/) also provides instructions on how to train or finetune such a model on your own data.

[](#requirements)Requirements
-----------------------------

1.  Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.0` with `pip install pyannote.audio`
2.  Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
3.  Create access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).

[](#usage)Usage
---------------

    # instantiate the model
    from pyannote.audio import Model
    model = Model.from_pretrained(
      "pyannote/segmentation-3.0", 
      use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
    

### [](#speaker-diarization)Speaker diarization

This model cannot be used to perform speaker diarization of full recordings on its own (it only processes 10s chunks).

See [pyannote/speaker-diarization-3.0](https://hf.co/pyannote/speaker-diarization-3.0) pipeline that uses an additional speaker embedding model to perform full recording speaker diarization.

### [](#voice-activity-detection)Voice activity detection

    from pyannote.audio.pipelines import VoiceActivityDetection
    pipeline = VoiceActivityDetection(segmentation=model)
    HYPER_PARAMETERS = {
      # remove speech regions shorter than that many seconds.
      "min_duration_on": 0.0,
      # fill non-speech regions shorter than that many seconds.
      "min_duration_off": 0.0
    }
    pipeline.instantiate(HYPER_PARAMETERS)
    vad = pipeline("audio.wav")
    # `vad` is a pyannote.core.Annotation instance containing speech regions
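
The effect of the two hyper-parameters can be pictured on a plain list of `(start, end)` speech regions. The sketch below is a simplified stand-in for the pipeline's post-processing, not its actual implementation: the function name is hypothetical, and applying gap filling before removing short regions is an assumption.

```python
def postprocess(regions, min_duration_on=0.0, min_duration_off=0.0):
    """Fill short gaps, then drop short regions; times are in seconds."""
    merged = []
    for start, end in sorted(regions):
        # fill the gap when the pause since the previous region is short
        if merged and start - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # drop regions that are still shorter than min_duration_on
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


raw = [(0.0, 1.0), (1.25, 2.0), (5.0, 5.125)]
cleaned = postprocess(raw, min_duration_on=0.5, min_duration_off=0.5)
print(cleaned)
```

With both values left at `0.0`, as in the snippet above, the regions come out unchanged.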
    

### [](#overlapped-speech-detection)Overlapped speech detection

    from pyannote.audio.pipelines import OverlappedSpeechDetection
    pipeline = OverlappedSpeechDetection(segmentation=model)
    HYPER_PARAMETERS = {
      # remove overlapped speech regions shorter than that many seconds.
      "min_duration_on": 0.0,
      # fill non-overlapped speech regions shorter than that many seconds.
      "min_duration_off": 0.0
    }
    pipeline.instantiate(HYPER_PARAMETERS)
    osd = pipeline("audio.wav")
    # `osd` is a pyannote.core.Annotation instance containing overlapped speech regions
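
To make "overlapped speech region" concrete, the sketch below derives such regions from a list of speaker turns: every stretch of time where at least two turns are active at once. This only illustrates the concept. The pipeline above predicts these regions from the model output rather than from turns, and the function name and example turns are made up. The sketch also assumes that a given speaker never has two turns running at the same time.

```python
def overlapped_regions(turns):
    """Return (start, end) spans where two or more turns are active."""
    events = []
    for start, end, _speaker in turns:
        events.append((start, 1))   # a turn begins
        events.append((end, -1))    # a turn ends
    # at equal times, ends sort before starts so touching turns do not overlap
    events.sort()
    regions, active, overlap_start = [], 0, None
    for time, delta in events:
        previous = active
        active += delta
        if previous < 2 <= active:
            overlap_start = time
        elif active < 2 <= previous:
            regions.append((overlap_start, time))
    return regions


turns = [(0.0, 4.0, "A"), (3.0, 6.0, "B"), (5.0, 7.0, "C"), (7.0, 8.0, "A")]
overlaps = overlapped_regions(turns)
print(overlaps)
```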
    

[](#citations)Citations
-----------------------

    @inproceedings{Plaquet23,
      author={Alexis Plaquet and Hervé Bredin},
      title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
      year=2023,
      booktitle={Proc. INTERSPEECH 2023},
    }
    

    @inproceedings{Bredin23,
      author={Hervé Bredin},
      title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
      year=2023,
      booktitle={Proc. INTERSPEECH 2023},
    }

## Model overview

The `segmentation-3.0` model from [pyannote](https://aimodels.fyi/creators/huggingFace/pyannote) is an open-source speaker segmentation model that can identify up to 3 speakers in a 10-second audio clip. It was trained on a combination of datasets including AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse. This model builds on the [speaker segmentation](https://aimodels.fyi/models/huggingFace/segmentation-pyannote) and [speaker diarization](https://aimodels.fyi/models/huggingFace/speaker-diarization-pyannote) models previously released by pyannote.

## Model inputs and outputs

### Inputs
- 10 seconds of mono audio sampled at 16kHz

### Outputs
- A (num_frames, num_classes) matrix where the 7 classes are _non-speech_, _speaker #1_, _speaker #2_, _speaker #3_, _speakers #1 and #2_, _speakers #1 and #3_, and _speakers #2 and #3_.

## Capabilities

The `segmentation-3.0` model can identify up to 3 speakers in a 10-second audio clip, including frames where two of them speak at the same time. This makes it useful for various speech processing tasks such as voice activity detection, overlapped speech detection, and resegmentation.

## What can I use it for?

The `segmentation-3.0` model can be used as a building block in speech and audio processing pipelines, such as the [speaker diarization](https://aimodels.fyi/models/huggingFace/speaker-diarization-pyannote) pipeline also provided by pyannote. By integrating this model, you can create more robust and accurate speaker diarization systems that can handle overlapping speech.

## Things to try

One interesting thing to try with the `segmentation-3.0` model is to fine-tune it on your own data using the [companion repository](https://github.com/FrenchKrab/IS2023-powerset-diarization/) provided by Alexis Plaquet. This can help adapt the model to your specific use case and potentially improve its performance on your data.

I offer (paid) scientific [consulting services](https://herve.niderb.fr/consulting.html) to companies willing to make the most of their data and open-source speech processing toolkits (and `pyannote` in particular).

[](#-voice-activity-detection) Voice activity detection
===========================================================

Relies on pyannote.audio 2.1: see [installation instructions](https://github.com/pyannote/pyannote-audio#installation).

    # 1. visit hf.co/pyannote/segmentation and accept user conditions
    # 2. visit hf.co/settings/tokens to create an access token
    # 3. instantiate pretrained voice activity detection pipeline
    
    from pyannote.audio import Pipeline
    pipeline = Pipeline.from_pretrained("pyannote/voice-activity-detection",
                                        use_auth_token="ACCESS_TOKEN_GOES_HERE")
    output = pipeline("audio.wav")
    
    for speech in output.get_timeline().support():
        # active speech between speech.start and speech.end
        ...
    

[](#citation)Citation
---------------------

    @inproceedings{Bredin2021,
      Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
      Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
      Booktitle = {Proc. Interspeech 2021},
      Address = {Brno, Czech Republic},
      Month = {August},
      Year = {2021},
    }
    

    @inproceedings{Bredin2020,
      Title = {{pyannote.audio: neural building blocks for speaker diarization}},
      Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
      Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
      Address = {Barcelona, Spain},
      Month = {May},
      Year = {2020},
    }

## Model overview

The `voice-activity-detection` model from the pyannote project identifies speech regions in audio. This model builds upon the pyannote.audio library, which provides a range of open-source speech processing tools. The maintainer, Hervé Bredin, offers [paid consulting services](https://herve.niderb.fr/consulting.html) to companies looking to leverage these tools in their own applications.

Related models provided by pyannote include [segmentation](https://aimodels.fyi/models/huggingFace/segmentation-pyannote), which performs speaker segmentation, and [speaker-diarization](https://aimodels.fyi/models/huggingFace/speaker-diarization-pyannote), which identifies individual speakers within an audio recording. The voice activity detection pipeline itself relies on the segmentation model (its usage instructions ask you to accept the `pyannote/segmentation` user conditions), and the three can be combined into a broader speech processing pipeline.

## Model inputs and outputs

### Inputs
- **Audio file**: The `voice-activity-detection` model takes a mono audio file sampled at 16kHz as input.

### Outputs
- **Speech regions**: The model outputs an `Annotation` instance, which contains information about the start and end times of detected speech regions in the input audio.

## Capabilities

The `voice-activity-detection` model is highly capable at identifying speech within audio recordings, even in the presence of background noise or overlapping speakers. By leveraging the pyannote.audio library, this model can be easily integrated into a wide range of speech processing applications, such as transcription, speaker diarization, and audio indexing.

## What can I use it for?

The `voice-activity-detection` model can be a valuable tool for companies looking to extract meaningful insights from audio data. For example, it could be used to locate the speech portions of meetings or podcasts before transcribing them, or to identify relevant audio segments for further processing, such as speaker diarization or emotion analysis.

## Things to try

One interesting application of the `voice-activity-detection` model could be to use it as a preprocessing step for other speech-related tasks. By first identifying the speech regions in an audio file, you can then focus your subsequent processing on these relevant portions, potentially improving the overall performance and efficiency of your system.
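
As a rough sketch of that preprocessing idea, speech regions expressed in seconds can be converted to sample index ranges and used to keep only the speech samples. The helper name is hypothetical, the 16kHz default follows the input format described above, and a plain list stands in for the waveform.

```python
def regions_to_sample_ranges(regions, sample_rate=16000):
    """Convert (start, end) times in seconds to (first, last + 1) sample indices."""
    return [
        (round(start * sample_rate), round(end * sample_rate))
        for start, end in regions
    ]


# two seconds of placeholder samples and one made-up speech region
waveform = list(range(2 * 16000))
ranges = regions_to_sample_ranges([(0.5, 1.25)])
speech_only = [sample for first, last in ranges for sample in waveform[first:last]]
print(ranges, len(speech_only))
```

Downstream steps then only see the samples inside detected speech, which is where the potential savings come from.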


[](#-speaker-embedding) Speaker embedding
=============================================

Relies on pyannote.audio 2.1: see [installation instructions](https://github.com/pyannote/pyannote-audio/).

This model is based on the [canonical x-vector TDNN-based architecture](https://ieeexplore.ieee.org/abstract/document/8461375), but with filter banks replaced with [trainable SincNet features](https://ieeexplore.ieee.org/document/8639585). See [`XVectorSincNet`](https://github.com/pyannote/pyannote-audio/blob/3c988c028dc505c64fe776720372f6fe816b585a/pyannote/audio/models/embedding/xvector.py#L104-L169) architecture for implementation details.

[](#basic-usage)Basic usage
---------------------------

    # 1. visit hf.co/pyannote/embedding and accept user conditions
    # 2. visit hf.co/settings/tokens to create an access token
    # 3. instantiate pretrained model
    from pyannote.audio import Model
    model = Model.from_pretrained("pyannote/embedding", 
                                  use_auth_token="ACCESS_TOKEN_GOES_HERE")
    

    from pyannote.audio import Inference
    inference = Inference(model, window="whole")
    embedding1 = inference("speaker1.wav")
    embedding2 = inference("speaker2.wav")
    # `embeddingX` is (1 x D) numpy array extracted from the file as a whole.
    
    from scipy.spatial.distance import cdist
    distance = cdist(embedding1, embedding2, metric="cosine")[0,0]
    # `distance` is a `float` describing how dissimilar speakers 1 and 2 are.
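
For readers who want to see what the `cosine` metric computes, it is one minus the cosine similarity of the two vectors, so it ranges from 0 (same direction) to 2 (opposite directions). Below is a dependency-free sketch on short made-up vectors; real embeddings are much longer, and `cosine_distance` is a hypothetical helper.

```python
import math


def cosine_distance(u, v):
    """1 - (u . v) / (|u| * |v|) for two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


same_direction = cosine_distance([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Because the vectors are normalized, only their direction matters: scaling an embedding does not change the distance.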
    

Using cosine distance directly, this model reaches 2.8% equal error rate (EER) on VoxCeleb 1 test set.  
This is without voice activity detection (VAD) nor probabilistic linear discriminant analysis (PLDA). Expect even better results when adding one of those.
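
The equal error rate is the operating point where the false acceptance rate (different speakers judged identical) equals the false rejection rate (same speaker judged different). The sketch below estimates it by sweeping a distance threshold over made-up trial distances. The function name and the data are hypothetical, and this coarse sweep is only one of several ways to estimate EER.

```python
def equal_error_rate(same_speaker_distances, different_speaker_distances):
    """Sweep thresholds and return the error rate where FAR and FRR are closest."""
    thresholds = sorted(set(same_speaker_distances) | set(different_speaker_distances))
    best_gap, best_rate = None, None
    for threshold in thresholds:
        # a trial is accepted as "same speaker" when its distance is small enough
        frr = sum(d > threshold for d in same_speaker_distances) / len(same_speaker_distances)
        far = sum(d <= threshold for d in different_speaker_distances) / len(different_speaker_distances)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate


same = [0.1, 0.2, 0.3, 0.6]       # made-up distances for same-speaker trials
different = [0.4, 0.7, 0.8, 0.9]  # made-up distances for different-speaker trials
eer = equal_error_rate(same, different)
print(f"EER = {100 * eer:.0f}%")
```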

[](#advanced-usage)Advanced usage
---------------------------------

### [](#running-on-gpu)Running on GPU

    import torch
    inference.to(torch.device("cuda"))
    embedding = inference("audio.wav")
    

### [](#extract-embedding-from-an-excerpt)Extract embedding from an excerpt

    from pyannote.audio import Inference
    from pyannote.core import Segment
    inference = Inference(model, window="whole")
    excerpt = Segment(13.37, 19.81)
    embedding = inference.crop("audio.wav", excerpt)
    # `embedding` is (1 x D) numpy array extracted from the file excerpt.
    

### [](#extract-embeddings-using-a-sliding-window)Extract embeddings using a sliding window

    from pyannote.audio import Inference
    inference = Inference(model, window="sliding",
                          duration=3.0, step=1.0)
    embeddings = inference("audio.wav")
    # `embeddings` is a (N x D) pyannote.core.SlidingWindowFeature
    # `embeddings[i]` is the embedding of the ith position of the 
    # sliding window, i.e. from [i * step, i * step + duration].
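
The window positions described in the comment can be listed explicitly. This sketch only keeps windows that fit entirely inside the recording. How the library treats a final partial window is not covered here, so the exact number of embeddings may differ. `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(total_duration, duration=3.0, step=1.0):
    """Return (start, end) of each window fully contained in the recording."""
    windows = []
    i = 0
    while i * step + duration <= total_duration:
        windows.append((i * step, i * step + duration))
        i += 1
    return windows


windows = sliding_windows(total_duration=5.0)
for i, (start, end) in enumerate(windows):
    print(f"embeddings[{i}] covers {start:.1f}s to {end:.1f}s")
```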
    

[](#citation)Citation
---------------------

    @inproceedings{Bredin2020,
      Title = {{pyannote.audio: neural building blocks for speaker diarization}},
      Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
      Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
      Address = {Barcelona, Spain},
      Month = {May},
      Year = {2020},
    }
    

    @inproceedings{Coria2020,
        author="Coria, Juan M. and Bredin, Herv{\'e} and Ghannay, Sahar and Rosset, Sophie",
        editor="Espinosa-Anke, Luis and Mart{\'i}n-Vide, Carlos and Spasi{\'{c}}, Irena",
        title="{A Comparison of Metric Learning Loss Functions for End-To-End Speaker Verification}",
        booktitle="Statistical Language and Speech Processing",
        year="2020",
        publisher="Springer International Publishing",
        pages="137--148",
        isbn="978-3-030-59430-5"
    }

## Model overview

The `embedding` model from `pyannote` is a speaker embedding model that uses the canonical x-vector TDNN-based architecture, but with filter banks replaced by trainable SincNet features. This model reaches 2.8% equal error rate (EER) on the VoxCeleb 1 test set without any additional processing like voice activity detection (VAD) or probabilistic linear discriminant analysis (PLDA). Compared to similar models like the [segmentation](https://aimodels.fyi/models/huggingFace/segmentation-pyannote) and [speaker-diarization](https://aimodels.fyi/models/huggingFace/speaker-diarization-pyannote) models from `pyannote`, the `embedding` model focuses specifically on extracting speaker embeddings from audio.

## Model inputs and outputs

The `embedding` model takes in an audio file and outputs a numpy array representing the speaker embedding for the entire file. This embedding can then be used for tasks like speaker verification, where you can compare the embeddings of two speakers to determine how similar they are.

### Inputs
- **Audio file**: The model accepts a single audio file as input, which can be in any format supported by the underlying audio library.

### Outputs
- **Speaker embedding**: The model outputs a numpy array of shape `(1, D)`, where `D` is the dimensionality of the speaker embedding. This embedding represents the speaker characteristics extracted from the entire input audio file.

## Capabilities

The `embedding` model is capable of extracting robust speaker embeddings from audio data, which can be useful for a variety of applications like speaker verification, diarization, and identification. By using trainable SincNet features, the model is able to achieve strong performance on speaker verification tasks without the need for additional processing steps.

## What can I use it for?

The `embedding` model can be used in a variety of applications that require speaker-level information, such as:

- **Speaker verification**: The model can be used to generate speaker embeddings that can be compared to determine if two audio samples are from the same speaker. This is useful for applications like access control or fraud detection.
- **Speaker diarization**: The model's embeddings can be used as input to a [speaker diarization](https://aimodels.fyi/models/huggingFace/speaker-diarization-pyannote) system to identify and segment different speakers within a longer audio recording.
- **Speaker identification**: The model's embeddings can be used to identify specific speakers within a dataset, which can be useful for applications like transcription or meeting analysis.

## Things to try

One interesting thing to try with the `embedding` model is to use it in combination with other audio processing techniques, such as voice activity detection (VAD) or probabilistic linear discriminant analysis (PLDA). By combining the model's speaker embeddings with these additional processing steps, you may be able to achieve even better performance on speaker verification and diarization tasks.

Another interesting experiment would be to fine-tune the model on a specific dataset or domain of interest, which could potentially improve its performance on certain types of audio data. The [maintainer's profile](https://aimodels.fyi/creators/huggingFace/pyannote) mentions that they offer consulting services to help users make the most of their open-source models, which could be a valuable resource for those looking to customize or optimize the `embedding` model for their specific needs.