SpeechBrain

Rank:

Average Model Cost: $0.0000

Number of Runs: 602,537

Models by this creator

spkrec-ecapa-voxceleb

The spkrec-ecapa-voxceleb model is a pretrained speaker verification model using ECAPA-TDNN embeddings, trained on the VoxCeleb dataset. It is trained on the VoxCeleb1 + VoxCeleb2 training data and can be used to extract speaker embeddings. The model is composed of an ECAPA-TDNN architecture with convolutional and residual blocks, and uses attentive statistical pooling to extract embeddings. Speaker verification is performed by comparing the cosine distance between speaker embeddings against a decision threshold. The model can be used for inference on both CPU and GPU, and it was trained using SpeechBrain. Please cite SpeechBrain if you use this model for your research or business.
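The verification step described above reduces to a cosine similarity between two embedding vectors, compared against a threshold. A minimal sketch of that scoring step (the toy 4-dimensional embeddings and the 0.25 threshold are illustrative placeholders, not values from the model; real ECAPA-TDNN embeddings are much larger):

```python
import math

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.25):
    """Decide 'same speaker' when the cosine score exceeds the threshold."""
    return cosine_score(emb_a, emb_b) >= threshold

# Toy embeddings standing in for ECAPA-TDNN outputs.
enroll = [0.9, 0.1, 0.3, 0.2]
test_same = [0.8, 0.2, 0.35, 0.15]
test_diff = [-0.5, 0.9, -0.2, 0.4]

print(same_speaker(enroll, test_same))   # high cosine score -> accept
print(same_speaker(enroll, test_diff))   # low cosine score -> reject
```

In practice the embeddings come from the pretrained model and the threshold is tuned on held-out trials; only the comparison logic is shown here.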

$-/run

549.5K

Hugging Face

emotion-recognition-wav2vec2-IEMOCAP

The emotion-recognition-wav2vec2-IEMOCAP model is a machine learning model trained to recognize emotions in audio data. It is trained on the IEMOCAP dataset, which contains recordings of acted emotional speech. The model uses the wav2vec2 architecture, which is based on self-supervised training on unlabeled audio data. It is fine-tuned on the emotion recognition task using labeled audio data. The model can be used to classify the emotions in audio recordings, such as happiness, sadness, anger, etc.
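The classification step at the end of such a model maps raw class scores to an emotion label via softmax and argmax. A minimal sketch of that final step (the label set and logits below are illustrative, not the model's actual output head):

```python
import math

# Illustrative label set; IEMOCAP covers acted emotions such as these.
LABELS = ["neutral", "angry", "happy", "sad"]

def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_emotion(logits, labels=LABELS):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

label, prob = predict_emotion([0.2, 2.1, -0.5, 0.9])
print(label, round(prob, 3))  # "angry" is the argmax of these toy logits
```

The wav2vec2 encoder produces the logits from raw audio; everything before this step is learned, while the step itself is just the decision rule.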

$-/run

23.9K

Hugging Face

lang-id-voxlingua107-ecapa

VoxLingua107 ECAPA-TDNN Spoken Language Identification Model

Model description

This is a spoken language recognition model trained on the VoxLingua107 dataset using SpeechBrain. The model uses the ECAPA-TDNN architecture that has previously been used for speaker recognition; however, it uses more fully connected hidden layers after the embedding layer, and cross-entropy loss was used for training. We observed that this improved the performance of the extracted utterance embeddings for downstream tasks.

The system is trained with recordings sampled at 16 kHz (single channel). The code will automatically normalize your audio (i.e., resampling and mono channel selection) when calling classify_file, if needed.

The model classifies a speech utterance according to the language spoken. It covers 107 languages: Abkhazian, Afrikaans, Amharic, Arabic, Assamese, Azerbaijani, Bashkir, Belarusian, Bulgarian, Bengali, Tibetan, Breton, Bosnian, Catalan, Cebuano, Czech, Welsh, Danish, German, Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, Faroese, French, Galician, Guarani, Gujarati, Manx, Hausa, Hawaiian, Hindi, Croatian, Haitian, Hungarian, Armenian, Interlingua, Indonesian, Icelandic, Italian, Hebrew, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Latin, Luxembourgish, Lingala, Lao, Lithuanian, Latvian, Malagasy, Maori, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Nepali, Dutch, Norwegian Nynorsk, Norwegian, Occitan, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sanskrit, Scots, Sindhi, Sinhala, Slovak, Slovenian, Shona, Somali, Albanian, Serbian, Sundanese, Swedish, Swahili, Tamil, Telugu, Tajik, Thai, Turkmen, Tagalog, Turkish, Tatar, Ukrainian, Urdu, Uzbek, Vietnamese, Waray, Yiddish, Yoruba, Mandarin Chinese.
Intended uses & limitations

The model has two uses:
- use it "as is" for spoken language recognition
- use it as an utterance-level feature (embedding) extractor, for creating a dedicated language ID model on your own data

The model is trained on automatically collected YouTube data. To perform inference on the GPU, add run_opts={"device": "cuda"} when calling the from_hparams method. Make sure your input tensor has the expected sampling rate if you use encode_batch or classify_batch.

Since the model is trained on VoxLingua107, it has many limitations and biases, including:
- its accuracy on smaller languages is probably quite limited
- it probably works worse on female speech than on male speech, because the YouTube data contains much more male speech
- based on subjective experiments, it does not work well on speech with a foreign accent
- it probably does not work well on children's speech or on speakers with speech disorders

Training data

The model is trained on VoxLingua107, a speech dataset for training spoken language identification models. The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives. VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6,628 hours; the average amount per language is 62 hours, though the real amount per language varies a lot. There is also a separate development set containing 1,609 speech segments from 33 languages, validated by at least two volunteers to really contain the stated language.

Training procedure

See the SpeechBrain recipe.
Evaluation results

Error rate: 6.7% on the VoxLingua107 development dataset.

About SpeechBrain

SpeechBrain is an open-source, all-in-one speech toolkit. It is designed to be simple, extremely flexible, and user-friendly. Competitive or state-of-the-art performance is obtained in various domains.
Website: https://speechbrain.github.io/
GitHub: https://github.com/speechbrain/speechbrain
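The normalization that classify_file performs (mono channel selection plus resampling to 16 kHz) can be sketched as follows. This is a naive illustration only; SpeechBrain relies on proper resampling (e.g., via torchaudio) rather than the linear interpolation used here:

```python
def to_mono(channels):
    """Downmix multi-channel audio to mono by averaging the channels."""
    return [sum(frame) / len(frame) for frame in zip(*channels)]

def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (illustration only; real code
    uses a band-limited filter to avoid aliasing)."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) / src_rate * dst_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Stereo 44.1 kHz input -> mono 16 kHz, as the model expects.
left = [0.0, 0.2, 0.4, 0.6] * 1000
right = [0.0, 0.0, 0.0, 0.0] * 1000
mono = to_mono([left, right])
mono_16k = resample_linear(mono, 44100, 16000)
print(len(mono), len(mono_16k))
```

If you call encode_batch or classify_batch directly, this normalization is your responsibility, which is why the card asks you to check the sampling rate of your input tensor.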

$-/run

4.1K

Hugging Face

tts-hifigan-ljspeech

Vocoder with HiFi-GAN trained on LJSpeech

This repository provides all the necessary tools for using a HiFi-GAN vocoder trained on LJSpeech. The pretrained model takes a spectrogram as input and produces a waveform as output. Typically, a vocoder is used after a TTS model that converts input text into a spectrogram. The sampling frequency is 22050 Hz. We encourage you to read the SpeechBrain tutorials to learn more about SpeechBrain.

Inference on GPU

To perform inference on the GPU, add run_opts={"device": "cuda"} when calling the from_hparams method.

Training

The model was trained with SpeechBrain. To train it from scratch: clone SpeechBrain, install it, and run the training recipe.
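The relationship between the TTS model's spectrogram and the vocoder's waveform is fixed by the sample rate and the spectrogram hop length: the vocoder upsamples each spectrogram frame to hop_length waveform samples. A small sketch of that bookkeeping (the 256-sample hop length is an assumption for illustration; check the model's actual hyperparameters):

```python
SAMPLE_RATE = 22050   # from the model card
HOP_LENGTH = 256      # assumed hop length, for illustration only

def waveform_length(n_mel_frames, hop_length=HOP_LENGTH):
    """The vocoder turns each spectrogram frame into hop_length samples."""
    return n_mel_frames * hop_length

def duration_seconds(n_mel_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Audio duration implied by a spectrogram of n_mel_frames frames."""
    return waveform_length(n_mel_frames, hop_length) / sample_rate

frames = 430  # e.g., spectrogram frames produced by a TTS model for a short sentence
print(waveform_length(frames))            # total waveform samples
print(round(duration_seconds(frames), 2)) # roughly five seconds of audio
```

This is why a vocoder and TTS model must be trained with matching spectrogram settings; a mismatch changes the implied timing of the output audio.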

$-/run

3.7K

Hugging Face

sepformer-wsj02mix

SepFormer trained on WSJ0-2Mix

This repository provides all the necessary tools to perform audio source separation with a SepFormer model, implemented with SpeechBrain and pretrained on the WSJ0-2Mix dataset. For a better experience, we encourage you to learn more about SpeechBrain. The model's performance is 22.4 dB on the test set of WSJ0-2Mix.

Perform source separation on your own audio file

The system expects input recordings sampled at 8 kHz (single channel). If your signal has a different sample rate, resample it (e.g., using torchaudio or sox) before using the interface.

Inference on GPU

To perform inference on the GPU, add run_opts={"device": "cuda"} when calling the from_hparams method.

Training

The model was trained with SpeechBrain (commit fc2eabb7). To train it from scratch: clone SpeechBrain, install it, and run the training recipe.

Limitations

The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.

About SpeechBrain
Website: https://speechbrain.github.io/
Code: https://github.com/speechbrain/speechbrain/
HuggingFace: https://huggingface.co/speechbrain/
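Separation quality on WSJ0-2Mix is typically measured with the scale-invariant signal-to-noise ratio (SI-SNR), which is what dB figures like the one above usually refer to. A minimal sketch of that metric, using a generic textbook formulation rather than SpeechBrain's implementation:

```python
import math

def si_snr(estimate, target):
    """Scale-invariant signal-to-noise ratio, in dB."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / target_energy                       # optimal scaling factor
    s_target = [alpha * t for t in target]            # projection onto the target
    e_noise = [e - s for e, s in zip(estimate, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise)
    return 10 * math.log10(num / den)

target = [1.0, 2.0, 3.0, 2.0, 1.0]
noisy = [t + n for t, n in zip(target, [0.01, -0.02, 0.01, 0.02, -0.01])]
print(round(si_snr(noisy, target), 1))  # high: estimate is close to the target

# Scale invariance: multiplying the estimate by a constant changes nothing,
# so the metric ignores overall gain differences between estimate and target.
scaled = [2.5 * x for x in noisy]
print(round(si_snr(scaled, target), 1))
```

Reported results are usually the SI-SNR *improvement* over the unprocessed mixture, averaged across test utterances.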

$-/run

2.1K

Hugging Face
