[](#tts)XTTS
=============

XTTS is a voice generation model that lets you clone voices into different languages using just a 6-second audio clip. It does not need hours of training data.

This is the same model as, or a close relative of, the one that powers [Coqui Studio](https://coqui.ai/) and [Coqui API](https://docs.coqui.ai/docs).

### [](#features)Features

*   Supports 17 languages.
*   Voice cloning with just a 6-second audio clip.
*   Emotion and style transfer by cloning.
*   Cross-language voice cloning.
*   Multi-lingual speech generation.
*   24 kHz sampling rate.
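
The 6-second reference length and the 24 kHz rate listed above can be checked with Python's standard `wave` module. This is a minimal sketch, not part of the TTS library: the helper names (`clip_info`, `is_long_enough`, `make_test_tone`) are hypothetical, and it assumes an uncompressed PCM WAV file. It only measures the clip; it does not decide whether a shorter clip would still work.

```python
import io
import math
import struct
import wave

TARGET_SECONDS = 6.0      # reference length quoted in the feature list
OUTPUT_RATE_HZ = 24_000   # sampling rate quoted in the feature list


def clip_info(wav_file):
    """Return (duration_seconds, sample_rate_hz) for a PCM WAV path or file object."""
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
        return wav.getnframes() / rate, rate


def is_long_enough(wav_file, min_seconds=TARGET_SECONDS):
    """True when the clip lasts at least `min_seconds`."""
    duration, _ = clip_info(wav_file)
    return duration >= min_seconds


def make_test_tone(seconds, rate=OUTPUT_RATE_HZ, freq=440.0):
    """Build an in-memory mono 16-bit WAV holding a sine tone (demo input only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * i / rate)))
            for i in range(int(seconds * rate))
        )
        wav.writeframes(frames)
    buffer.seek(0)
    return buffer
```

In practice you would pass the path of your own reference recording to `is_long_enough` instead of a generated tone.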

### [](#updates-over-xtts-v1)Updates over XTTS-v1

*   2 new languages: Hungarian and Korean.
*   Architectural improvements for speaker conditioning.
*   Enables the use of multiple speaker references and interpolation between speakers.
*   Stability improvements.
*   Better prosody and audio quality across the board.
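
To make "multiple speaker references" and "interpolation between speakers" concrete, here is a minimal sketch of the underlying arithmetic on plain Python lists. It is an illustration only and an assumption on our part: the function names are hypothetical, the vectors stand in for whatever speaker representation the model computes, and this is not claimed to be how XTTS combines references internally.

```python
def interpolate_speakers(embedding_a, embedding_b, weight):
    """Blend two equal-length speaker vectors; weight=0 gives A, weight=1 gives B."""
    if len(embedding_a) != len(embedding_b):
        raise ValueError("speaker vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [(1.0 - weight) * a + weight * b for a, b in zip(embedding_a, embedding_b)]


def average_references(embeddings):
    """Combine several reference vectors for one speaker by taking their mean."""
    if not embeddings:
        raise ValueError("need at least one reference vector")
    count = len(embeddings)
    return [sum(values) / count for values in zip(*embeddings)]


# Halfway between two toy "speakers"
print(interpolate_speakers([0.0, 2.0], [4.0, 6.0], 0.5))
```

Sweeping `weight` from 0 to 1 moves the blended vector along a straight line from the first speaker to the second.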

### [](#languages)Languages

XTTS-v2 supports 17 languages: **English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), Hindi (hi)**.
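
The codes in parentheses are the values passed as the `language` argument. A small lookup table makes it easy to reject unsupported codes before starting a synthesis run. This is a sketch under our own naming: `XTTS_V2_LANGUAGES` and `normalize_language` are hypothetical helpers, not part of the TTS library.

```python
# Language codes copied from the list above
XTTS_V2_LANGUAGES = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "pl": "Polish", "tr": "Turkish",
    "ru": "Russian", "nl": "Dutch", "cs": "Czech", "ar": "Arabic",
    "zh-cn": "Chinese", "ja": "Japanese", "hu": "Hungarian", "ko": "Korean",
    "hi": "Hindi",
}


def normalize_language(code):
    """Return the lower-cased code, or raise ValueError if it is not in the list."""
    key = code.strip().lower()
    if key not in XTTS_V2_LANGUAGES:
        supported = ", ".join(sorted(XTTS_V2_LANGUAGES))
        raise ValueError(f"unsupported language {code!r}; expected one of: {supported}")
    return key
```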

Stay tuned as we continue to add support for more languages. If you have any language requests, feel free to reach out!

### [](#code)Code

The [code-base](https://github.com/coqui-ai/TTS) supports inference and [fine-tuning](https://tts.readthedocs.io/en/latest/models/xtts.html#training).

### [](#demo-spaces)Demo Spaces

*   [XTTS Space](https://huggingface.co/spaces/coqui/xtts): See how the model performs on the supported languages, and try it with your own reference clip or microphone input.
*   [XTTS Voice Chat with Mistral or Zephyr](https://huggingface.co/spaces/coqui/voice-chat-with-mistral): Experience streaming voice chat with Mistral 7B Instruct or Zephyr 7B Beta.

 **CoquiTTS**

[coqui/TTS on Github](https://github.com/coqui-ai/TTS)

 **Documentation**

[ReadTheDocs](https://tts.readthedocs.io/en/latest/)

 **Questions**

[GitHub Discussions](https://github.com/coqui-ai/TTS/discussions)

 **Community**

[Discord](https://discord.gg/5eXr5seRrv)

### [](#license)License

This model is licensed under [Coqui Public Model License](https://coqui.ai/cpml). There's a lot that goes into a license for generative models, and you can read more of [the origin story of CPML here](https://coqui.ai/blog/tts/cpml).

### [](#contact)Contact

Come and join our community. We're active on [Discord](https://discord.gg/fBC58unbKE) and [Twitter](https://twitter.com/coqui_ai). You can also mail us at [info@coqui.ai](mailto:info@coqui.ai).

Using TTS API:

    from TTS.api import TTS
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
    
    # generate speech by cloning a voice using default settings
    tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
                    file_path="output.wav",
                    speaker_wav="/path/to/target/speaker.wav",
                    language="en")
    

Using TTS Command line:

     tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
         --text "Bugün okula gitmek istemiyorum." \
         --speaker_wav /path/to/target/speaker.wav \
         --language_idx tr \
         --use_cuda true
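
If you want to drive the same command from a Python script, one option is to build the argument list and hand it to `subprocess`. The sketch below only assembles the arguments shown above; `build_tts_command` is a hypothetical helper name, and the choice to omit `--use_cuda` entirely for CPU runs is an assumption made to avoid relying on how the CLI parses a false value.

```python
import shlex


def build_tts_command(text, speaker_wav, language, use_cuda=True,
                      model_name="tts_models/multilingual/multi-dataset/xtts_v2"):
    """Return the argv list for the `tts` command shown above."""
    argv = [
        "tts",
        "--model_name", model_name,
        "--text", text,
        "--speaker_wav", str(speaker_wav),
        "--language_idx", language,
    ]
    if use_cuda:
        # Only pass the flag when a GPU run is requested
        argv += ["--use_cuda", "true"]
    return argv


argv = build_tts_command("Bugün okula gitmek istemiyorum.",
                         "/path/to/target/speaker.wav", "tr")
print(shlex.join(argv))  # shell-quoted form, handy for logging
```

To actually run it, pass the list to `subprocess.run(argv, check=True)` on a machine where the `tts` command is installed.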
    

Using the model directly:

    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts
    
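    # Load the model configuration shipped with the checkpoint
    config = XttsConfig()
    config.load_json("/path/to/xtts/config.json")
    # Build the model and load its weights in evaluation mode
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", eval=True)
    # Move the model to the GPU (requires a CUDA device)
    model.cuda()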
    
    outputs = model.synthesize(
        "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
        config,
        speaker_wav="/data/TTS-public/_refclips/3.wav",
        gpt_cond_len=3,
        language="en",
    )
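
The snippet above stops once `outputs` is returned. A minimal sketch for writing the result to disk with the standard library follows. It rests on assumptions that are ours, not the source's: that `outputs["wav"]` holds mono floating-point samples in the range [-1.0, 1.0] and that the 24 kHz rate from the feature list applies. `save_wav` is a hypothetical helper, not a TTS library function.

```python
import struct
import wave


def save_wav(path, samples, sample_rate=24_000):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM; return the frame count."""
    pcm = bytearray()
    for sample in samples:
        # Clip out-of-range values so they cannot overflow the 16-bit range
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return len(pcm) // 2
```

Under those assumptions, `save_wav("output.wav", outputs["wav"])` would store the synthesized audio next to your script.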

## Model overview

`XTTS-v2` is a text-to-speech (TTS) model developed by [Coqui](https://aimodels.fyi/creators/huggingFace/coqui). It is an improved version of their previous `xtts-v1` model and, like it, clones a voice from a short reference clip of about 6 seconds. It supports 17 languages, including English, Spanish, French, German, and Italian.

Unlike speech recognition models such as [Whisper](https://aimodels.fyi/models/huggingFace/whisper-tiny-openai), which transcribe audio into text, `XTTS-v2` works in the opposite direction and generates synthetic speech from text. It can also perform emotion and style transfer by cloning voices, as well as cross-language voice cloning.

## Model inputs and outputs

### Inputs
- **Audio clip**: A 6-second audio clip used to clone the voice
- **Text**: The text to be converted to speech

### Outputs
- **Synthesized speech**: High-quality, natural-sounding speech in the cloned voice

## Capabilities

`XTTS-v2` can generate speech in 17 languages and clone a voice from a 6-second audio sample. This makes it useful for applications such as audio dubbing, narration, and voice-based user interfaces. The model also supports emotion and style transfer by cloning, so the tone and expression of the generated speech follow the reference clip.

## What can I use it for?

`XTTS-v2` could be used in a wide range of applications, from creating custom audiobooks and podcasts to building voice-controlled assistants and translation services. Its ability to clone voices could be particularly useful for dubbing foreign language content or creating personalized audio experiences.

The model is available through the [Coqui API](https://docs.coqui.ai/docs) and can be integrated into a variety of projects and platforms. Coqui also provides a [demo space](https://huggingface.co/spaces/coqui/xtts) where users can try out the model and explore its capabilities.

## Things to try

One interesting aspect of `XTTS-v2` is its ability to perform cross-language voice cloning. This means you can clone a voice in one language and use it to generate speech in a different language. This could be useful for creating multilingual content or for providing language accessibility features.

Another interesting feature is the model's support for emotion and style transfer. By using different reference audio clips, you can make the generated speech sound more expressive, excited, or even somber. This could be useful for creating more engaging and natural-sounding audio content.
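
Both experiments amount to a grid: the same text rendered for every combination of reference clip and target language. The sketch below builds that grid as keyword arguments for `tts_to_file`, the method used in the API example earlier. The helper names (`build_jobs`, `run_jobs`), the clip file names, and the output layout are hypothetical; `tts` is assumed to be an already loaded `TTS.api.TTS` instance.

```python
from pathlib import Path


def build_jobs(texts_by_language, reference_clips, out_dir="outputs"):
    """Return one keyword-argument dict per (reference clip, language) pair."""
    jobs = []
    for clip in reference_clips:
        for language, text in texts_by_language.items():
            name = f"{Path(clip).stem}_{language}.wav"
            jobs.append({
                "text": text,
                "speaker_wav": str(clip),
                "language": language,
                "file_path": str(Path(out_dir) / name),
            })
    return jobs


def run_jobs(tts, jobs):
    """Call tts.tts_to_file once per job and return the output paths."""
    for job in jobs:
        tts.tts_to_file(**job)
    return [job["file_path"] for job in jobs]
```

For example, `build_jobs({"en": "Hello.", "tr": "Merhaba."}, ["calm.wav", "excited.wav"])` yields four jobs, which lets you compare how a calm and an excited reference clip carry over into each language.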

Overall, `XTTS-v2` is a powerful and versatile TTS model that could be a valuable tool for a wide range of applications. Its ability to clone voices with minimal training data and its multilingual capabilities make it a compelling option for developers and content creators alike.

[](#tts)XTTS
=============

XTTS is a voice generation model that lets you clone voices into different languages using just a 6-second audio clip. Built on Tortoise, XTTS includes model changes that make cross-language voice cloning and multilingual speech generation straightforward. It does not need hours of training data.

This is the same model that powers [Coqui Studio](https://coqui.ai/) and [Coqui API](https://docs.coqui.ai/docs); however, we apply a few tricks to make it faster and support streaming inference.

[](#note-tts-v2-model-is-out-here-xtts-v2)NOTE: The XTTS V2 model is out here: [XTTS V2](https://huggingface.co/coqui/XTTS-v2)
--------------------------------------------------------------------------------------------------------------------------

### [](#features)Features

*   Supports 14 languages.
*   Voice cloning with just a 6-second audio clip.
*   Emotion and style transfer by cloning.
*   Cross-language voice cloning.
*   Multi-lingual speech generation.
*   24 kHz sampling rate.

### [](#languages)Languages

As of now, XTTS-v1 (v1.1) supports 14 languages: **English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, and Japanese**.

Stay tuned as we continue to add support for more languages. If you have any language requests, please feel free to reach out!

### [](#code)Code

The current implementation supports inference and [fine-tuning](https://tts.readthedocs.io/en/latest/models/xtts.html#training).

### [](#license)License

This model is licensed under [Coqui Public Model License](https://coqui.ai/cpml). There's a lot that goes into a license for generative models, and you can read more of [the origin story of CPML here](https://coqui.ai/blog/tts/cpml).

### [](#contact)Contact

Come and join our community. We're active on [Discord](https://discord.gg/fBC58unbKE) and [Twitter](https://twitter.com/coqui_ai). You can also mail us at [info@coqui.ai](mailto:info@coqui.ai).

Using TTS API:

    from TTS.api import TTS
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v1", gpu=True)
    
    # generate speech by cloning a voice using default settings
    tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
                    file_path="output.wav",
                    speaker_wav="/path/to/target/speaker.wav",
                    language="en")
    
    # generate speech by cloning a voice using custom settings
    tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
                    file_path="output.wav",
                    speaker_wav="/path/to/target/speaker.wav",
                    language="en",
                    decoder_iterations=30)
    

Using TTS Command line:

     tts --model_name tts_models/multilingual/multi-dataset/xtts_v1 \
         --text "Bugün okula gitmek istemiyorum." \
         --speaker_wav /path/to/target/speaker.wav \
         --language_idx tr \
         --use_cuda true
    

Using the model directly:

    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts
    
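    # Load the model configuration shipped with the checkpoint
    config = XttsConfig()
    config.load_json("/path/to/xtts/config.json")
    # Build the model and load its weights in evaluation mode
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", eval=True)
    # Move the model to the GPU (requires a CUDA device)
    model.cuda()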
    
    outputs = model.synthesize(
        "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
        config,
        speaker_wav="/data/TTS-public/_refclips/3.wav",
        gpt_cond_len=3,
        language="en",
    )

## Model overview

The `XTTS-v1` is a Text-to-Speech (TTS) model developed by [Coqui](https://aimodels.fyi/creators/huggingFace/coqui) that allows for voice cloning and multi-lingual speech generation. It is a powerful model that can generate high-quality speech from just a 6-second audio clip, enabling voice cloning, cross-language voice cloning, and emotion/style transfer. The model supports 14 languages out-of-the-box, including English, Spanish, French, German, and others. 

Similar models include the [XTTS-v2](https://aimodels.fyi/models/huggingFace/xtts-v2-coqui), which adds support for 17 languages and includes architectural improvements for better speaker conditioning, stability, prosody, and audio quality. Another similar model is [XTTS-v1](https://aimodels.fyi/models/huggingFace/xtts-v1-pagebrain) from Pagebrain, which can clone voices from just a 3-second audio clip. Microsoft's [SpeechT5 TTS](https://aimodels.fyi/models/huggingFace/speecht5tts-microsoft) model is a unified encoder-decoder model for various speech tasks including TTS.

## Model inputs and outputs

The `XTTS-v1` model takes text as input and generates high-quality audio as output. The input text can be in any of the 14 supported languages, and the model will generate the corresponding speech in that language. 

### Inputs
- **Text**: The text to be converted to speech, in one of the 14 supported languages.
- **Speaker audio**: A 6-second audio clip of the target speaker's voice, used for voice cloning.

### Outputs
- **Audio**: The generated speech audio, at a 24kHz sampling rate.

## Capabilities

The `XTTS-v1` model has several impressive capabilities, including:

- **Voice cloning**: The model can clone a speaker's voice using just a 6-second audio clip, enabling customized TTS.
- **Cross-language voice cloning**: The model can clone a voice and use it to generate speech in a different language.
- **Multi-lingual speech generation**: The model can generate high-quality speech in any of the 14 supported languages.
- **Emotion and style transfer**: The model can transfer the emotion and speaking style from the target speaker's voice.

## What can I use it for?

The `XTTS-v1` model has a wide range of potential applications, particularly in areas that require customized or multi-lingual TTS. Some ideas include:

- **Assistive technologies**: Generating personalized speech output for accessibility tools, smart speakers, or virtual assistants.
- **Audiobook and podcast production**: Creating high-quality, customized narration in multiple languages.
- **Dubbing and localization**: Translating and re-voicing content for international audiences.
- **Voice user interfaces**: Building conversational interfaces with natural-sounding, multi-lingual speech.
- **Media production**: Generating synthetic speech for animation, video games, or other media.

## Things to try

One interesting aspect of the `XTTS-v1` model is its ability to perform cross-language voice cloning. You could try using the model to generate speech in a language different from the target speaker's voice, exploring how well the model can preserve the speaker's characteristics while translating to a new language.

Another interesting experiment would be to test the model's emotion and style transfer capabilities. You could try using the model to generate speech that mimics the emotional tone or speaking style of the target speaker, even if the input text is quite different from the training data.

Overall, the `XTTS-v1` model offers a powerful and flexible TTS solution, with a range of capabilities that could be applied to many different use cases.