MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities:

*   Emotional speech rhythm and tone in English. No hallucinations.
*   Support for voice cloning with finetuning.
    *   We have had success with as little as 1 minute of training data for Indian speakers.
*   Zero-shot cloning for American & British voices, with 30s reference audio.
*   Support for long-form synthesis.

We're releasing MetaVoice-1B under the Apache 2.0 license; _it can be used without restrictions_.

Usage
---------------

See [GitHub](https://github.com/metavoiceio/metavoice-src) for the latest usage instructions.

Finetuning
-------------------------

See [GitHub](https://github.com/metavoiceio/metavoice-src?tab=readme-ov-file#finetuning) for the latest finetuning instructions.

Soon
-------------

*   Long form / arbitrary length TTS
*   Streaming

Architecture
-----------------------------

We predict EnCodec tokens from text and speaker information. These tokens are then diffused up to the waveform level, and post-processing is applied to clean up the audio.

*   We use a causal GPT to predict the first two hierarchies of EnCodec tokens. Text and audio are part of the LLM context. Speaker information is passed via conditioning at the token embedding layer. This speaker conditioning is obtained from a separately trained speaker verification network.
    *   The two hierarchies are predicted in a "flattened interleaved" manner: we predict the first token of the first hierarchy, then the first token of the second hierarchy, then the second token of the first hierarchy, and so on.
    *   We use condition-free sampling to boost the cloning capability of the model.
    *   The text is tokenised using a custom-trained BPE tokeniser with 512 tokens.
    *   Note that we've skipped predicting semantic tokens as done in other works, as we found that this isn't strictly necessary.
*   We use a non-causal (encoder-style) transformer to predict the remaining 6 hierarchies from the first two hierarchies. This is a very small model (~10M parameters), and it generalises zero-shot to most speakers we've tried. Since it's non-causal, we're also able to predict all the timesteps in parallel.
*   We use multi-band diffusion to generate waveforms from the EnCodec tokens. We found that the resulting speech is clearer than with the original RVQ decoder or Vocos. However, diffusion at the waveform level leaves some background artifacts which are quite unpleasant to the ear. We clean these up in the next step.
*   We use DeepFilterNet to clean up the artifacts introduced by the multi-band diffusion.
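
The "flattened interleaved" ordering of the first two EnCodec hierarchies described above can be illustrated with a minimal sketch. The function names `flatten_interleave` and `deinterleave` are hypothetical and are not taken from the MetaVoice codebase. The sketch uses plain Python lists of token ids and ignores details such as any per-hierarchy vocabulary offsets the real model may apply.

```python
from typing import List, Sequence, Tuple


def flatten_interleave(first: Sequence[int], second: Sequence[int]) -> List[int]:
    """Merge two equal-length token streams into one flat sequence.

    Output order: first[0], second[0], first[1], second[1], ...
    (Hypothetical helper for illustration only.)
    """
    if len(first) != len(second):
        raise ValueError("both hierarchies must have the same number of timesteps")
    flat: List[int] = []
    for token_h1, token_h2 in zip(first, second):
        flat.append(token_h1)  # token of the first hierarchy at this timestep
        flat.append(token_h2)  # token of the second hierarchy at this timestep
    return flat


def deinterleave(flat: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Split a flat interleaved sequence back into its two hierarchies."""
    if len(flat) % 2 != 0:
        raise ValueError("interleaved sequence must have even length")
    # Even positions hold the first hierarchy, odd positions the second.
    return list(flat[0::2]), list(flat[1::2])


if __name__ == "__main__":
    h1 = [1, 2, 3]
    h2 = [10, 20, 30]
    flat = flatten_interleave(h1, h2)
    print(flat)
    print(deinterleave(flat))
```

With this ordering, a causal model predicting one token at a time sees the first-hierarchy token for a timestep before it predicts the second-hierarchy token for the same timestep.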

Optimizations
-------------------------------

The model supports:

1.  KV-caching via Flash Decoding
2.  Batching (including texts of different lengths)
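
Batching texts of different lengths generally means bringing the token sequences to a common length and tracking which positions are real. The sketch below shows one common approach, right-padding with an attention mask. It is an assumption for illustration and not necessarily how MetaVoice implements batching; `pad_batch` and `pad_id` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    sequences: Sequence[Sequence[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token sequences to the length of the longest one.

    Returns (padded, mask), where mask is 1 for real tokens and 0 for padding.
    (Hypothetical helper; the padding side and pad id are assumptions.)
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    padded: List[List[int]] = []
    mask: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


if __name__ == "__main__":
    tokens, attention_mask = pad_batch([[5, 6, 7], [8]])
    print(tokens)
    print(attention_mask)
```

The mask lets the model ignore padded positions, so shorter texts in a batch are not affected by the padding added to match longer ones.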

## Model overview

`metavoice-1B-v0.1` is a 1.2B parameter base model for text-to-speech (TTS), trained by [metavoiceio](https://aimodels.fyi/creators/huggingFace/metavoiceio) on 100K hours of speech. It has been built with a focus on emotional speech rhythm and tone in English, without hallucinations. The model supports voice cloning via finetuning, with as little as 1 minute of training data for Indian speakers, and zero-shot cloning for American and British voices with 30 seconds of reference audio. It also handles long-form synthesis.

Similar models include [metavoice](https://aimodels.fyi/models/huggingFace/metavoice-camenduru) by camenduru, and [WhisperSpeech](https://aimodels.fyi/models/huggingFace/whisperspeech-collabora) by collabora, which is an open-source text-to-speech system built by inverting Whisper.

## Model inputs and outputs

### Inputs
- Text prompts for TTS generation

### Outputs
- Synthesized speech audio in a variety of voices and emotional tones

## Capabilities

`metavoice-1B-v0.1` can generate emotional and expressive speech from text inputs, and it can clone American and British voices zero-shot from 30 seconds of reference audio. It supports long-form synthesis, making it suitable for generating speech for extended passages of text.

## What can I use it for?

The `metavoice-1B-v0.1` model can be used to create personalized TTS applications, such as audiobook narration, podcast generation, or virtual assistant voices. Its voice cloning capabilities allow speech output to be customized to a particular speaker. Developers could integrate the model into their applications to provide emotionally expressive speech synthesis.

## Things to try

Experiment with the model's ability to clone different accents and voices, even with minimal reference audio. Try generating long-form speech passages and observe the consistency and expressiveness of the output. Explore the model's robustness to different text inputs and genres, from formal to casual language.