[](#bark)Bark
=============

Bark is a transformer-based text-to-audio model created by [Suno](https://www.suno.ai). Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints ready for inference.

The original GitHub repo and model card can be found [here](https://github.com/suno-ai/bark).

This model is meant for research purposes only. The model output is not censored and the authors do not endorse the opinions in the generated content. Use at your own risk.

Two checkpoints are released:

*   [small](https://huggingface.co/suno/bark-small)
*   [**large** (this checkpoint)](https://huggingface.co/suno/bark)

[](#example)Example
-------------------

Try out Bark yourself!

*   Bark Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing)

*   Hugging Face Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing)

*   Hugging Face Demo:

[![Open in HuggingFace](https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/suno/bark)

[](#-transformers-usage) Transformers Usage
-----------------------------------------------

You can run Bark locally with the Transformers library, version 4.31.0 or later.

1.  First install the [Transformers library](https://github.com/huggingface/transformers) and scipy:

    pip install --upgrade pip
    pip install --upgrade transformers scipy
    

2.  Run inference via the `text-to-speech` (TTS) pipeline. The pipeline runs Bark in just a few lines of code:

    from transformers import pipeline
    import scipy
    
    synthesiser = pipeline("text-to-speech", "suno/bark")
    
    speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"do_sample": True})
    
    scipy.io.wavfile.write("bark_out.wav", rate=speech["sampling_rate"], data=speech["audio"])
    

3.  Run inference via the Transformers modelling code. For more fine-grained control, use the processor and `generate` to convert text into a mono 24 kHz speech waveform.

    from transformers import AutoProcessor, AutoModel
    
    processor = AutoProcessor.from_pretrained("suno/bark")
    model = AutoModel.from_pretrained("suno/bark")
    
    inputs = processor(
        text=["Hello, my name is Suno. And, uh  and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
        return_tensors="pt",
    )
    
    speech_values = model.generate(**inputs, do_sample=True)
    

4.  Listen to the speech samples either in an ipynb notebook:

    from IPython.display import Audio
    
    sampling_rate = model.generation_config.sample_rate
    Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
    

Or save them as a `.wav` file using a third-party library, e.g. `scipy`:

    import scipy
    
    sampling_rate = model.generation_config.sample_rate
    scipy.io.wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_values.cpu().numpy().squeeze())
    

For more details on running Bark inference with the Transformers library, refer to the [Bark docs](https://huggingface.co/docs/transformers/model_doc/bark).
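
The WAV-saving step above writes the waveform in whatever dtype the array has, which is floating point for Bark output, and some audio tools may expect 16-bit PCM instead. The following is a minimal sketch of a converter that uses only the Python standard library. `save_pcm16_wav` is a hypothetical helper name, not part of Transformers or Bark, and it assumes a mono waveform with samples in the range [-1, 1].

```python
import array
import sys
import wave


def save_pcm16_wav(path, samples, sample_rate=24_000):
    """Write mono float samples in [-1, 1] to `path` as a 16-bit PCM WAV file.

    Hypothetical helper for illustration; returns the number of frames written.
    """
    # Clip to [-1, 1] so out-of-range samples saturate instead of wrapping.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    # WAV sample data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        pcm.byteswap()
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)
```

With the variables from the steps above, a call might look like `save_pcm16_wav("bark_out.wav", speech_values.cpu().numpy().squeeze(), sampling_rate)`.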

[](#suno-usage)Suno Usage
-------------------------

You can also run Bark locally through the original [Bark library](https://github.com/suno-ai/bark):

1.  First install the [`bark` library](https://github.com/suno-ai/bark)
    
2.  Run the following Python code:
    

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from IPython.display import Audio
    
    # download and load all models
    preload_models()
    
    # generate audio from text
    text_prompt = """
         Hello, my name is Suno. And, uh  and I like pizza. [laughs] 
         But I also have other interests such as playing tic tac toe.
    """
    speech_array = generate_audio(text_prompt)
    
    # play text in notebook
    Audio(speech_array, rate=SAMPLE_RATE)
    

[pizza.webm](https://user-images.githubusercontent.com/5068315/230490503-417e688d-5115-4eee-9550-b46a2b465ee3.webm)

To save `speech_array` as a WAV file:

    from scipy.io.wavfile import write as write_wav
    
    write_wav("/path/to/audio.wav", SAMPLE_RATE, speech_array)
    

[](#model-details)Model Details
-------------------------------

The following is additional information about the models released here.

Bark is a series of three transformer models that turn text into audio.

### [](#text-to-semantic-tokens)Text to semantic tokens

*   Input: text, tokenized with [BERT tokenizer from Hugging Face](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer)
*   Output: semantic tokens that encode the audio to be generated

### [](#semantic-to-coarse-tokens)Semantic to coarse tokens

*   Input: semantic tokens
*   Output: tokens from the first two codebooks of the [EnCodec codec](https://github.com/facebookresearch/encodec) from Facebook Research

### [](#coarse-to-fine-tokens)Coarse to fine tokens

*   Input: the first two codebooks from EnCodec
*   Output: 8 codebooks from EnCodec

### [](#architecture)Architecture

| Model | Parameters | Attention | Output vocab size |
|---|---|---|---|
| Text to semantic tokens | 80/300 M | Causal | 10,000 |
| Semantic to coarse tokens | 80/300 M | Causal | 2x 1,024 |
| Coarse to fine tokens | 80/300 M | Non-causal | 6x 1,024 |
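
The three stages and their codebook counts can be summarised in a few lines of Python. This is only a sketch of the bookkeeping, not Bark's implementation: the `Stage` dataclass and the `codebooks_after` helper are hypothetical names, and the numbers are copied from the architecture listing above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    causal: bool
    new_codebooks: int      # EnCodec codebooks this stage adds (0 = none)
    vocab_per_output: int   # output vocabulary size per predicted token


# Values copied from the architecture listing; names are illustrative only.
STAGES = (
    Stage("text_to_semantic", causal=True, new_codebooks=0, vocab_per_output=10_000),
    Stage("semantic_to_coarse", causal=True, new_codebooks=2, vocab_per_output=1_024),
    Stage("coarse_to_fine", causal=False, new_codebooks=6, vocab_per_output=1_024),
)


def codebooks_after(stage_name):
    """Return how many EnCodec codebooks exist once `stage_name` has run."""
    total = 0
    for stage in STAGES:
        total += stage.new_codebooks
        if stage.name == stage_name:
            return total
    raise KeyError(stage_name)


print(codebooks_after("semantic_to_coarse"))  # 2
print(codebooks_after("coarse_to_fine"))      # 8
```

The second stage produces the two coarse codebooks and the third adds six more, which matches the 8 EnCodec codebooks listed as the final output.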

### [](#release-date)Release date

April 2023

[](#broader-implications)Broader Implications
---------------------------------------------

We anticipate that this model's text-to-audio capabilities can be used to improve accessibility tools in a variety of languages.

While we hope that this release will enable users to express their creativity and build applications that are a force for good, we acknowledge that any text to audio model has the potential for dual use. While it is not straightforward to voice clone known people with Bark, it can still be used for nefarious purposes. To further reduce the chances of unintended use of Bark, we also release a simple classifier to detect Bark-generated audio with high accuracy (see notebooks section of the main repository).

## Model overview

`Bark` is a transformer-based text-to-audio model created by [Suno](https://aimodels.fyi/creators/huggingFace/suno). Bark can generate highly realistic, multilingual speech as well as other audio, including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. Models such as [whisper-tiny](https://aimodels.fyi/models/huggingFace/whisper-tiny-openai) and [parakeet-rnnt-1.1b](https://aimodels.fyi/models/huggingFace/parakeet-rnnt-11b-nvlabs) are speech-recognition models that convert audio to text; Bark works in the opposite direction, from text to audio, and covers a wider range of output than speech alone.

## Model inputs and outputs

The `Bark` model takes text as input and generates corresponding audio as output. It can produce speech in multiple languages, as well as non-verbal sounds and audio effects.

### Inputs
- **Text**: The text to be converted to audio. This can be in any language supported by the model.

### Outputs
- **Audio**: The generated audio corresponding to the input text. This can be speech, ambient sounds, music, or other audio effects.

## Capabilities

Bark demonstrates the ability to generate highly realistic and expressive audio outputs. Beyond just speech synthesis, the model can create a diverse range of audio, including background noise, laughter, sighs, and even simple musical elements. This versatility allows Bark to be used for a variety of applications, from virtual assistants to audio production.

## What can I use it for?

The `Bark` model could be used to create interactive voice experiences, such as virtual assistants or audio-based storytelling. Its ability to generate non-verbal sounds could also make it useful for enhancing the realism of video game characters or animating digital avatars. Additionally, Bark's text-to-speech capabilities could aid in accessibility by converting text to audio for the visually impaired.

## Things to try

One interesting aspect of Bark is its ability to generate diverse non-speech audio. You could experiment with prompting the model to create different types of ambient sounds, like wind, rain, or nature noises, to enhance virtual environments. Additionally, you could try generating audio with emotional expressions, such as laughter or sighs, to bring more life and personality to digital characters.
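
One way to run these experiments is to build prompts programmatically. The sketch below is a small prompt builder under stated assumptions: `[laughs]` appears in the example prompts in this document, while the other tags and the ♪ lyric markers come from the upstream Bark README and are hints whose effect is not guaranteed. `with_tag` and `as_lyrics` are hypothetical helper names, not part of any Bark API.

```python
# Tags listed in the upstream Bark README; treat them as hints, not guarantees.
KNOWN_TAGS = ("[laughter]", "[laughs]", "[sighs]", "[music]", "[gasps]", "[clears throat]")


def with_tag(text, tag):
    """Append a nonverbal tag such as "[laughs]" to a sentence."""
    if tag not in KNOWN_TAGS:
        raise ValueError(f"unknown tag: {tag}")
    return f"{text.strip()} {tag}"


def as_lyrics(text):
    """Wrap text in ♪ markers, which the Bark README suggests for song lyrics."""
    return f"♪ {text.strip()} ♪"


prompt = with_tag("I like pizza.", "[laughs]") + " " + as_lyrics("In the jungle, the mighty jungle")
print(prompt)  # I like pizza. [laughs] ♪ In the jungle, the mighty jungle ♪
```

The resulting string can be passed as the text input to any of the inference snippets in this document.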

[](#bark)Bark
=============

Bark is a transformer-based text-to-audio model created by [Suno](https://www.suno.ai). Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints ready for inference.

The original GitHub repo and model card can be found [here](https://github.com/suno-ai/bark).

This model is meant for research purposes only. The model output is not censored and the authors do not endorse the opinions in the generated content. Use at your own risk.

Two checkpoints are released:

*   [**small** (this checkpoint)](https://huggingface.co/suno/bark-small)
*   [large](https://huggingface.co/suno/bark)

[](#example)Example
-------------------

Try out Bark yourself!

*   Bark Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing)

*   Hugging Face Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing)

*   Hugging Face Demo:

[![Open in HuggingFace](https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/suno/bark)

[](#-transformers-usage) Transformers Usage
-----------------------------------------------

You can run Bark locally with the Transformers library, version 4.31.0 or later.

1.  First install the [Transformers library](https://github.com/huggingface/transformers) and scipy:

    pip install --upgrade pip
    pip install --upgrade transformers scipy
    

2.  Run inference via the `text-to-speech` (TTS) pipeline. The pipeline runs Bark in just a few lines of code:

    from transformers import pipeline
    import scipy
    
    synthesiser = pipeline("text-to-speech", "suno/bark-small")
    
    speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"do_sample": True})
    
    scipy.io.wavfile.write("bark_out.wav", rate=speech["sampling_rate"], data=speech["audio"])
    

3.  Run inference via the Transformers modelling code. For more fine-grained control, use the processor and `generate` to convert text into a mono 24 kHz speech waveform.

    from transformers import AutoProcessor, AutoModel
    
    processor = AutoProcessor.from_pretrained("suno/bark-small")
    model = AutoModel.from_pretrained("suno/bark-small")
    
    inputs = processor(
        text=["Hello, my name is Suno. And, uh  and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
        return_tensors="pt",
    )
    
    speech_values = model.generate(**inputs, do_sample=True)
    

4.  Listen to the speech samples either in an ipynb notebook:

    from IPython.display import Audio
    
    sampling_rate = model.generation_config.sample_rate
    Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
    

Or save them as a `.wav` file using a third-party library, e.g. `scipy`:

    import scipy
    
    sampling_rate = model.generation_config.sample_rate
    scipy.io.wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_values.cpu().numpy().squeeze())
    

For more details on running Bark inference with the Transformers library, refer to the [Bark docs](https://huggingface.co/docs/transformers/model_doc/bark).

### [](#optimization-tips)Optimization tips

Refer to this [blog post](https://huggingface.co/blog/optimizing-bark#benchmark-results) to find out more about the following methods and a benchmark of their benefits.

#### [](#get-significant-speed-ups)Get significant speed-ups:

**Using Better Transformer**

Better Transformer is an Optimum feature that performs kernel fusion under the hood. You can gain 20% to 30% in speed with no degradation in output quality. Exporting the model to Better Transformer takes one line of code:

    model = model.to_bettertransformer()
    

Note that Optimum must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/optimum/installation)

**Using Flash Attention 2**

Flash Attention 2 is an even faster, optimized version of the previous optimization.

    model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16, use_flash_attention_2=True).to(device)
    

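Make sure to load your model in half-precision (e.g. `torch.float16`) and to [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2. The snippet assumes that `torch` and `BarkModel` (from `transformers`) are already imported and that `device` names your GPU, e.g. `"cuda"`.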

**Note:** Flash Attention 2 is only available on newer GPUs. Fall back to Better Transformer if your GPU doesn't support it.

#### [](#reduce-memory-footprint)Reduce memory footprint:

**Using half-precision**

You can speed up inference and reduce memory footprint by 50% simply by loading the model in half-precision (e.g. `torch.float16`).
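
The 50% figure follows from storage size alone: float32 uses 4 bytes per parameter and float16 uses 2. Here is a back-of-the-envelope sketch, assuming the "80/300 M" entries in the architecture listing mean roughly 80 M parameters per transformer sub-model for the small checkpoint, and ignoring the codec, activations and framework overhead. `weight_memory_mb` is a hypothetical helper name.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_mb(num_params, dtype):
    """Approximate memory for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Assumption: three transformer sub-models of about 80 M parameters each.
small_params = 3 * 80_000_000

fp32 = weight_memory_mb(small_params, "float32")
fp16 = weight_memory_mb(small_params, "float16")
print(f"float32: {fp32:.0f} MB, float16: {fp16:.0f} MB, saving: {1 - fp16 / fp32:.0%}")
# float32: 960 MB, float16: 480 MB, saving: 50%
```

Actual memory use will be higher than these weight-only numbers, but the ratio between the two precisions stays the same for the weights.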

**Using CPU offload**

Bark is made up of 4 sub-models (the three transformer models plus the EnCodec codec), which are called sequentially during audio generation. In other words, while one sub-model is in use, the others are idle.

If you're using a CUDA device, a simple way to cut the memory footprint by 80% is to move sub-models off the GPU and onto the CPU while they're idle. This operation is called CPU offloading. You can enable it with one line of code.

    model.enable_cpu_offload()
    

Note that Accelerate must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/accelerate/basic_tutorials/install)

[](#suno-usage)Suno Usage
-------------------------

You can also run Bark locally through the original [Bark library](https://github.com/suno-ai/bark):

1.  First install the [`bark` library](https://github.com/suno-ai/bark)
    
2.  Run the following Python code:
    

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from IPython.display import Audio
    
    # download and load all models
    preload_models()
    
    # generate audio from text
    text_prompt = """
         Hello, my name is Suno. And, uh  and I like pizza. [laughs] 
         But I also have other interests such as playing tic tac toe.
    """
    speech_array = generate_audio(text_prompt)
    
    # play text in notebook
    Audio(speech_array, rate=SAMPLE_RATE)
    

[pizza.webm](https://user-images.githubusercontent.com/5068315/230490503-417e688d-5115-4eee-9550-b46a2b465ee3.webm)

To save `speech_array` as a WAV file:

    from scipy.io.wavfile import write as write_wav
    
    write_wav("/path/to/audio.wav", SAMPLE_RATE, speech_array)
    

[](#model-details)Model Details
-------------------------------

The following is additional information about the models released here.

Bark is a series of three transformer models that turn text into audio.

### [](#text-to-semantic-tokens)Text to semantic tokens

*   Input: text, tokenized with [BERT tokenizer from Hugging Face](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer)
*   Output: semantic tokens that encode the audio to be generated

### [](#semantic-to-coarse-tokens)Semantic to coarse tokens

*   Input: semantic tokens
*   Output: tokens from the first two codebooks of the [EnCodec codec](https://github.com/facebookresearch/encodec) from Facebook Research

### [](#coarse-to-fine-tokens)Coarse to fine tokens

*   Input: the first two codebooks from EnCodec
*   Output: 8 codebooks from EnCodec

### [](#architecture)Architecture

| Model | Parameters | Attention | Output vocab size |
|---|---|---|---|
| Text to semantic tokens | 80/300 M | Causal | 10,000 |
| Semantic to coarse tokens | 80/300 M | Causal | 2x 1,024 |
| Coarse to fine tokens | 80/300 M | Non-causal | 6x 1,024 |

### [](#release-date)Release date

April 2023

[](#broader-implications)Broader Implications
---------------------------------------------

We anticipate that this model's text-to-audio capabilities can be used to improve accessibility tools in a variety of languages.

While we hope that this release will enable users to express their creativity and build applications that are a force for good, we acknowledge that any text to audio model has the potential for dual use. While it is not straightforward to voice clone known people with Bark, it can still be used for nefarious purposes. To further reduce the chances of unintended use of Bark, we also release a simple classifier to detect Bark-generated audio with high accuracy (see notebooks section of the main repository).

[](#license)License
-------------------

Bark is licensed under the [MIT License](https://github.com/suno-ai/bark/blob/main/LICENSE), meaning it's available for commercial use.

## Model overview

`bark-small` is a transformer-based text-to-audio model created by [Suno](https://aimodels.fyi/creators/huggingFace/suno). It can generate highly realistic, multilingual speech as well as other audio including music, background noise, and simple sound effects. The model can also produce nonverbal communications like laughing, sighing, and crying. 

The `bark-small` checkpoint is one of two Bark model versions released by Suno, with the other being the larger `bark` model. Both models demonstrate impressive text-to-speech capabilities, though the `bark-small` version may have slightly lower fidelity compared to the larger model.

## Model inputs and outputs

### Inputs
- **Text**: The model takes text prompts as input, which it then uses to generate the corresponding audio.
- **Voice preset** (optional): Along with the text prompt, the Transformers processor accepts a `voice_preset` argument (for example `"v2/en_speaker_6"`) that selects a speaker voice. Bark does not take a free-form description of voice gender, speaking style or background noise.

### Outputs
- **Audio**: The primary output of the `bark-small` model is natural-sounding audio that corresponds to the given text prompt.

## Capabilities

The `bark-small` model can generate a wide range of audio content beyond just speech, including music, ambient sounds, and even nonverbal expressions like laughter and sighs. This versatility makes it a powerful tool for creating immersive audio experiences. The model is also multilingual, allowing users to generate speech in numerous languages.

## What can I use it for?

The `bark-small` model's ability to generate high-quality, expressive audio from text makes it well-suited for a variety of applications. Potential use cases include:

- Enhancing accessibility by generating audio versions of text content
- Creating more engaging audio experiences for games, films, or podcasts
- Prototyping voice interfaces or conversational AI assistants
- Generating audio prompts for AI models like DALL-E or Imagen

While the model is not intended for real-time applications, its speed and quality suggest that developers could build applications on top of it that allow for near-real-time speech generation.

## Things to try

One interesting feature of the `bark-small` model is its ability to generate nonverbal sounds like laughter, sighs, and vocal expressions. Experimenting with prompts that incorporate these elements can help uncover the model's expressive range and create more natural-sounding audio.

Additionally, users can try different voice presets to change the speaker, and add cues such as `[laughs]` to the prompt text itself. Exploring how these choices influence the output can lead to more tailored and nuanced audio experiences.