Last updated 6/19/2024
Model overview

VoiceCraft is a token-infilling neural codec language model, made available by the maintainer cjwbw. It reports state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts. Unlike voice cloning approaches that require high-quality reference audio, VoiceCraft can clone an unseen voice from just a few seconds of reference audio.

Model inputs and outputs

VoiceCraft is a versatile model that can be used for both speech editing and zero-shot text-to-speech. For speech editing, the model takes in the original audio, the transcript, and target edits to the transcript. For zero-shot TTS, the model only requires a few seconds of reference audio and the target transcript.
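
To make "target edits to the transcript" concrete, the sketch below compares an original and a target transcript word by word and lists the spans that differ. It uses only Python's standard `difflib`. The function name `transcript_edits` and the returned dictionary keys are illustrative assumptions, not part of VoiceCraft, and the model's own tooling may locate edit spans differently.

```python
from difflib import SequenceMatcher


def transcript_edits(original: str, target: str) -> list[dict]:
    """Return word-level edits that turn `original` into `target`.

    Hypothetical helper for illustration; not a VoiceCraft API.
    """
    src = original.split()
    dst = target.split()
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)

    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged words would not need to be regenerated
        edits.append({
            "op": tag,                      # 'replace', 'delete', or 'insert'
            "original_words": src[i1:i2],   # words removed from the original
            "target_words": dst[j1:j2],     # words that take their place
            "original_span": (i1, i2),      # word indices in the original
        })
    return edits


if __name__ == "__main__":
    for edit in transcript_edits(
        "the quick brown fox jumps",
        "the quick red fox jumps high",
    ):
        print(edit)
```

A word-level diff like this only says which words change. Mapping those words to time ranges in the audio would additionally need word timestamps, for example from a forced aligner.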


Inputs

  • Original audio: The audio file to be edited or used as a reference for TTS
  • Original transcript: The transcript of the original audio, which can be generated automatically with a model such as WhisperX
  • Target transcript: The desired transcript for the edited or synthesized audio
  • Reference audio duration: How much of the original audio, in seconds, to use as the reference for zero-shot TTS


Outputs

  • Edited audio: The audio with the specified edits applied
  • Synthesized audio: The audio generated from the target transcript using the reference audio
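
The reference audio duration input amounts to keeping only the first few seconds of a recording as the voice prompt. A minimal sketch of that trimming step follows, assuming the reference clip is an uncompressed PCM WAV file and using only Python's standard `wave` module. The function name `trim_reference` is a hypothetical helper, not a VoiceCraft API, and the model's own pipeline may handle this step internally.

```python
import wave


def trim_reference(src_path: str, dst_path: str, seconds: float) -> float:
    """Copy the first `seconds` of a PCM WAV file to `dst_path`.

    Returns the duration actually written, which is shorter than
    `seconds` when the source file itself is shorter.
    Hypothetical helper for illustration; not a VoiceCraft API.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")

    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Never request more frames than the file contains.
        n_frames = min(src.getnframes(), int(seconds * rate))
        frames = src.readframes(n_frames)

    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width, and sample rate from the source;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    return n_frames / rate
```

A fixed cut point can land in the middle of a word. In practice it is probably safer to cut at a word boundary, so that the transcript supplied with the reference matches the audio that is kept.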


Capabilities

VoiceCraft is capable of high-quality speech editing and zero-shot text-to-speech. It can blend new content into existing audio, enabling edits such as adding, removing, or replacing words while keeping the original speaker's voice. For zero-shot TTS, VoiceCraft can generate natural-sounding speech in the voice of the reference audio, without any fine-tuning or additional training.

What can I use it for?

VoiceCraft can be used in a variety of applications, such as podcast production, audiobook creation, video dubbing, and voice assistant development. With its ability to edit and synthesize speech, creators can efficiently produce high-quality audio content without the need for extensive post-production work or specialized recording equipment. Additionally, VoiceCraft can be used to create personalized text-to-speech applications, where users can have their content read aloud in a voice of their choice.

Things to try

One interesting thing to try with VoiceCraft is speech-to-speech translation. Given an audio clip in one language and a transcript in the target language, the model may be able to generate the translated audio in the voice of the original speaker, though how well this works will depend on the languages the model was trained on. If it works, this could be particularly useful for international collaborations or accessibility purposes.

Another idea is to explore the model's capabilities for audio restoration and enhancement. By providing VoiceCraft with a low-quality audio recording and the desired improvements, it may be able to generate a higher-quality version of the audio, while preserving the original speaker's voice.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

