incredibly-fast-whisper

Maintainer: vaibhavs10 - Last updated 12/13/2024

Model overview

incredibly-fast-whisper is an opinionated CLI tool built on top of OpenAI's Whisper large-v3 model and designed for very fast audio transcription. It is powered by Hugging Face Transformers, Optimum, and Flash Attention 2, and it is reported to transcribe 150 minutes of audio in less than 98 seconds, a significant speedup over the standard Whisper model. The tool is part of a community-driven project started by vaibhavs10 to showcase advanced Transformers optimizations.

The incredibly-fast-whisper model is comparable to other Whisper-based models like whisperx, whisper-diarization, and metavoice, each of which offers its own unique set of features and optimizations for speech-to-text transcription.

Model inputs and outputs

Inputs

  • Audio file: The primary input for the incredibly-fast-whisper model is an audio file, which can be provided as a local file path or a URL.
  • Task: The model supports two tasks: transcription (the default) and translation of the audio into English.
  • Language: The language of the input audio, which can be specified or left as "None" to allow the model to auto-detect the language.
  • Batch size: The number of parallel batches to compute. Reduce it if you hit out-of-memory (OOM) errors.
  • Timestamp format: The model can output timestamps at either the chunk or word level.
  • Diarization: The model can use Pyannote.audio to perform speaker diarization, but this requires providing a Hugging Face API token.
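The options above can be collected into a single validated input dictionary. The following is a minimal sketch, not the model's actual interface: the key names (`audio`, `task`, `language`, `batch_size`, `timestamp`, `diarise_audio`, `hf_token`), the default batch size of 24, and the helper name `build_input` are all assumptions made for illustration.

```python
from typing import Optional

# Assumed option values, taken from the input descriptions above.
ALLOWED_TASKS = {"transcribe", "translate"}
ALLOWED_TIMESTAMPS = {"chunk", "word"}


def build_input(
    audio: str,
    task: str = "transcribe",
    language: Optional[str] = None,
    batch_size: int = 24,  # assumed default; lower it on OOM errors
    timestamp: str = "chunk",
    diarise_audio: bool = False,
    hf_token: Optional[str] = None,
) -> dict:
    """Validate the options and return them as a plain dict (hypothetical keys)."""
    if task not in ALLOWED_TASKS:
        raise ValueError(f"task must be one of {sorted(ALLOWED_TASKS)}")
    if timestamp not in ALLOWED_TIMESTAMPS:
        raise ValueError(f"timestamp must be one of {sorted(ALLOWED_TIMESTAMPS)}")
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    if diarise_audio and not hf_token:
        raise ValueError("diarization requires a Hugging Face token")

    payload = {
        "audio": audio,  # local file path or URL
        "task": task,
        # "None" asks the model to auto-detect the language.
        "language": language if language is not None else "None",
        "batch_size": batch_size,
        "timestamp": timestamp,
        "diarise_audio": diarise_audio,
    }
    if hf_token:
        payload["hf_token"] = hf_token
    return payload
```

Validating before submission catches the most likely mistake early: enabling diarization without supplying a token.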

Outputs

The primary output of the incredibly-fast-whisper model is a transcription of the input audio, which can be saved to a JSON file.
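As an illustration of working with the saved JSON, the sketch below converts a transcription into SRT subtitles. It assumes the file holds a `chunks` list whose entries each have a `timestamp` pair (start and end, in seconds) and a `text` string, which is the shape the Transformers speech recognition pipeline returns; the actual file layout may differ. The helper names are hypothetical.

```python
import json


def load_result(path: str) -> dict:
    """Read a transcription JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def format_srt_time(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(result: dict) -> str:
    """Turn the assumed 'chunks' list into numbered SRT blocks."""
    blocks = []
    for index, chunk in enumerate(result.get("chunks", []), start=1):
        start, end = chunk["timestamp"]
        if end is None:  # the final chunk may have no end time
            end = start
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

Calling `chunks_to_srt(load_result("output.json"))` would then give subtitle text ready to write to a `.srt` file, where `output.json` is a placeholder file name.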

Capabilities

The incredibly-fast-whisper model gets its transcription speed from several optimizations, including Flash Attention 2 and BetterTransformer. These allow it to transcribe significantly faster than the standard Whisper large-v3 model while maintaining high accuracy.
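To make the optimization concrete, here is a sketch of how a Transformers speech recognition pipeline can be asked to use Flash Attention 2. It is not the tool's own source code: `build_pipeline_kwargs` and `load_pipeline` are hypothetical helpers, and falling back to PyTorch's built-in scaled dot-product attention (`"sdpa"`) is an illustrative choice rather than something the source describes. Loading the pipeline assumes a CUDA GPU and, for Flash Attention 2, an installed `flash-attn` package.

```python
def build_pipeline_kwargs(
    model_name: str = "openai/whisper-large-v3",
    use_flash_attention: bool = True,
    device: str = "cuda:0",
) -> dict:
    """Assemble keyword arguments for transformers.pipeline (hypothetical helper)."""
    attention = "flash_attention_2" if use_flash_attention else "sdpa"
    return {
        "task": "automatic-speech-recognition",
        "model": model_name,
        "device": device,
        # attn_implementation selects the attention kernel when the model loads.
        "model_kwargs": {"attn_implementation": attention},
    }


def load_pipeline(**overrides):
    """Create the pipeline in half precision; heavy imports are deferred."""
    import torch
    from transformers import pipeline

    # Newer Transformers releases also accept `dtype` in place of `torch_dtype`.
    return pipeline(torch_dtype=torch.float16, **build_pipeline_kwargs(**overrides))


# Example call (needs a GPU and downloads the model weights):
# pipe = load_pipeline()
# result = pipe("audio.mp3", chunk_length_s=30, batch_size=24, return_timestamps=True)
```

Half precision and batched 30-second chunks are the other two levers in this setup. Lowering `batch_size` trades speed for memory headroom.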

What can I use it for?

The incredibly-fast-whisper model is well suited to applications that need fast, near-real-time turnaround on audio transcription, such as captioning, podcast production, or meeting transcription. Because it takes an audio file or URL as input, its speed pays off most when processing large amounts of recorded audio.

Things to try

One interesting feature of the incredibly-fast-whisper model is its support for the distil-whisper/large-v2 checkpoint, which is a smaller and more efficient version of the Whisper model. Users can experiment with this checkpoint to find the right balance between speed and accuracy for their specific use case.

Additionally, the model's ability to leverage Flash Attention 2 and BetterTransformer optimizations opens up opportunities for further experimentation and customization. Users can explore different configurations of these optimizations to see how they impact transcription speed and quality.
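One way to compare configurations is to time each one on the same audio and report how many seconds of audio it processes per second of wall-clock time. The harness below is a generic sketch: `benchmark` and `real_time_factor` are hypothetical helpers, and each configuration is passed in as a zero-argument callable that runs the transcription you want to measure.

```python
import time
from typing import Callable, Dict


def real_time_factor(audio_seconds: float, elapsed_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return audio_seconds / elapsed_seconds


def benchmark(
    configs: Dict[str, Callable[[], object]], audio_seconds: float
) -> Dict[str, float]:
    """Run each configuration once and return its real-time factor."""
    results = {}
    for name, run in configs.items():
        started = time.perf_counter()
        run()  # e.g. a closure that transcribes the same file
        elapsed = time.perf_counter() - started
        results[name] = real_time_factor(audio_seconds, elapsed)
    return results
```

For scale, the headline figure of 150 minutes of audio in 98 seconds corresponds to `real_time_factor(150 * 60, 98)`, roughly 92 times faster than real time. A single run per configuration is noisy, so repeat the measurement before drawing conclusions.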



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Total Score: 4.2K

Related Models

whisper

Maintainer: openai - Total Score: 48.8K

Whisper is a general-purpose speech recognition model developed by OpenAI. It is capable of converting speech in audio to text, with the ability to translate the text to English if desired. Whisper is based on a large Transformer model trained on a diverse dataset of multilingual and multitask speech recognition data. This allows the model to handle a wide range of accents, background noises, and languages. Similar models like whisper-large-v3, incredibly-fast-whisper, and whisper-diarization offer various optimizations and additional features built on top of the core Whisper model.

Model inputs and outputs

Whisper takes an audio file as input and outputs a text transcription. The model can also translate the transcription to English if desired. The input audio can be in various formats, and the model supports a range of parameters to fine-tune the transcription, such as temperature, patience, and language.

Inputs

  • Audio: The audio file to be transcribed
  • Model: The specific version of the Whisper model to use; currently only large-v3 is supported
  • Language: The language spoken in the audio, or None to perform language detection
  • Translate: A boolean flag to translate the transcription to English
  • Transcription: The format for the transcription output, such as "plain text"
  • Initial Prompt: An optional initial text prompt to provide to the model
  • Suppress Tokens: A list of token IDs to suppress during sampling
  • Logprob Threshold: The minimum average log probability threshold for a successful transcription
  • No Speech Threshold: The threshold for considering a segment as silence
  • Condition on Previous Text: Whether to provide the previous output as a prompt for the next window
  • Compression Ratio Threshold: The maximum compression ratio threshold for a successful transcription
  • Temperature Increment on Fallback: The temperature increase when the decoding fails to meet the specified thresholds

Outputs

  • Transcription: The text transcription of the input audio
  • Language: The detected language of the audio (if language input is None)
  • Tokens: The token IDs corresponding to the transcription
  • Timestamp: The start and end timestamps for each word in the transcription
  • Confidence: The confidence score for each word in the transcription

Capabilities

Whisper is a powerful speech recognition model that can handle a wide range of accents, background noises, and languages. The model is capable of accurately transcribing audio and optionally translating the transcription to English. This makes Whisper useful for a variety of applications, such as real-time captioning, meeting transcription, and audio-to-text conversion.

What can I use it for?

Whisper can be used in various applications that require speech-to-text conversion, such as:

  • Captioning and Subtitling: Automatically generate captions or subtitles for videos, improving accessibility for viewers.
  • Meeting Transcription: Transcribe audio recordings of meetings, interviews, or conferences for easy review and sharing.
  • Podcast Transcription: Convert audio podcasts to text, making the content more searchable and accessible.
  • Language Translation: Transcribe audio in one language and translate the text to English, enabling cross-language communication.
  • Voice Interfaces: Integrate Whisper into voice-controlled applications, such as virtual assistants or smart home devices.

Things to try

One interesting aspect of Whisper is its ability to handle a wide range of languages and accents. You can experiment with the model's performance on audio samples in different languages or with various background noises to see how it handles different real-world scenarios. Additionally, you can explore the impact of the different input parameters, such as temperature, patience, and language detection, on the transcription quality and accuracy.

Updated 12/13/2024

Audio-to-Text
insanely-fast-whisper-with-video

Maintainer: turian - Total Score: 110

The insanely-fast-whisper-with-video model, created by turian, is an audio transcription tool built on OpenAI's Whisper Large v3 model. It can transcribe up to 150 minutes of audio in less than 98 seconds on an Nvidia A100 80GB GPU. The model also supports video transcription, making it a versatile tool for a wide range of applications. The insanely-fast-whisper-with-video model builds upon the work of chenxwh/insanely-fast-whisper and adidoes/cog-whisperx-video-transcribe, using techniques like fp16 precision, batching, Flash Attention 2, and BetterTransformer to achieve these transcription speeds.

Model inputs and outputs

Inputs

  • File Name: The path or URL to the audio or video file to be transcribed.
  • Task: The task to be performed, either transcription or translation.
  • Language: The language of the input audio (optional, Whisper can auto-detect the language).
  • Batch Size: The number of parallel batches to compute, adjustable to avoid Out-Of-Memory (OOM) issues.
  • Timestamp: The type of timestamp to generate, either chunked or word-level.
  • Diarise Audio: Whether to use Pyannote.audio to diarise the audio clips, which requires a Hugging Face token.

Outputs

The transcription output, which can be saved to a specified file path.

Capabilities

The insanely-fast-whisper-with-video model is capable of transcribing audio files with exceptional speed and accuracy, thanks to the use of advanced techniques like Flash Attention 2. The model can handle a wide range of audio formats and can even transcribe video files by downloading the audio using yt-dlp.

What can I use it for?

The insanely-fast-whisper-with-video model is a versatile tool that can be used in a variety of applications, such as:

  • Automated Subtitling: Quickly generate accurate subtitles for videos, making them more accessible to a wider audience.
  • Audio Transcription: Efficiently transcribe audio recordings for use in various contexts, such as meeting minutes, interviews, or podcasts.
  • Language Translation: Translate audio content from other languages into English, enabling communication across languages.
  • Media Production: Streamline the post-production process by automating the transcription of audio and video content.

Things to try

One interesting aspect of the insanely-fast-whisper-with-video model is its ability to leverage Flash Attention 2, a technique that can significantly improve the transcription speed without sacrificing accuracy. Users can experiment with enabling or disabling this feature to see the impact on their specific use cases. Additionally, the model supports both the Whisper Large v3 and Distil Whisper Large v2 models, allowing users to choose the one that best fits their requirements in terms of speed and accuracy.

Updated 12/13/2024

Video-to-Text
whisper-large-v3

Maintainer: nateraw - Total Score: 3

whisper-large-v3 is a powerful speech recognition model developed by OpenAI. It is a general-purpose model trained on a large, diverse dataset of audio data, enabling it to perform a wide range of speech processing tasks, including multilingual speech recognition, speech translation, and language identification.

The model uses a Transformer-based sequence-to-sequence architecture, which allows it to handle various speech-related tasks in a unified manner. This approach contrasts with traditional speech processing pipelines that require multiple specialized components. The model is trained on a set of special tokens that serve as task specifiers or classification targets, enabling it to handle these tasks seamlessly.

Similar models like whisper, incredibly-fast-whisper, whisperx, whisper-diarization, and whisperx-a40-large build upon the capabilities of whisper-large-v3, offering various optimizations and additional features.

Model inputs and outputs

Inputs

  • filepath: Path to the audio file to be transcribed
  • language: The source language of the audio, if known (e.g., English, French, Japanese)
  • return_timestamps: Whether to return timestamps for each transcribed chunk
  • translate: Whether to translate the transcribed text to English

Outputs

The transcribed text from the input audio file

Capabilities

whisper-large-v3 is capable of accurately transcribing speech in a wide range of languages, including English, Mandarin Chinese, German, Spanish, Russian, and many others. It can also translate speech from the source language to English, making it a valuable tool for multilingual communication and content creation.

One of the key strengths of the model is its ability to handle diverse audio input, including audio with background noise, accents, and varying recording quality. This makes it suitable for a variety of real-world applications, from transcribing interviews and meetings to captioning videos and podcasts.

What can I use it for?

whisper-large-v3 can be leveraged in numerous applications, such as:

  • Transcription and captioning: Automatically transcribe audio content, including podcasts, interviews, and video recordings, to generate text-based transcripts or captions.
  • Multilingual communication: Translate spoken language in real-time, enabling seamless communication across language barriers.
  • Voice interfaces: Integrate speech recognition capabilities into conversational interfaces, virtual assistants, and other voice-based applications.
  • Accessibility and inclusion: Provide accessible content for individuals with hearing impairments or language barriers.
  • Content creation: Streamline the process of creating transcripts, subtitles, and multi-lingual content for various media formats.

Things to try

One interesting aspect of whisper-large-v3 is its ability to handle a wide range of languages. You can experiment with transcribing audio in different languages and observe how the model performs. Additionally, you can try the translation feature to see how it converts speech from the source language to English.

Another area to explore is the model's robustness to various audio conditions, such as background noise, accents, and recording quality. By testing the model with diverse audio samples, you can assess its real-world performance and identify any potential limitations or areas for improvement.

Overall, whisper-large-v3 is a powerful and versatile speech recognition model that can be leveraged in a wide range of applications. Its multitasking capabilities and strong performance make it an attractive choice for many speech-related projects.

Updated 12/13/2024

Audio-to-Text
whisperx

Maintainer: daanelson - Total Score: 52

whisperx is a Cog implementation of the WhisperX library, which adds batch processing on top of the popular Whisper speech recognition model. This allows for very fast audio transcription compared to the original Whisper model. whisperx is developed and maintained by daanelson.

Similar models include whisperx-victor-upmeet, which provides accelerated transcription, word-level timestamps, and diarization with the Whisper large-v3 model, and whisper-diarization-thomasmol, which offers fast audio transcription, speaker diarization, and word-level timestamps.

Model inputs and outputs

whisperx takes an audio file as input, along with optional parameters to control the batch size, whether to output only the transcribed text or include segment metadata, and whether to print out memory usage information for debugging purposes.

Inputs

  • audio: The audio file to be transcribed
  • batch_size: The number of audio segments to process in parallel for faster transcription
  • only_text: A boolean flag to return only the transcribed text, without segment metadata
  • align_output: A boolean flag to generate word-level timestamps (currently only works for English)
  • debug: A boolean flag to print out memory usage information

Outputs

The transcribed text, optionally with segment-level metadata

Capabilities

whisperx builds on the strong speech recognition capabilities of the Whisper model, providing accelerated transcription through batch processing. This can be particularly useful for transcribing long audio files or processing multiple audio files in parallel.

What can I use it for?

whisperx can be used for a variety of applications that require fast and accurate speech-to-text transcription, such as podcast production, video captioning, or meeting minutes generation. The ability to process audio in batches and the option to output only the transcribed text can make the model well-suited for high-volume or real-time transcription scenarios.

Things to try

One interesting aspect of whisperx is the ability to generate word-level timestamps, which can be useful for applications like video editing or language learning. You can experiment with the align_output parameter to see how this feature performs on your audio files.

Another thing to try is leveraging the batch processing capabilities of whisperx to transcribe multiple audio files in parallel, which can significantly reduce the overall processing time for large-scale transcription tasks.

Updated 12/13/2024

Audio-to-Text