whisper.cpp

Maintainer: ggerganov


whisper.cpp is a collection of OpenAI's Whisper models converted to the ggml format by the maintainer, ggerganov. Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation, trained on 680k hours of labeled data, and it generalizes well to many datasets and domains without fine-tuning. Related Whisper checkpoints on the Hugging Face Hub include whisper-large-v3, whisper-tiny.en, whisper-large-v2, whisper-small, and whisper-large; these vary in size and capability, with larger models generally performing better at the cost of more compute.

Model Inputs and Outputs

The whisper.cpp models take audio as input and output text. The model is told which task to perform (transcription or translation) via "context tokens" passed to the decoder at the start of decoding.

Inputs

- Audio data

Outputs

- Text transcriptions or translations

Capabilities

The whisper.cpp models are more robust to accents, background noise, and technical language than many existing ASR systems. They can also perform zero-shot translation from multiple languages into English.

What Can I Use It For?

The whisper.cpp models suit a variety of audio-to-text applications, such as:

- Improving accessibility tools with speech-to-text capabilities
- Building near-real-time speech recognition and translation applications on top of the models
- Automating transcription and translation of large volumes of audio

While the models perform strongly, the maintainers caution against using them for high-risk applications or subjective classification tasks, as performance can vary across languages, accents, and demographics.

Things to Try

One interesting aspect of the whisper.cpp models is their ability to perform long-form transcription using a chunking algorithm.
This allows the models to transcribe audio of arbitrary length rather than being limited to short 30-second clips; you can experiment with this functionality using the Transformers pipeline in Python. Another area to explore is fine-tuning the pre-trained Whisper models on specific datasets or tasks. The maintainers provide a blog post with a step-by-step guide on fine-tuning Whisper with as little as 5 hours of labeled data, which can further improve the models' performance for your particular use case.
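The chunking idea can be sketched in a few lines of Python. This is a simplified illustration, not the actual algorithm used by whisper.cpp or the Transformers pipeline (both merge overlapping predictions more carefully), and the window and overlap sizes here are assumptions:

```python
def chunk_audio(n_samples, sample_rate=16000, chunk_s=30, overlap_s=5):
    """Yield (start, end) sample indices covering the audio in fixed-size
    windows with a small overlap, so per-window transcripts can later be
    stitched together on the overlapping regions.

    Simplified sketch; real pipelines align and merge overlapping tokens.
    """
    chunk = chunk_s * sample_rate                  # samples per window
    step = (chunk_s - overlap_s) * sample_rate     # stride between windows
    starts = range(0, max(n_samples - overlap_s * sample_rate, 1), step)
    return [(s, min(s + chunk, n_samples)) for s in starts]
```

For a 60-second clip at 16 kHz this yields three windows, 0-30 s, 25-55 s, and 50-60 s, each short enough for a model trained on 30-second inputs; the 5-second overlaps give the merge step shared context between neighboring transcripts.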


Updated 5/17/2024