whisper-large-v3, incredibly fast, with video transcription

## Model overview

The `insanely-fast-whisper-with-video` model, created by [turian](https://aimodels.fyi/creators/replicate/turian), is an audio transcription tool built on OpenAI's [Whisper Large v3](https://huggingface.co/openai/whisper-large-v3) model. Its main selling point is speed: it can transcribe 150 minutes of audio in less than 98 seconds on an Nvidia A100 80GB GPU. The model also supports video transcription, which widens the range of applications it can serve.
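To put the speed claim in perspective, the figures above can be turned into a real-time factor (audio duration divided by processing time). The helper name `real_time_factor` is made up for this illustration; only the 150-minute and 98-second figures come from the description above.

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Return how many seconds of audio are processed per second of wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# 150 minutes of audio transcribed in 98 seconds (the quoted upper bound).
rtf = real_time_factor(150 * 60, 98)
print(f"{rtf:.1f}x real time")  # prints "91.8x real time"
```

Because 98 seconds is quoted as an upper bound, roughly 92x real time is a lower bound for that hardware; results on other GPUs will likely differ.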

The `insanely-fast-whisper-with-video` model builds upon the work of [chenxwh/insanely-fast-whisper](https://github.com/chenxwh/insanely-fast-whisper) and [adidoes/cog-whisperx-video-transcribe](https://github.com/adidoes/cog-whisperx-video-transcribe). It reaches these transcription speeds by combining `fp16` precision, `batching`, `Flash Attention 2`, and `bettertransformer`.
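As a rough illustration of how these techniques fit together, here is a minimal sketch using the Hugging Face Transformers ASR pipeline. It is not this model's actual code. It assumes `torch`, `transformers`, and `flash-attn` are installed and a CUDA GPU is available; the function names `build_asr_options` and `transcribe`, the default batch size of 24, and the 30-second chunk length are assumptions chosen for the example.

```python
def build_asr_options(batch_size=24, use_flash_attention=True, word_timestamps=False):
    """Return (model_kwargs, call_kwargs) for a Transformers ASR pipeline."""
    model_kwargs = {
        # Flash Attention 2 needs the flash-attn package; "sdpa" is a fallback.
        "attn_implementation": "flash_attention_2" if use_flash_attention else "sdpa",
    }
    call_kwargs = {
        "chunk_length_s": 30,      # split long audio into 30-second chunks
        "batch_size": batch_size,  # chunks processed in parallel (batching)
        "return_timestamps": "word" if word_timestamps else True,
    }
    return model_kwargs, call_kwargs


def transcribe(path, **options):
    """Transcribe an audio file with Whisper Large v3 in fp16 on a GPU."""
    # Imported lazily so the option helper works without a GPU stack installed.
    import torch
    from transformers import pipeline

    model_kwargs, call_kwargs = build_asr_options(**options)
    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,  # fp16 precision
        device="cuda:0",
        model_kwargs=model_kwargs,
    )
    return pipe(path, **call_kwargs)
```

Calling `transcribe("meeting.mp3", batch_size=8)` would trade some speed for a lower memory footprint; the file name here is only a placeholder.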

## Model inputs and outputs

### Inputs
- **File Name**: The path or URL to the audio or video file to be transcribed.
- **Task**: The task to be performed, either transcription or translation.
- **Language**: The language of the input audio (optional; Whisper can auto-detect it).
- **Batch Size**: The number of parallel batches to compute; reduce it if you run into Out-Of-Memory (OOM) errors.
- **Timestamp**: The type of timestamp to generate, either chunked or word-level.
- **Diarise Audio**: Whether to use Pyannote.audio to diarise the audio clips, which requires a Hugging Face token.
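
The inputs above could be modeled and checked before a request is submitted. The sketch below is illustrative only: the field names (`file_name`, `hf_token`, and so on), the allowed string values, and the defaults are assumptions, not the model's confirmed input schema.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed spellings of the allowed values; check the model's input form.
VALID_TASKS = {"transcribe", "translate"}
VALID_TIMESTAMPS = {"chunk", "word"}


@dataclass
class TranscriptionRequest:
    file_name: str                  # path or URL to the audio/video file
    task: str = "transcribe"
    language: Optional[str] = None  # None lets Whisper auto-detect
    batch_size: int = 24            # hypothetical default
    timestamp: str = "chunk"
    diarise_audio: bool = False
    hf_token: Optional[str] = None  # needed only when diarising

    def validate(self) -> None:
        """Raise ValueError if the combination of inputs is inconsistent."""
        if self.task not in VALID_TASKS:
            raise ValueError(f"task must be one of {sorted(VALID_TASKS)}")
        if self.timestamp not in VALID_TIMESTAMPS:
            raise ValueError(f"timestamp must be one of {sorted(VALID_TIMESTAMPS)}")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if self.diarise_audio and not self.hf_token:
            raise ValueError("diarisation requires a Hugging Face token")
```

For example, `TranscriptionRequest(file_name="talk.mp4", timestamp="word").validate()` passes, while enabling `diarise_audio` without a token raises a `ValueError`.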

### Outputs
- The transcription output, which can be saved to a specified file path.

## Capabilities

The `insanely-fast-whisper-with-video` model transcribes audio files quickly while keeping the accuracy of Whisper Large v3; the speed comes from techniques like `Flash Attention 2`. The model can handle a wide range of audio formats and can also transcribe video files by downloading the audio using `yt-dlp`.
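
How the model invokes `yt-dlp` is not described here, but the audio-download step might look like the following sketch using the `yt_dlp` Python package. It assumes `yt-dlp` and `ffmpeg` are installed; the helper names, the output file stem, and the choice of MP3 are assumptions made for the example.

```python
def audio_download_options(output_stem: str, codec: str = "mp3") -> dict:
    """Build yt-dlp options that keep only the audio track of a video."""
    return {
        "format": "bestaudio/best",            # prefer an audio-only stream
        "outtmpl": f"{output_stem}.%(ext)s",   # yt-dlp fills in the extension
        "postprocessors": [
            # Re-encode to the requested codec with ffmpeg after download.
            {"key": "FFmpegExtractAudio", "preferredcodec": codec},
        ],
        "quiet": True,
    }


def download_audio(url: str, output_stem: str = "audio", codec: str = "mp3") -> str:
    """Download the audio of a video URL and return the resulting file path."""
    # Imported lazily so the option helper works without yt-dlp installed.
    from yt_dlp import YoutubeDL

    with YoutubeDL(audio_download_options(output_stem, codec)) as ydl:
        ydl.download([url])
    return f"{output_stem}.{codec}"
```

The returned path can then be passed to whatever transcription step follows.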

## What can I use it for?

The `insanely-fast-whisper-with-video` model is a versatile tool that can be used in a variety of applications, such as:

- **Automated Subtitling**: Quickly generate accurate subtitles for videos, making them more accessible to a wider audience.
- **Audio Transcription**: Efficiently transcribe audio recordings for use in various contexts, such as meeting minutes, interviews, or podcasts.
- **Language Translation**: Translate speech in other languages into English text (Whisper's translation task targets English), making foreign-language recordings easier to work with.
- **Media Production**: Streamline the post-production process by automating the transcription of audio and video content.
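
For the subtitling use case, timestamped chunks can be converted into the SRT subtitle format. The sketch below assumes each chunk is a dictionary with a `timestamp` pair (start and end, in seconds) and a `text` field, as the Hugging Face Transformers ASR pipeline returns; the exact output schema of this model is not specified here, so treat that shape as an assumption.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks) -> str:
    """Convert timestamped chunks into the text of an SRT subtitle file."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # A final chunk may lack an end time; fall back to its start.
            end = start
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle entries.
    return "\n".join(entries)


example = [
    {"timestamp": (0.0, 2.5), "text": " Hello and welcome."},
    {"timestamp": (2.5, 5.0), "text": " Let's get started."},
]
print(chunks_to_srt(example))
```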

## Things to try

The `insanely-fast-whisper-with-video` model can use `Flash Attention 2`, a technique that can significantly improve transcription speed without sacrificing accuracy. Users can experiment with enabling or disabling this feature to measure its impact on their own workloads.

Additionally, the model supports both the Whisper Large v3 and Distil Whisper Large v2 models, allowing users to choose the one that best fits their requirements in terms of speed and accuracy.