WhisperKit Evaluation Results
===============================================================

Dataset: `librispeech`
----------------------------------------------

### WhisperKit + `openai_whisper-large-v3` (+optimized variants)

| Model | WER | QoI (%) | File Size (MB) |
| --- | --- | --- | --- |
| [openai\_whisper-large-v3](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3) | 2.44 | 100 | 3100 |
| [openai\_whisper-large-v3\_turbo](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_turbo) | 2.41 | 99.8 | 3100 |
| [openai\_whisper-large-v3\_turbo\_1307MB](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_turbo_1307MB) | 2.6 | 97.7 | 1307 |
| [openai\_whisper-large-v3\_turbo\_1049MB](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_turbo_1049MB) | 4.81 | 91 | 1049 |
| [openai\_whisper-large-v3\_1053MB](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_1053MB) | 4.65 | 90.8 | 1053 |

### Different Projects + `openai_whisper-large-v3`

| Project | WER | Commit Hash | Model Format |
| --- | --- | --- | --- |
| [WhisperKit](https://github.com/argmaxinc/whisperkit) | [2.44](https://hf.co/datasets/argmaxinc/whisperkit-evals-rc1/tree/main/WhisperKit/openai_whisper-large-v3/librispeech) | 0f8b4fe | Core ML |
| [WhisperCpp](https://github.com/ggerganov/whisper.cpp) | [2.36](https://hf.co/datasets/argmaxinc/whisperkit-evals-rc1/tree/main/whisper.cpp/openai_whisper-large-v3/librispeech) | e72e415 | Core ML + GGUF |
| [WhisperMLX](https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py) | [2.69](https://hf.co/datasets/argmaxinc/whisperkit-evals-rc1/tree/main/WhisperMLX/openai_whisper-large-v3/librispeech) | 614de66 | MLX (Numpy) |

### Quality-of-Inference (QoI) Certification

We believe that rigorously measuring the quality of inference is necessary for developers and enterprises to make informed decisions when opting to use optimized or compressed variants of Whisper models in production. The current measurements are between reference and optimized WhisperKit models. We are going to extend the scope of this measurement to other Whisper implementations soon so developers can certify the behavior change (if any) caused by alternating use of WhisperKit with (or migration from) these implementations.

In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below), which is a stricter metric than dataset-average WER. A `qoi` of 100% preserves perfect backwards compatibility on the test distribution and avoids "perceived regressions": cases where known per-example behavior changes after a code or model update and causes divergence in downstream code or breaks the user experience itself (even if dataset averages stay flat across updates). Pseudocode for `qoi`:

    qoi = []
    for example in dataset:
        no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
        qoi.append(no_regression)
    qoi = (sum(qoi) / len(qoi)) * 100.
    

We define the reference model as the default float16 precision Core ML model generated by whisperkittools. This reference model matches the accuracy of the original PyTorch model on the specified test sets. We use `librispeech/test.clean` (5 hours of short English audio clips) as our test set for Whisper, and we are actively expanding our test set coverage to `earnings22` (120 hours of long English audio clips with various accents). We anticipate that developers who use Whisper in production have their own Quality Assurance test sets, and whisperkittools offers the tooling necessary to run the same measurements on such custom test sets. Please see [Model Evaluation on Custom Dataset](#evaluate-on-custom-dataset) for details.
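
The `qoi` pseudocode above can be turned into a small runnable sketch. This is an illustration only, not the whisperkittools implementation: the helper names `word_error_rate` and `qoi` are hypothetical, transcripts are compared as whitespace-separated words with no text normalization, and the sketch assumes that each model's output is scored against a ground-truth transcript, which the pseudocode leaves implicit.

```python
from typing import Sequence


def word_error_rate(truth: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of ground-truth words."""
    ref, hyp = truth.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] is the edit distance between the first i-1 words of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def qoi(
    ground_truth: Sequence[str],
    reference_outputs: Sequence[str],
    optimized_outputs: Sequence[str],
) -> float:
    """Percentage of examples where the optimized model does not regress."""
    no_regression = [
        word_error_rate(truth, opt) <= word_error_rate(truth, ref)
        for truth, ref, opt in zip(ground_truth, reference_outputs, optimized_outputs)
    ]
    return 100.0 * sum(no_regression) / len(no_regression)


# Made-up transcripts: the optimized model regresses on the second example only.
truth = ["the cat sat", "hello world"]
reference = ["the cat sat", "hello word"]
optimized = ["the cat sat", "yellow word"]
print(qoi(truth, reference, optimized))
```

Because the comparison uses `<=`, an optimized model that ties the reference on every example still scores 100%, so the metric penalizes regressions without rewarding improvements.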

### Reproducing Results

Results on this page are generated by our cluster of Apple Silicon Macs, which we use as self-hosted runners on GitHub Actions for our CI infrastructure. Due to [security concerns](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#hardening-for-self-hosted-runners), we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to run identical [evaluation jobs](#evaluation) locally. For reference, our M2 Ultra devices complete a `librispeech` + `openai/whisper-large-v3` evaluation in under 1 hour regardless of the Whisper implementation. Older Apple Silicon Macs should take less than 1 day to complete the same evaluation.

Glossary:

*   `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription as described in our [Blog Post](https://www.takeargmax.com/blog/whisperkit).
    
*   `_*MB`: Indicates the presence of mixed-bit quantization. Instead of cluttering the filename with details like `_AudioEncoder-5.8bits_TextDecoder-6.1bits`, we choose to summarize the compression spec as the resulting total file size since this is what matters to developers in production.

## Model overview

[`whisperkit-coreml_01-30-24`](https://aimodels.fyi/creators/huggingFace/argmaxinc) is a set of optimized variants of the [OpenAI Whisper](https://huggingface.co/openai/whisper-large-v3) model, created by the maintainer [argmaxinc](https://aimodels.fyi/creators/huggingFace/argmaxinc). These variants aim to reduce model size and enable streaming transcription relative to the reference Whisper large-v3 model while maintaining high quality of inference (QoI). On the LibriSpeech dataset, the `_turbo` variant reaches a word error rate (WER) of 2.41 with a QoI of 99.8% at the same 3100 MB file size as the reference, while the compressed `_turbo_1307MB` variant reaches a WER of 2.6 with a QoI of 97.7% at 1307 MB.

## Model inputs and outputs

### Inputs
- Audio data in the form of log-mel spectrograms

### Outputs
- Transcribed text in the target language (English)

## Capabilities

The `whisperkit-coreml_01-30-24` models stay close to the reference Whisper large-v3 model on the LibriSpeech dataset: the `_turbo` variant has nearly the same WER (2.41 versus 2.44), and the compressed variants trade some accuracy for a file size reduced from 3100 MB to as little as 1049 MB. The smaller file sizes make the compressed variants more suitable for deployment on resource-constrained devices, and the `_turbo` optimizations target streaming transcription.

## What can I use it for?

The `whisperkit-coreml_01-30-24` models can be used for a variety of speech recognition tasks, such as transcribing audio recordings, enabling voice-controlled interfaces, or improving accessibility for the hearing impaired. The reduced model size and latency also make these models suitable for integration into mobile apps, edge devices, or other applications where computational resources are limited.

## Things to try

Developers can explore using the `whisperkit-coreml_01-30-24` models in their speech recognition pipelines, either as a drop-in replacement for the original Whisper large-v3 model or as a component in more complex audio processing workflows. Additionally, researchers may be interested in further analyzing the tradeoffs between model size, latency, and QoI to inform the development of even more efficient speech recognition models.

WhisperKit Transcription Quality
=====================================================================

Dataset: `librispeech`
----------------------------------------------

Short-form Audio (<30s/clip) - 5 hours of English audiobook clips

| Model | WER (↓) | QoI (↑) | File Size (MB) | Code Commit |
| --- | --- | --- | --- | --- |
| large-v2 (WhisperOpenAIAPI) | [2.35](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2/librispeech) | 100 | 3100 | N/A |
| [large-v2](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v2) | [2.77](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2/librispeech) | 96.6 | 3100 | [Link](https://github.com/argmaxinc/WhisperKit/commit/2846fd9) |
| [large-v2\_949MB](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v2_949MB) | [2.4](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_949MB/librispeech) | 94.6 | 949 | [Link](https://github.com/argmaxinc/WhisperKit/commit/eca4a2e) |
| [large-v2\_turbo](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v2_turbo) | [2.76](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_turbo/librispeech) | 96.6 | 3100 | [Link](https://github.com/argmaxinc/WhisperKit/commit/2846fd9) |
| [large-v2\_turbo\_955MB](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v2_turbo_955MB) | [2.41](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_turbo_955MB/librispeech) | 94.6 | 955 | [Link](https://github.com/argmaxinc/WhisperKit/commit/cf75348) |
| [large-v3](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v3) | [2.04](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3/librispeech) | 95.2 | 3100 | [Link](https://github.com/argmaxinc/WhisperKit/commit/2846fd9) |
| [large-v3\_turbo](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v3_turbo) | [2.03](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3_turbo/librispeech) | 95.4 | 3100 | [Link](https://github.com/argmaxinc/WhisperKit/commit/2846fd9) |
| [large-v3\_turbo\_954MB](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v3_turbo_954MB) | [2.47](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3_turbo_954MB/librispeech) | 93.9 | 954 | [Link](https://github.com/argmaxinc/WhisperKit/commit/cf75348) |
| [distil-large-v3](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/distil-whisper_distil-large-v3) | [2.47](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/distil-whisper_distil-large-v3/librispeech) | 89.7 | 1510 | [Link](https://github.com/argmaxinc/WhisperKit/commit/cf75348) |
| [distil-large-v3\_594MB](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/distil-whisper_distil-large-v3_594MB) | [2.96](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/distil-whisper_distil-large-v3_594MB/librispeech) | 85.4 | 594 | [Link](https://github.com/argmaxinc/WhisperKit/commit/508240f) |
| [distil-large-v3\_turbo](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/distil-whisper_distil-large-v3_turbo) | [2.47](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/distil-whisper_distil-large-v3_turbo/librispeech) | 89.7 | 1510 | [Link](https://github.com/argmaxinc/WhisperKit/commit/508240f) |
| [distil-large-v3\_turbo\_600MB](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/distil-whisper_distil-large-v3_turbo_600MB) | [2.78](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/distil-whisper_distil-large-v3_turbo_600MB/librispeech) | 86.2 | 600 | [Link](https://github.com/argmaxinc/WhisperKit/commit/ae1cf96) |
| [small.en](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-small.en) | [3.12](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-small.en/librispeech) | 85.8 | 483 | [Link](https://github.com/argmaxinc/WhisperKit/commit/228630c) |
| [small](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-small) | [3.45](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-small/librispeech) | 83 | 483 | [Link](https://github.com/argmaxinc/WhisperKit/commit/228630c) |
| [base.en](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-base.en) | [3.98](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base.en/librispeech) | 75.3 | 145 | [Link](https://github.com/argmaxinc/WhisperKit/commit/228630c) |
| [base](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-base) | [4.97](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base/librispeech) | 67.2 | 145 | [Link](https://github.com/argmaxinc/WhisperKit/commit/228630c) |
| [tiny.en](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-tiny.en) | [5.61](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny.en/librispeech) | 63.9 | 66 | [Link](https://github.com/argmaxinc/WhisperKit/commit/228630c) |
| [tiny](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-tiny) | [7.47](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny/librispeech) | 52.5 | 66 | [Link](https://github.com/argmaxinc/WhisperKit/commit/228630c) |

Dataset: `earnings22`
--------------------------------------------

Long-Form Audio (>1hr/clip) - 120 hours of earnings call recordings in English with various accents

| Model | WER (↓) | QoI (↑) | File Size (MB) | Code Commit |
| --- | --- | --- | --- | --- |
| large-v2 (WhisperOpenAIAPI) | [16.27](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2/earnings22) | 100 | 3100 | N/A |
| [large-v3](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v3) | [15.17](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3/earnings22) | 58.5 | 3100 | [Link](https://github.com/argmaxinc/WhisperKit/commit/2846fd9) |
| [distil-large-v3](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/distil-whisper_distil-large-v3) | [15.28](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/distil-whisper_distil-large-v3/earnings22) | 46.3 | 1510 | [Link](https://github.com/argmaxinc/WhisperKit/commit/508240f) |
| [base.en](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-base.en) | [23.49](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base.en/earnings22) | 6.5 | 145 | [Link](https://github.com/argmaxinc/WhisperKit/commit/dda6571) |
| [tiny.en](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-tiny.en) | [28.64](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny.en/earnings22) | 5.7 | 66 | [Link](https://github.com/argmaxinc/WhisperKit/commit/dda6571) |

### Explanation

We believe that rigorously measuring the quality of inference is necessary for developers and enterprises to make informed decisions when opting to use optimized or compressed variants of any machine learning model in production. To contextualize `WhisperKit`, we take the following Whisper implementations and benchmark them using a consistent evaluation harness:

Server-side:

*   `WhisperOpenAIAPI`: [OpenAI's Whisper API](https://platform.openai.com/docs/guides/speech-to-text)

($0.36 per hour of audio as of 02/29/24, 25MB file size limit per request)

On-device:

*   `WhisperKit`: Argmax's implementation [\[Eval Harness\]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L100) [\[Repo\]](https://github.com/argmaxinc/WhisperKit)
*   `whisper.cpp`: A C++ implementation from ggerganov [\[Eval Harness\]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L212) [\[Repo\]](https://github.com/ggerganov/whisper.cpp)
*   `WhisperMLX`: A Python implementation from Apple MLX [\[Eval Harness\]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L338) [\[Repo\]](https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py)

(All on-device implementations are available for free under MIT license as of 03/19/2024)

`WhisperOpenAIAPI` sets the reference, and we assume that it uses the equivalent of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) in float16 precision along with additional undisclosed optimizations from OpenAI. In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below), which is a stricter metric than dataset-average [Word Error Rate (WER)](https://en.wikipedia.org/wiki/Word_error_rate). A `qoi` of 100% preserves perfect backwards compatibility on the test distribution and avoids "perceived regressions": cases where known per-example behavior changes after a code or model update and causes divergence in downstream code or breaks the user experience itself (even if dataset averages stay flat across updates). Pseudocode for `qoi`:

    qoi = []
    for example in dataset:
        no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
        qoi.append(no_regression)
    qoi = (sum(qoi) / len(qoi)) * 100.
    

Note that the ordering of models with respect to `WER` does not necessarily match the ordering with respect to `QoI`. This is because the reference model is assigned a QoI of 100% by definition. Any per-example regression by another implementation is penalized, while per-example improvements are not rewarded. `QoI` (higher is better) matters where the production behavior is established by the reference results and the goal is to not regress when switching to an optimized or compressed model. On the other hand, `WER` (lower is better) matters when there is no established production behavior and one is picking the best trade-off between quality and model size.
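
A small numeric illustration of why the two orderings can disagree, using made-up per-example WER values rather than any results from the tables above. The names `model_a`, `model_b`, and `qoi_from_wers` are hypothetical, and the dataset WER is simplified here to an unweighted mean of per-example values.

```python
def mean(values):
    """Unweighted average, used here as a simplified dataset-level WER."""
    return sum(values) / len(values)


def qoi_from_wers(reference_wers, candidate_wers):
    """Percentage of examples where the candidate WER does not exceed the reference WER."""
    no_regression = [c <= r for r, c in zip(reference_wers, candidate_wers)]
    return 100.0 * sum(no_regression) / len(no_regression)


# Hypothetical per-example WERs on a four-example test set.
reference = [0.10, 0.10, 0.10, 0.10]
model_a = [0.00, 0.00, 0.12, 0.12]  # large gains on two examples, small regressions on two
model_b = [0.10, 0.10, 0.10, 0.20]  # identical on three examples, one larger regression

for name, wers in [("reference", reference), ("model_a", model_a), ("model_b", model_b)]:
    print(f"{name}: mean WER = {mean(wers):.3f}, QoI = {qoi_from_wers(reference, wers):.1f}%")
```

In this toy case `model_a` has the lowest mean WER but the lowest QoI, because its two improvements earn no credit while its two small regressions each count against it.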

We anticipate that developers who use Whisper (or similar models) in production have their own Quality Assurance test sets, and [whisperkittools](https://github.com/argmaxinc/whisperkittools) offers the tooling necessary to run the same measurements on such custom test sets. Please see [Model Evaluation on Custom Dataset](https://github.com/argmaxinc/whisperkittools) for details.

### Why are there so many Whisper versions?

WhisperKit is an SDK for building speech-to-text features in apps across a wide range of Apple devices. We are working towards abstracting away the model versioning from the developer so WhisperKit "just works" by deploying the highest-quality model version that a particular device can execute. In the interim, we leave the choice to the developer by providing quality and size trade-offs.

### Datasets

*   [librispeech](https://huggingface.co/datasets/argmaxinc/librispeech): ~5 hours of short English audio clips, tests short-form transcription quality
*   [earnings22](https://huggingface.co/datasets/argmaxinc/earnings22): ~120 hours of English audio clips from earnings calls with various accents, tests long-form transcription quality

### Reproducing Results

Benchmark results on this page were automatically generated by [whisperkittools](https://github.com/argmaxinc/whisperkittools) using our cluster of Apple Silicon Macs as self-hosted runners on GitHub Actions. We periodically recompute these benchmarks as part of our CI pipeline. Due to [security concerns](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#hardening-for-self-hosted-runners), we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to run identical [evaluation jobs](#evaluation) locally. For reference, our M2 Ultra devices complete a `librispeech` + `openai/whisper-large-v3` evaluation in under 1 hour regardless of the Whisper implementation. Even the oldest Apple Silicon Macs should take less than 1 day to complete the same evaluation.

### Glossary

*   `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription as described in our [Blog Post](https://www.takeargmax.com/blog/whisperkit).
    
*   `_*MB`: Indicates the presence of model compression. Instead of cluttering the filename with details like `_AudioEncoder-5.8bits_TextDecoder-6.1bits_QLoRA-rank=16`, we choose to summarize the compression spec as the resulting total file size since this is what matters to developers in production.
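
To see how a per-component bit-width spec relates to a total file size, here is a back-of-the-envelope sketch. It assumes file size is dominated by weights (parameters × bits ÷ 8) and that Whisper large has roughly 1.55 billion parameters, the figure OpenAI reports for its large models. The encoder/decoder parameter split below is a hypothetical placeholder, and the function `estimate_size_mb` is not part of whisperkittools; the sketch also ignores metadata and any adapter weights.

```python
def estimate_size_mb(components):
    """Estimate weight storage in MB from (parameter_count, bits_per_weight) pairs."""
    total_bits = sum(params * bits for params, bits in components.values())
    return total_bits / 8 / 1_000_000


# Whole model at float16: about 1.55B parameters at 16 bits each.
float16_model = {"all_weights": (1_550_000_000, 16)}
print(estimate_size_mb(float16_model))

# Hypothetical mixed-bit spec; the parameter split is an illustrative guess.
mixed_bit_model = {
    "AudioEncoder": (640_000_000, 5.8),
    "TextDecoder": (910_000_000, 6.1),
}
print(round(estimate_size_mb(mixed_bit_model)))
```

The float16 estimate lands on the 3100 MB listed for the uncompressed large models, which is why a single size figure is a reasonable summary of the compression spec.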

## Model overview

`whisperkit-coreml` is a set of models published by the maintainer `argmaxinc` that package the OpenAI Whisper model for on-device use. Whisper is a pre-trained automatic speech recognition (ASR) and speech translation model that has demonstrated strong performance on a wide range of speech datasets and domains.

The `whisperkit-coreml` models are Core ML versions of OpenAI Whisper (from tiny through large-v2 and large-v3) and distil-whisper models. Some variants add optimizations for streaming transcription (`_turbo`) or model compression (`_*MB`) to reduce file size. The maintainer has evaluated these models on the LibriSpeech and earnings22 datasets and provides detailed metrics, including word error rate (WER) and quality of inference (QoI).

The [whisperkit-coreml_01-30-24](https://aimodels.fyi/models/huggingFace/whisperkit-coreml01-30-24-argmaxinc) model from the same maintainer reports WER and QoI for large-v3 variants against a different reference (the float16 Core ML model rather than `WhisperOpenAIAPI`), so its QoI figures are not directly comparable with the ones reported here.

## Model inputs and outputs

### Inputs
- **Audio files**: The `whisperkit-coreml` models take audio files as input and process them to generate text transcriptions.

### Outputs
- **Text transcriptions**: The models output text transcriptions of the input audio, with the option to include or exclude timestamp information.

## Capabilities

The `whisperkit-coreml` models demonstrate strong performance on speech recognition tasks, with the `large-v3_turbo` variant achieving a WER of 2.03 on the LibriSpeech dataset. The maintainer has also measured the quality of inference (QoI), a stricter metric that counts the share of examples with no per-example regression relative to the reference (`WhisperOpenAIAPI`).

File sizes range from 66 MB (tiny) to 3100 MB (uncompressed large), with the compressed large variants at roughly 950 MB, allowing for more efficient deployment and inference. This makes the models suitable for a variety of applications that require accurate and efficient speech-to-text transcription, such as accessibility tools, content moderation, and media production.

## What can I use it for?

The `whisperkit-coreml` models can be used for a variety of speech recognition and transcription tasks, such as:

- **Accessibility**: Generating accurate transcriptions of audio content to improve accessibility for people with hearing impairments.
- **Content moderation**: Automatically transcribing audio content to enable text-based moderation and filtering.
- **Media production**: Streamlining the transcription process for audio and video content, reducing the time and effort required for tasks like captioning and subtitling.

The maintainer, [argmaxinc](https://aimodels.fyi/creators/huggingFace/argmaxinc), has made the models available through the Hugging Face Hub, allowing developers to easily integrate them into their applications and take advantage of the performance improvements.

## Things to try

One interesting aspect of the `whisperkit-coreml` models is the maintainer's focus on measuring and ensuring "quality of inference" (QoI), a more rigorous metric than just comparing average WER scores. This approach helps to identify and mitigate any potential regressions in per-example performance compared to the original Whisper model, which is crucial for real-world applications where consistency and reliability are essential.

Developers could explore using the `whisperkit-coreml` models in their own custom speech recognition pipelines and evaluate the impact of the QoI improvements on the end-user experience. Additionally, they could experiment with different compression and optimization techniques to further tune the models for their specific use cases and deployment environments.