OpenAI
Rank:
Average Model Cost: $0.0018
Number of Runs: 25,374,778
Models by this creator
clip-vit-large-patch14
clip-vit-large-patch14 is a model that combines a Vision Transformer (ViT) image encoder with Contrastive Language-Image Pre-training (CLIP) to perform zero-shot image classification. Trained on a large-scale dataset of image-text pairs, it embeds images and candidate text labels into a shared space and scores how well each label matches an image. This lets it classify images into categories without any specific training on those categories, making it a powerful tool for image recognition and classification tasks.
$-/run
15.9M
Huggingface
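To show how such a model is typically used, here is a minimal zero-shot classification sketch with the Hugging Face transformers library; the image path and candidate labels are placeholders, not part of the model card.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("photo.jpg")  # placeholder path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```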
clip-vit-base-patch32
The clip-vit-base-patch32 model is a zero-shot image classification model. It is built on the CLIP (Contrastive Language-Image Pretraining) framework and uses the Vision Transformer (ViT) architecture with 32x32-pixel patches. Given an image and a set of candidate text labels, it produces a probability distribution over those labels, allowing it to generalize to classes it was never explicitly trained on.
$-/run
4.2M
Huggingface
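Because classification here reduces to comparing embeddings, the image and text encoders can also be used separately, e.g. for retrieval. A sketch assuming the same transformers API, with placeholder inputs:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_inputs = processor(images=Image.open("photo.jpg"), return_tensors="pt")
text_inputs = processor(text=["a diagram", "a photo of a dog"],
                        return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

# Normalize, then take dot products: cosine similarity in CLIP's shared space.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)
```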

whisper
Whisper is a model that converts speech in audio files into text. It is designed to transcribe spoken language from a variety of sources, such as recordings or real-time audio streams, and can be used in transcription services, voice assistants, and other speech-to-text software. Its accuracy and flexibility make it an effective way to turn spoken language into written text for downstream processing and analysis.
$0.018/run
3.7M
Replicate
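A hosted run on Replicate typically looks like the sketch below, using the replicate Python client; the input field name and the unpinned model reference are assumptions, so check the model page for the current schema.

```python
import replicate

# Assumed input schema: the hosted whisper model takes an "audio" file.
# Pinning a specific model version is recommended in production.
output = replicate.run(
    "openai/whisper",
    input={"audio": open("meeting.mp3", "rb")},  # placeholder file
)
print(output)
```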
clip-vit-base-patch16
clip-vit-base-patch16 combines image and text understanding to perform zero-shot image classification. Within the CLIP (Contrastive Language-Image Pretraining) framework, a Vision Transformer (ViT) with 16x16-pixel patches encodes the image while a text encoder embeds candidate labels; by comparing the similarity between image and text embeddings, clip-vit-base-patch16 can categorize images into various classes without specific training on those categories.
$-/run
811.1K
Huggingface
clip-vit-large-patch14-336
clip-vit-large-patch14-336 is a deep learning model that combines vision and language to perform zero-shot image classification. It is built using the CLIP (Contrastive Language-Image Pretraining) framework with a large ViT (Vision Transformer) image encoder that uses 14x14-pixel patches at a 336x336-pixel input resolution. Given an image and a set of candidate text labels, it selects the labels that best describe the image, even for classes it has never been explicitly trained on.
$-/run
232.1K
Huggingface
whisper-tiny.en
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation: a Transformer-based encoder-decoder trained on 680k hours of labelled speech data. The family comes in five sizes with English-only and multilingual variants; whisper-tiny.en is the smallest English-only checkpoint. Whisper can be fine-tuned on additional labelled data to further improve performance. While it demonstrates strong ASR and speech-translation capabilities, it can generate repetitive text and hallucinate words that were never spoken, and its accuracy is uneven across languages, accents, and demographic groups. Whisper models have positive implications for accessibility and real-time speech recognition applications, but they also raise dual-use concerns such as potential surveillance applications.
$-/run
162.4K
Huggingface
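A minimal transcription sketch using the transformers pipeline API (the audio path is a placeholder; decoding and resampling to the 16 kHz the model expects are handled for you):

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")

# Accepts a path to an audio file; returns a dict with the transcript.
result = asr("interview.wav")  # placeholder path
print(result["text"])
```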
whisper-large-v2
whisper-large-v2 is an automatic speech recognition (ASR) model that takes audio recordings as input and generates the corresponding transcriptions. It is the large-v2 checkpoint of the Whisper system, trained for more epochs with added regularization than the original large checkpoint to achieve better performance.
$-/run
147.4K
Huggingface
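Below the pipeline API, transcription with this checkpoint looks roughly like the following sketch; librosa is used here only as one way to decode audio to a 16 kHz array, and the file path is a placeholder.

```python
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Whisper expects 16 kHz mono audio; librosa resamples on load.
audio, sr = librosa.load("speech.wav", sr=16000)  # placeholder path
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```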
whisper-base
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation: a Transformer-based encoder-decoder trained on 680k hours of labelled speech data. The model is told which task to perform and which language to expect via special context tokens passed to the decoder, so the same checkpoint can handle both transcription and translation. Whisper models exhibit strong performance on ASR and speech-translation benchmarks, but they can generate repetitive text and hallucinate content that is not present in the audio input. Performance is better on high-resource languages and may be uneven across accents and dialects. The models have broader implications for accessibility tools as well as surveillance technologies, and fine-tuning can improve performance on specific tasks and languages.
$-/run
83.8K
Huggingface
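The context tokens mentioned above are decoder prompt tokens that select the language and task. A sketch of forcing French-to-English translation with this checkpoint, again assuming the transformers API and a placeholder audio file:

```python
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

audio, sr = librosa.load("french_speech.wav", sr=16000)  # placeholder path
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# Context tokens tell the decoder the source language and the task:
# here, translate French speech directly into English text.
forced_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

with torch.no_grad():
    ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```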
whisper-tiny
whisper-tiny is the smallest checkpoint in the Whisper family of automatic speech recognition (ASR) models, which convert spoken language into written text. Trained on a large dataset of labelled speech, it is designed to be lightweight and efficient, making it suitable for deployment on resource-constrained devices, though it trades some accuracy for its small size compared with the larger Whisper variants.
$-/run
81.2K
Huggingface
whisper-small
whisper-small is an automatic speech recognition (ASR) model that converts spoken language into written text. It sits between the tiny and base checkpoints and the larger variants, balancing transcription accuracy against latency and memory use on resource-constrained devices. It was trained on a large amount of multilingual and multitask supervised data, and its performance is typically evaluated with error-rate metrics such as word error rate (WER) or, for some languages, character error rate (CER).
$-/run
44.6K
Huggingface
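Error-rate metrics like WER compare model transcripts against reference transcripts. A small sketch using the Hugging Face evaluate library, with made-up strings standing in for real model output:

```python
import evaluate

wer_metric = evaluate.load("wer")

# Hypothetical transcripts and ground-truth references, for illustration only.
predictions = ["the cat sat on the mat", "whisper is an asr model"]
references = ["the cat sat on a mat", "whisper is an asr model"]

# WER = (substitutions + insertions + deletions) / number of reference words.
print(f"WER: {wer_metric.compute(predictions=predictions, references=references):.3f}")
```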