Laion
Rank:
Average Model Cost: $0.0000
Number of Runs: 13,924,054
Models by this creator
CLIP-ViT-B-16-laion2B-s34B-b88K
CLIP-ViT-B-16-laion2B-s34B-b88K is a CLIP ViT-B/16 model trained on the LAION-2B English subset of the LAION-5B dataset. It is intended for research purposes and can be used for zero-shot image classification, image and text retrieval, image classification fine-tuning, linear-probe image classification, and image generation guiding and conditioning (a zero-shot classification sketch follows this entry). The model achieves 70.2% zero-shot top-1 accuracy on ImageNet-1k. It has not been tested or evaluated on languages other than English. The training dataset is uncurated and contains potentially disturbing content; it should be used for research purposes only, and caution is advised when accessing the links. Deployed, commercial, surveillance, facial recognition, and non-English language use cases are out of scope.
$-/run
10.9M
Huggingface
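A minimal zero-shot classification sketch with OpenCLIP, assuming the checkpoint is published on the Hugging Face Hub as laion/CLIP-ViT-B-16-laion2B-s34B-b88K and that the open_clip_torch and torch packages are installed; the image path and label prompts are placeholders.

```python
import torch
import open_clip
from PIL import Image

# The hf-hub: prefix tells OpenCLIP to pull weights and the preprocessing
# config from the Hugging Face Hub.
model_id = "hf-hub:laion/CLIP-ViT-B-16-laion2B-s34B-b88K"
model, preprocess = open_clip.create_model_from_pretrained(model_id)
tokenizer = open_clip.get_tokenizer(model_id)
model.eval()

# Placeholder image and candidate labels.
image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product becomes cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The softmax over the scaled cosine similarities yields a per-label probability for the image, which is the standard zero-shot classification recipe for CLIP models.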
CLIP-ViT-H-14-laion2B-s32B-b79K
CLIP-ViT-H-14-laion2B-s32B-b79K is a zero-shot image classification model. It is built on the Vision Transformer (ViT) architecture with a ViT-H/14 image encoder and was trained on the LAION-2B English subset of the LAION-5B dataset. The model can classify images into a wide range of categories, including categories it has not been explicitly trained on, by leveraging a contrastive learning framework that aligns images with their textual descriptions. The same shared embedding space also supports image and text retrieval (see the sketch after this entry).
$-/run
1.4M
Huggingface
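Text-to-image retrieval can be sketched with the Hugging Face transformers CLIP classes, assuming the checkpoint is available on the Hub as laion/CLIP-ViT-H-14-laion2B-s32B-b79K with transformers-compatible weights; the image files and query are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed Hub id for this checkpoint.
model_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
model.eval()

# Placeholder gallery of images and one text query; retrieval ranks the
# images by their similarity to the query in the shared embedding space.
images = [Image.open(p).convert("RGB") for p in ["img0.jpg", "img1.jpg", "img2.jpg"]]
query = "a dog playing in the snow"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_texts, num_images); higher means a better match.
scores = outputs.logits_per_text[0]
ranking = scores.argsort(descending=True)
print("best match: image index", ranking[0].item(), scores.tolist())
```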
CLIP-ViT-B-32-laion2B-s34B-b79K
CLIP-ViT-B-32-laion2B-s34B-b79K is a CLIP ViT-B/32 model trained on the LAION-2B English subset (roughly 2 billion image-text pairs) of the LAION-5B dataset. It is intended for research purposes and can be used for zero-shot image classification, image and text retrieval, image classification fine-tuning, linear-probe image classification, and image generation guiding and conditioning. The model achieves 66.6% zero-shot top-1 accuracy on ImageNet-1k. The training dataset is uncurated and may contain disturbing content, so caution is advised when using it. The model's performance has not been evaluated on languages other than English, and proper testing and evaluation are recommended before deploying it in any use case. The model was trained on the stability.ai cluster; citations should acknowledge stability.ai, the LAION-5B paper, the OpenAI CLIP paper, and the OpenCLIP software. Code snippets for getting started are provided on the model card, and a feature-extraction sketch for linear-probe classification follows this entry.
$-/run
1.2M
Huggingface
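For linear-probe image classification, the frozen model is used only as a feature extractor and a simple classifier is trained on its embeddings. A sketch under that assumption, using scikit-learn for the probe and placeholder file paths and labels:

```python
import torch
import open_clip
from PIL import Image
from sklearn.linear_model import LogisticRegression

# Assumed Hub id for this checkpoint.
model_id = "hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
model, preprocess = open_clip.create_model_from_pretrained(model_id)
model.eval()

def embed(paths):
    """Return L2-normalized CLIP image embeddings for a list of image paths."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        feats = model.encode_image(batch)
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.cpu().numpy()

# Placeholder labelled data: image paths and integer class labels.
train_paths, train_labels = ["cat0.jpg", "dog0.jpg"], [0, 1]
test_paths = ["unknown.jpg"]

# Linear probe: CLIP stays frozen; only the logistic-regression head is trained.
clf = LogisticRegression(max_iter=1000).fit(embed(train_paths), train_labels)
print(clf.predict(embed(test_paths)))
```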
CLIP-ViT-bigG-14-laion2B-39B-b160k
The CLIP-ViT-bigG-14-laion2B-39B-b160k model is a zero-shot image classification model. It combines Contrastive Language-Image Pretraining (CLIP) with a ViT-bigG/14 Vision Transformer image encoder and was trained on the LAION-2B English subset of LAION-5B. Because it aligns images with their text descriptions, it can classify images against arbitrary label sets without additional training for those specific labels.
$-/run
283.3K
Huggingface
mscoco_finetuned_CoCa-ViT-L-14-laion2B-s13B-b90k
The mscoco_finetuned_CoCa-ViT-L-14-laion2B-s13B-b90k model is a CoCa (Contrastive Captioner) computer vision model with a ViT-L/14 image tower. It was pretrained on the large-scale LAION-2B dataset and fine-tuned on MSCOCO to improve its performance on image captioning, and it generates descriptive text captions for images (see the captioning sketch after this entry).
$-/run
57.8K
Huggingface
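A captioning sketch with OpenCLIP's CoCa support; the pretrained tag below is assumed to match this checkpoint's open_clip registration (open_clip.list_pretrained() can confirm the exact name), and the image path is a placeholder.

```python
import torch
import open_clip
from PIL import Image

# CoCa caption generation; the pretrained tag is assumed, verify against
# open_clip.list_pretrained() for your installed version.
model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2b_s13b_b90k",
)
model.eval()

image = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    generated = model.generate(image)  # autoregressively decode caption tokens

# Strip the special tokens around the generated text.
caption = open_clip.decode(generated[0])
caption = caption.split("<end_of_text>")[0].replace("<start_of_text>", "").strip()
print(caption)
```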
CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k
The CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k model pairs a CLIP ViT-B/32 image tower with an XLM-RoBERTa base text tower and was trained on the LAION-5B dataset. It can be used for zero-shot image classification, image and text retrieval, and downstream tasks such as image classification fine-tuning, linear-probe image classification, and image generation guiding and conditioning. The model was trained with a batch size of 90k for 13B samples of LAION-5B. It achieves competitive results on benchmarks such as ImageNet-1k, MSCOCO, and Flickr30k, and performs well in multilingual evaluation (see the multilingual sketch after this entry). The code and resources for this model are available in the OpenCLIP GitHub repository.
$-/run
37.2K
Huggingface
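A multilingual zero-shot sketch, assuming the checkpoint is published on the Hub as laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k and that open_clip resolves its XLM-RoBERTa tokenizer via get_tokenizer; the image path and prompts are placeholders.

```python
import torch
import open_clip
from PIL import Image

# Assumed Hub id; the XLM-RoBERTa text tower lets prompts be written in
# languages other than English.
model_id = "hf-hub:laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k"
model, preprocess = open_clip.create_model_from_pretrained(model_id)
tokenizer = open_clip.get_tokenizer(model_id)
model.eval()

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
# The same concept ("a photo of a cat") in English, German, and Spanish.
labels = ["a photo of a cat", "ein Foto von einer Katze", "una foto de un gato"]
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```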
CLIP-ViT-L-14-laion2B-s32B-b82K
CLIP-ViT-L-14-laion2B-s32B-b82K is a zero-shot image classification model. It is a ViT-L/14 Vision Transformer trained with the Contrastive Language-Image Pretraining (CLIP) objective on the LAION-2B English subset of LAION-5B. The model can classify images into a wide range of categories without being specifically trained on them: it learns the relationship between images and natural language descriptions, which lets it generalize to classes it has never seen.
$-/run
33.6K
Huggingface
CLIP-ViT-L-14-DataComp.XL-s13B-b90K
The CLIP-ViT-L-14-DataComp.XL-s13B-b90K model is a zero-shot image classification model. It is based on CLIP (Contrastive Language-Image Pretraining) with a ViT-L/14 Vision Transformer backbone and was trained on the DataComp XL dataset. The model can classify images into various categories without being directly trained on those specific categories.
$-/run
33.0K
Huggingface
CLIP-ViT-g-14-laion2B-s34B-b88K
CLIP-ViT-g-14-laion2B-s34B-b88K combines a ViT-g/14 Vision Transformer image encoder with Contrastive Language-Image Pretraining (CLIP) to perform zero-shot image classification. It can classify images based on natural language descriptions without any additional training or fine-tuning for specific labels. The model was trained on the LAION-2B English subset of LAION-5B (roughly 2 billion image-text pairs) and performs well across a wide range of image classification tasks.
$-/run
23.5K
Huggingface
CLIP-convnext_base_w-laion2B-s13B-b82K-augreg
The CLIP-convnext_base_w-laion2B-s13B-b82K-augreg model belongs to a series of CLIP ConvNeXt-Base models trained on subsets of the LAION-5B dataset using OpenCLIP. These models use the timm ConvNeXt-Base model as the image tower and the same text tower as the RN50x4 model in OpenAI CLIP. They were trained on 13 billion samples with increased augmentation and regularization and reach a zero-shot top-1 accuracy of >= 70.8% on ImageNet-1k. The model can be used for zero-shot image classification, image and text retrieval, and downstream tasks such as fine-tuning and linear-probe image classification. Because the training dataset is uncurated, caution should be exercised when using the model.
$-/run
13.7K
Huggingface