Timm

Rank:

Average Model Cost: $0.0000

Number of Runs: 4,573,829

Models by this creator

vit_large_patch14_clip_224.openai_ft_in12k_in1k

vit_large_patch14_clip_224.openai_ft_in12k_in1k is a Vision Transformer (ViT) image classification model. It was pretrained on WIT-400M image-text pairs using CLIP and then fine-tuned on ImageNet-12k followed by ImageNet-1k. It is designed for image classification and can also be used for generating image embeddings. The model has 304.2 million parameters and takes 224 x 224 input images. It is described in the papers "Learning Transferable Visual Models From Natural Language Supervision," "Reproducible scaling laws for contrastive language-image learning," and "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." The pretrained weights are available for use, performance comparisons are published in the timm model results, and citation information is provided on the model card.
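A rough usage sketch for the classification case, assuming a recent timm release; the image path is a placeholder:

```python
import timm
import torch
from PIL import Image

# Load the pretrained checkpoint (weights download from the Hugging Face Hub).
model = timm.create_model(
    'vit_large_patch14_clip_224.openai_ft_in12k_in1k', pretrained=True
)
model = model.eval()

# Build the evaluation transform the checkpoint expects (224 x 224 inputs).
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

img = Image.open('image.jpg').convert('RGB')  # 'image.jpg' is a placeholder path
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape [1, 1000]
top5 = logits.softmax(dim=-1).topk(5)
print(top5.indices, top5.values)
```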

Cost: $-/run
Runs: 1.9M
Platform: Huggingface

resnet50.a1_in1k

resnet50.a1_in1k is an image classification model based on the ResNet-B architecture. It features ReLU activations, a single layer 7x7 convolution with pooling, and a 1x1 convolution shortcut downsample. The model has been trained on the ImageNet-1k dataset using the ResNet Strikes Back A1 recipe, which includes the LAMB optimizer with BCE loss and a cosine learning rate schedule with warmup. The model has 25.6 million parameters, 4.1 GMACs, and 11.1 million activations. It is suitable for image classification tasks, feature map extraction, and image embeddings. The model's details, usage instructions, and citations can be found in the provided GitHub link.
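A minimal sketch of the feature-map-extraction use case via timm's features_only mode, with a dummy tensor standing in for a preprocessed image batch:

```python
import timm
import torch

# features_only=True builds the model as a feature-map backbone instead of a classifier.
model = timm.create_model('resnet50.a1_in1k', pretrained=True, features_only=True)
model = model.eval()

x = torch.randn(1, 3, 224, 224)  # dummy input in place of a real image batch
with torch.no_grad():
    feature_maps = model(x)

# One tensor per stage; for ResNet-50 the channel counts are
# [64, 256, 512, 1024, 2048] at strides [2, 4, 8, 16, 32].
for fmap, ch in zip(feature_maps, model.feature_info.channels()):
    print(fmap.shape, ch)
```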

Cost: $-/run
Runs: 879.2K
Platform: Huggingface

vit_base_patch16_224.augreg_in21k

The vit_base_patch16_224.augreg_in21k model is a Vision Transformer (ViT) image classification model. It has been trained on the ImageNet-21k dataset with additional augmentation and regularization techniques. The model has approximately 102.6 million parameters and performs image classification on images with a size of 224 x 224 pixels. It can also be used for generating image embeddings. The model is based on the papers "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers" and "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". The original implementation of the model is in JAX, but it has been ported to PyTorch by Ross Wightman.
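A sketch of the embedding use case: passing num_classes=0 (a standard timm option) drops the 21k-class head so the forward pass returns pooled features:

```python
import timm
import torch

# num_classes=0 removes the classifier head; the forward pass then returns
# the pooled image embedding instead of ImageNet-21k logits.
model = timm.create_model(
    'vit_base_patch16_224.augreg_in21k', pretrained=True, num_classes=0
)
model = model.eval()

x = torch.randn(1, 3, 224, 224)  # dummy input in place of a real image
with torch.no_grad():
    embedding = model(x)
print(embedding.shape)  # expected: torch.Size([1, 768]) for ViT-Base
```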

Cost: $-/run
Runs: 550.4K
Platform: Huggingface

vit_base_r50_s16_384.orig_in21k_ft_in1k

The vit_base_r50_s16_384.orig_in21k_ft_in1k model is a hybrid image classification model that pairs a ResNet-50 backbone with a Vision Transformer (ViT) at an effective stride of 16. It was pretrained on the ImageNet-21k dataset and then fine-tuned on ImageNet-1k, which covers 1,000 classes across roughly 1.28 million images. Its input size is 384 x 384, and it can be used to classify images into ImageNet categories.
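A short sketch, assuming a recent timm, showing how the 384 x 384 input requirement can be read from the model's resolved data config rather than hard-coded:

```python
import timm
import torch

model = timm.create_model(
    'vit_base_r50_s16_384.orig_in21k_ft_in1k', pretrained=True
)
model = model.eval()

# The resolved data config carries the expected input size and the
# normalization constants the checkpoint was trained with.
config = timm.data.resolve_model_data_config(model)
print(config['input_size'])  # expected: (3, 384, 384)

x = torch.randn(1, *config['input_size'])  # dummy input in place of a real image
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # expected: torch.Size([1, 1000]) ImageNet-1k classes
```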

Cost: $-/run
Runs: 145.0K
Platform: Huggingface

vit_small_patch14_dinov2.lvd142m

The vit_small_patch14_dinov2.lvd142m model is a Vision Transformer (ViT) image feature model pretrained on the LVD-142M dataset using the self-supervised DINOv2 method. It can be used for image classification tasks as well as for generating image embeddings. The model has 22.1 million parameters, 46.8 GMACs, and 198.8 million activations, and it operates on images of size 518 x 518. It is based on the DINOv2 method described in the paper "DINOv2: Learning Robust Visual Features without Supervision" and the ViT architecture described in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". The original implementation can be found in the DINOv2 GitHub repository.
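A minimal sketch of pulling unpooled tokens and a pooled embedding from this checkpoint via timm's forward_features/forward_head split, with a dummy input and a recent timm assumed:

```python
import timm
import torch

model = timm.create_model('vit_small_patch14_dinov2.lvd142m', pretrained=True)
model = model.eval()

# DINOv2 checkpoints expect 518 x 518 inputs (a 37 x 37 grid of 14 px patches).
x = torch.randn(1, 3, 518, 518)  # dummy input in place of a real image
with torch.no_grad():
    tokens = model.forward_features(x)  # unpooled tokens, class token included
    embedding = model.forward_head(tokens, pre_logits=True)  # pooled embedding
print(tokens.shape, embedding.shape)
```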

Cost: $-/run
Runs: 135.8K
Platform: Huggingface

efficientnet_b3.ra2_in1k

efficientnet_b3.ra2_in1k is an image classification model trained on the ImageNet-1k dataset. It is based on the EfficientNet architecture and uses the RandAugment RA2 recipe for data augmentation. The model has 12.2 million parameters and requires 1.6 GMACs (giga multiply-accumulate operations). It was trained with the RMSProp optimizer using an exponential-decay learning rate schedule with warmup. The model can be used for image classification tasks, feature map extraction, and image embeddings, and comparisons with other models are available in the timm model results.
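A brief sketch of selecting specific feature stages with timm's out_indices option; the 288 px input assumes this checkpoint's usual training resolution:

```python
import timm
import torch

# out_indices selects which stages to return; here the two deepest feature maps.
model = timm.create_model(
    'efficientnet_b3.ra2_in1k', pretrained=True,
    features_only=True, out_indices=(3, 4),
)
model = model.eval()

x = torch.randn(1, 3, 288, 288)  # dummy input; 288 x 288 is an assumption
with torch.no_grad():
    for fmap in model(x):
        print(fmap.shape)
```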

Cost: $-/run
Runs: 56.9K
Platform: Huggingface
