Vision Transformer (base-sized model) trained using DINOv2
=======================================================================================================================

Vision Transformer (ViT) model trained using the DINOv2 method. It was introduced in the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Oquab et al. and first released in [this repository](https://github.com/facebookresearch/dinov2).

Disclaimer: The team releasing DINOv2 did not write a model card for this model so this model card has been written by the Hugging Face team.

Model description
---------------------------------------

The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a self-supervised fashion.

Images are presented to the model as a sequence of fixed-size patches, which are linearly embedded. A \[CLS\] token is prepended to the sequence for use in classification tasks, and absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
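
As a rough illustration of the patch sequence described above, the sketch below counts the tokens the encoder sees for a square image. The 224-pixel crop and 14-pixel patch size are assumptions about this checkpoint's defaults, not values stated in this card, and `num_tokens` is a hypothetical helper.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Sequence length for a square image: one token per patch plus [CLS]."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1  # +1 for the [CLS] token


# Assumed values: 224-pixel crop, 14-pixel patches.
print(num_tokens(224, 14))  # 16 * 16 patches + 1 = 257
```

Under these assumptions, each image becomes a sequence of 257 tokens.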

Note that this model does not include any fine-tuned heads.

Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks. For instance, if you have a dataset of labeled images, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places this linear layer on top of the \[CLS\] token, as the last hidden state of this token can be seen as a representation of an entire image.
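
The linear-layer-on-\[CLS\] idea could be sketched in PyTorch as follows. `LinearProbe` is a hypothetical name, the hidden size of 768 is an assumption about the base model, the 10 labels are arbitrary, and a random tensor stands in for the encoder output so the snippet runs without downloading weights.

```python
import torch
from torch import nn


class LinearProbe(nn.Module):
    """A single linear layer applied to the [CLS] token's last hidden state."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls_embedding = last_hidden_state[:, 0]  # (batch, hidden_size)
        return self.classifier(cls_embedding)    # (batch, num_labels)


# Random tensor standing in for the encoder output: (batch, tokens, hidden).
# 768 is assumed to be the base model's hidden size; 10 labels is arbitrary.
features = torch.randn(4, 257, 768)
probe = LinearProbe(hidden_size=768, num_labels=10)
logits = probe(features)
print(logits.shape)  # torch.Size([4, 10])
```

In a real setup the random tensor would be replaced by the encoder's `last_hidden_state`, and only the probe's parameters would be trained.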

Intended uses & limitations
----------------------------------------------------------

You can use the raw model for feature extraction. See the [model hub](https://huggingface.co/models?search=facebook/dinov2) to look for fine-tuned versions on a task that interests you.

### How to use

Here is how to use this model:

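    import requests
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    # Download an example image from the COCO validation set.
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)

    # Load the preprocessing pipeline and the pre-trained encoder.
    processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
    model = AutoModel.from_pretrained('facebook/dinov2-base')

    # Preprocess the image and run a forward pass without tracking gradients.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One hidden vector per token: shape (batch, sequence_length, hidden_size).
    last_hidden_states = outputs.last_hidden_state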
    

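To turn the per-token output into a single vector per image, one option is to take the \[CLS\] token, or to average the patch tokens. The sketch below shows both. `image_embeddings` is a hypothetical helper, and the random tensor (with an assumed shape of 257 tokens by 768 dimensions) stands in for `outputs.last_hidden_state` from the example above.

```python
import torch


def image_embeddings(last_hidden_state: torch.Tensor):
    """Return ([CLS] embedding, mean of patch embeddings) for each image."""
    cls_embedding = last_hidden_state[:, 0]   # first token is [CLS]
    patch_tokens = last_hidden_state[:, 1:]   # remaining tokens are patches
    mean_patch_embedding = patch_tokens.mean(dim=1)
    return cls_embedding, mean_patch_embedding


# Stand-in for `outputs.last_hidden_state`; the shape is an assumption.
last_hidden_states = torch.randn(1, 257, 768)
cls_embedding, mean_patch_embedding = image_embeddings(last_hidden_states)
print(cls_embedding.shape, mean_patch_embedding.shape)
```

Which of the two works better depends on the downstream task, so it is worth trying both.
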
### BibTeX entry and citation info

    @misc{oquab2023dinov2,
          title={DINOv2: Learning Robust Visual Features without Supervision},
          author={Maxime Oquab and Timothée Darcet and Théo Moutakanni and Huy Vo and Marc Szafraniec and Vasil Khalidov and Pierre Fernandez and Daniel Haziza and Francisco Massa and Alaaeldin El-Nouby and Mahmoud Assran and Nicolas Ballas and Wojciech Galuba and Russell Howes and Po-Yao Huang and Shang-Wen Li and Ishan Misra and Michael Rabbat and Vasu Sharma and Gabriel Synnaeve and Hu Xu and Hervé Jegou and Julien Mairal and Patrick Labatut and Armand Joulin and Piotr Bojanowski},
          year={2023},
          eprint={2304.07193},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
    }

## Model overview

The `dinov2-base` model is a Vision Transformer (ViT) model trained using the DINOv2 self-supervised learning method. It was developed by researchers at [Facebook](https://aimodels.fyi/creators/huggingFace/facebook). The DINOv2 method allows the model to learn robust visual features without direct supervision, by pre-training on a large collection of images. Related checkpoints include [dino-vitb16](https://aimodels.fyi/models/huggingFace/dino-vitb16-facebook), which was trained with the earlier self-supervised DINO method, and [vit-base-patch16-224-in21k](https://aimodels.fyi/models/huggingFace/vit-base-patch16-224-in21k-google), which was trained in a supervised fashion on ImageNet-21k.

## Model inputs and outputs

The `dinov2-base` model takes images as input and outputs a sequence of hidden feature representations. These features can then be used for a variety of downstream computer vision tasks, such as image classification, object detection, or visual question answering.

### Inputs
- **Images**: The model accepts images as input, which are divided into a sequence of fixed-size patches and linearly embedded.

### Outputs
- **Image feature representations**: The final output of the model is a sequence of hidden feature representations. The first corresponds to the \[CLS\] token, and each of the remaining ones corresponds to a patch in the input image. These features can be used for further processing in downstream tasks.
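
For dense tasks it can help to view the patch features as a 2D map. The sketch below assumes the first token is \[CLS\], that the remaining tokens are patches in row-major order over a square grid, and that the shapes match the assumed defaults (257 tokens, 768 dimensions). `patch_feature_map` is a hypothetical helper.

```python
import math

import torch


def patch_feature_map(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Reshape patch tokens into a (batch, hidden, height, width) map."""
    patch_tokens = last_hidden_state[:, 1:]  # drop the [CLS] token
    batch, num_patches, hidden = patch_tokens.shape
    side = math.isqrt(num_patches)
    if side * side != num_patches:
        raise ValueError("expected a square grid of patches")
    grid = patch_tokens.reshape(batch, side, side, hidden)
    return grid.permute(0, 3, 1, 2)          # channels-first layout


# Random stand-in for the encoder output; the shape is an assumption.
feature_map = patch_feature_map(torch.randn(2, 257, 768))
print(feature_map.shape)  # torch.Size([2, 768, 16, 16])
```

A channels-first map like this can be passed to convolutional heads, for example for detection or segmentation.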

## Capabilities

The `dinov2-base` model is a powerful pre-trained vision model that can be used as a feature extractor for a wide range of computer vision applications. Because it was trained in a self-supervised manner on a large dataset of images, the model has learned robust visual representations that can be effectively transferred to various tasks, even with limited labeled data.

## What can I use it for?

You can use the `dinov2-base` model for feature extraction in your computer vision projects. Feeding your images through the model and taking the final hidden representations gives you features for tasks like image classification, object detection, and visual question answering. This is particularly useful when you have a small labeled dataset and want to build on the model's pre-trained knowledge.

## Things to try

One interesting aspect of the `dinov2-base` model is its self-supervised pre-training approach, which allows it to learn visual features without the need for expensive manual labeling. You could experiment with fine-tuning the model on your own dataset, or using the pre-trained features as input to a custom downstream model. Additionally, you could compare the performance of the `dinov2-base` model to other self-supervised and supervised vision models, such as [dino-vitb16](https://aimodels.fyi/models/huggingFace/dino-vitb16-facebook) and [vit-base-patch16-224-in21k](https://aimodels.fyi/models/huggingFace/vit-base-patch16-224-in21k-google), to see how the different pre-training approaches impact performance on your specific task.
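
One lightweight way to compare encoders on your own data is a k-nearest-neighbour evaluation over frozen features. This protocol is a suggestion, not something prescribed by this card. In the sketch below, `knn_accuracy` is a hypothetical helper and the features are synthetic two-dimensional clusters; in practice you would pass features extracted by each model you want to compare.

```python
import torch
import torch.nn.functional as F


def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=5):
    """Classify each test feature by majority vote among its k nearest
    training features under cosine similarity; return the accuracy."""
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    similarity = test_feats @ train_feats.T          # (n_test, n_train)
    neighbours = similarity.topk(k, dim=1).indices   # (n_test, k)
    votes = train_labels[neighbours]                 # (n_test, k)
    predictions = votes.mode(dim=1).values           # majority label
    return (predictions == test_labels).float().mean().item()


# Synthetic stand-in features: two well-separated clusters.
torch.manual_seed(0)
centers = torch.tensor([[5.0, 0.0], [0.0, 5.0]])
train_labels = torch.tensor([0, 1] * 10)
test_labels = torch.tensor([0, 1] * 5)
train_feats = centers[train_labels] + 0.1 * torch.randn(20, 2)
test_feats = centers[test_labels] + 0.1 * torch.randn(10, 2)

accuracy = knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=3)
print(f"k-NN accuracy: {accuracy:.2f}")
```

Running the same function on features from each encoder gives a quick, training-free point of comparison before you commit to fine-tuning.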