YOLOS (tiny-sized) model
========================

YOLOS model fine-tuned on COCO 2017 object detection (118k annotated images). It was introduced in the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Fang et al. and first released in [this repository](https://github.com/hustvl/YOLOS).

Disclaimer: The team releasing YOLOS did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description
-----------------

YOLOS is a Vision Transformer (ViT) trained using the DETR loss. Despite its simplicity, a base-sized YOLOS model is able to achieve 42 AP on COCO validation 2017 (similar to DETR and more complex frameworks such as Faster R-CNN).

The model is trained using a "bipartite matching loss": one compares the predicted classes + bounding boxes of each of the N = 100 object queries to the ground truth annotations, padded up to the same length N (so if an image only contains 4 objects, 96 annotations will just have a "no object" as class and "no bounding box" as bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between each of the N queries and each of the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and generalized IoU loss (for the bounding boxes) are used to optimize the parameters of the model.
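
The matching step described above can be illustrated with a small sketch. It is not the authors' implementation: it uses exhaustive search instead of the Hungarian algorithm (both return a minimum-cost one-to-one assignment, but exhaustive search is only practical for toy sizes), and the cost is a simplified, assumed one (negative class probability plus L1 box distance). The function names and the toy values are hypothetical.

```python
from itertools import permutations


def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_queries_to_targets(pred_probs, pred_boxes, tgt_labels, tgt_boxes):
    """Return a tuple whose j-th entry is the query index assigned to target j.

    Assumed, simplified pairing cost for query q and target j:
        -pred_probs[q][tgt_labels[j]] + l1_distance(pred_boxes[q], tgt_boxes[j])
    """
    num_queries = len(pred_probs)
    num_targets = len(tgt_labels)
    best_cost, best_assignment = float("inf"), None
    # Try every way of giving each target its own distinct query.
    for candidate in permutations(range(num_queries), num_targets):
        cost = sum(
            -pred_probs[q][tgt_labels[j]] + l1_distance(pred_boxes[q], tgt_boxes[j])
            for j, q in enumerate(candidate)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, candidate
    return best_assignment


# Toy example: 3 queries, 2 object classes plus a trailing "no object" class,
# and 2 ground-truth objects. Boxes are (center_x, center_y, width, height).
pred_probs = [
    [0.1, 0.8, 0.1],  # query 0: mostly class 1
    [0.7, 0.2, 0.1],  # query 1: mostly class 0
    [0.1, 0.1, 0.8],  # query 2: mostly "no object"
]
pred_boxes = [
    [0.70, 0.70, 0.20, 0.20],
    [0.25, 0.25, 0.30, 0.30],
    [0.50, 0.50, 0.90, 0.90],
]
tgt_labels = [0, 1]
tgt_boxes = [
    [0.25, 0.25, 0.30, 0.30],
    [0.70, 0.70, 0.20, 0.20],
]

assignment = match_queries_to_targets(pred_probs, pred_boxes, tgt_labels, tgt_boxes)
# Queries that received no target are trained towards the "no object" class.
unmatched = sorted(set(range(len(pred_probs))) - set(assignment))
print("query per target:", assignment)
print("unmatched queries:", unmatched)
```

With N = 100 queries, exhaustive search is infeasible, which is why a Hungarian solver is used in practice; `scipy.optimize.linear_sum_assignment` is one widely available option.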

Intended uses & limitations
---------------------------

You can use the raw model for object detection. See the [model hub](https://huggingface.co/models?search=hustvl/yolos) to look for all available YOLOS models.

### How to use

Here is how to use this model:

    from transformers import YolosImageProcessor, YolosForObjectDetection
    from PIL import Image
    import torch
    import requests
    
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    model = YolosForObjectDetection.from_pretrained('hustvl/yolos-tiny')
    image_processor = YolosImageProcessor.from_pretrained("hustvl/yolos-tiny")
    
    inputs = image_processor(images=image, return_tensors="pt")
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model(**inputs)
    
    # model predicts bounding boxes and corresponding COCO classes
    logits = outputs.logits
    bboxes = outputs.pred_boxes
    
    
    # print results
    target_sizes = torch.tensor([image.size[::-1]])
    results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        box = [round(i, 2) for i in box.tolist()]
        print(
            f"Detected {model.config.id2label[label.item()]} with confidence "
            f"{round(score.item(), 3)} at location {box}"
        )
    

Currently, both the image processor and the model support PyTorch.

Training data
-------------

The YOLOS model was pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet2012) and fine-tuned on [COCO 2017 object detection](https://cocodataset.org/#download), a dataset consisting of 118k/5k annotated images for training/validation respectively.

### Training

The model was pre-trained for 300 epochs on ImageNet-1k and fine-tuned for 300 epochs on COCO.

Evaluation results
------------------

This model achieves an AP (average precision) of **28.7** on COCO 2017 validation. For more details on the evaluation results, see the original paper.

### BibTeX entry and citation info

    @article{DBLP:journals/corr/abs-2106-00666,
      author    = {Yuxin Fang and
                   Bencheng Liao and
                   Xinggang Wang and
                   Jiemin Fang and
                   Jiyang Qi and
                   Rui Wu and
                   Jianwei Niu and
                   Wenyu Liu},
      title     = {You Only Look at One Sequence: Rethinking Transformer in Vision through
                   Object Detection},
      journal   = {CoRR},
      volume    = {abs/2106.00666},
      year      = {2021},
      url       = {https://arxiv.org/abs/2106.00666},
      eprinttype = {arXiv},
      eprint    = {2106.00666},
      timestamp = {Fri, 29 Apr 2022 19:49:16 +0200},
      biburl    = {https://dblp.org/rec/journals/corr/abs-2106-00666.bib},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }

## Model overview

The `yolos-tiny` model is a lightweight object detection model based on the YOLOS architecture: a Vision Transformer (ViT) trained using the DETR loss. It was fine-tuned on the COCO 2017 object detection dataset, which contains 118k annotated training images, and reaches 28.7 AP on COCO 2017 validation. For reference, the larger base-sized YOLOS model achieves 42 AP on the same validation set, similar to DETR and to more complex frameworks like Faster R-CNN.

The YOLOS model is trained with a "bipartite matching loss". The predicted classes and bounding boxes of the 100 object queries are compared to the ground truth annotations, which are padded with "no object" entries up to the same length, and the Hungarian matching algorithm creates an optimal one-to-one mapping between the two. The model parameters are then optimized using standard cross-entropy loss for the classes and a linear combination of L1 and generalized IoU loss for the bounding boxes.
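
As a rough illustration of the bounding-box term, the sketch below combines an L1 distance with a generalized IoU (GIoU) loss for one matched prediction and target. It is a minimal sketch, not the training code: the function names are hypothetical, the weights `l1_weight` and `giou_weight` are assumed values rather than numbers taken from this card, and boxes are assumed to be valid, normalized `(center_x, center_y, width, height)` tuples with positive area.

```python
def cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def generalized_iou(box_a, box_b):
    """GIoU of two (x_min, y_min, x_max, y_max) boxes with positive area."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    # Overlap is zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def box_loss(pred, target, l1_weight=5.0, giou_weight=2.0):
    """Weighted sum of L1 distance and (1 - GIoU); the weights are assumptions."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    giou = generalized_iou(cxcywh_to_xyxy(pred), cxcywh_to_xyxy(target))
    return l1_weight * l1 + giou_weight * (1.0 - giou)


target = [0.50, 0.50, 0.20, 0.20]
near_loss = box_loss([0.52, 0.50, 0.20, 0.20], target)  # slightly shifted box
far_loss = box_loss([0.80, 0.80, 0.20, 0.20], target)   # non-overlapping box
print(near_loss < far_loss)
```

Unlike plain IoU, GIoU stays informative when the boxes do not overlap, because the enclosing-box term still grows as the prediction moves further from the target.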

Compared to similar models like [DETR](https://aimodels.fyi/models/huggingFace/detr-resnet-50-facebook) and [YOLO-world](https://aimodels.fyi/models/huggingFace/yolo-world-zsxkib), the `yolos-tiny` model stands out mainly for its small size, which comes at the cost of a lower AP (28.7 on COCO 2017 validation) than the base-sized YOLOS model (42 AP).

## Model inputs and outputs

### Inputs
- **Images**: The model takes images that have been preprocessed (resized and normalized) by `YolosImageProcessor`, which returns them as a `pixel_values` tensor.

### Outputs
- **Object Logits**: Class logits for each of the 100 object queries, available as `outputs.logits`.
- **Bounding Boxes**: Bounding box coordinates for each of the 100 object queries, available as `outputs.pred_boxes`.
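
To make the two outputs concrete, here is a minimal pure-Python sketch of the kind of decoding that `post_process_object_detection` performs. It is a sketch under stated assumptions, not the library code: the helper names are hypothetical, the last logit of each query is assumed to be the "no object" class, and boxes are assumed to be normalized `(center_x, center_y, width, height)` values. The toy logits and image size are made up.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def decode_queries(logits, boxes, image_width, image_height, threshold=0.9):
    """Turn per-query logits and normalized boxes into thresholded detections."""
    detections = []
    for query_logits, (cx, cy, w, h) in zip(logits, boxes):
        probs = softmax(query_logits)
        # Assumption: the last index is "no object", so it can never be the label.
        object_probs = probs[:-1]
        label = max(range(len(object_probs)), key=object_probs.__getitem__)
        score = object_probs[label]
        if score <= threshold:
            continue
        # Convert to absolute (x_min, y_min, x_max, y_max) pixel coordinates.
        box = [
            (cx - w / 2) * image_width,
            (cy - h / 2) * image_height,
            (cx + w / 2) * image_width,
            (cy + h / 2) * image_height,
        ]
        detections.append({"label": label, "score": score, "box": box})
    return detections


# Two toy queries over 2 object classes plus "no object".
toy_logits = [[8.0, 0.0, 0.0], [0.0, 0.0, 6.0]]
toy_boxes = [[0.5, 0.5, 0.2, 0.4], [0.3, 0.3, 0.1, 0.1]]
detections = decode_queries(toy_logits, toy_boxes, image_width=640, image_height=480)
for det in detections:
    print(det["label"], round(det["score"], 3), [round(v, 1) for v in det["box"]])
```

In this toy input the second query is dominated by "no object" and is dropped, which mirrors how most of the 100 queries produce no detection for a typical image.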

## Capabilities

The `yolos-tiny` model can be used for object detection in images. It detects objects from the COCO categories, including common household items, animals, and vehicles. The model's compact size makes it a candidate for latency-sensitive settings and for deployment on edge devices and mobile applications, although actual speed depends on the hardware and input resolution.

## What can I use it for?

You can use the `yolos-tiny` model for a variety of object detection tasks, such as:

- **Surveillance and security**: Detect objects of interest in frames from video feeds (tracking them across frames requires a separate component).
- **Autonomous vehicles**: Identify and localize objects like pedestrians, cars, and traffic signals to enable safe navigation.
- **Robotics and automation**: Integrate the model into robotic systems to enable interaction with and manipulation of objects in the environment.
- **Retail and inventory management**: Monitor product stocks and detect misplaced items in stores and warehouses.

See the [model hub](https://huggingface.co/models?search=hustvl/yolos) to explore other available YOLOS models that may fit your specific use case.

## Things to try

One interesting aspect of the YOLOS architecture is its use of object queries to detect objects in the image. This approach is different from traditional object detection frameworks that rely on pre-defined anchor boxes or region proposals. By directly predicting the class and bounding box for each object query, the YOLOS model can potentially be more efficient and flexible in handling a variable number of objects in an image.

You could experiment with the model's performance on different types of images, such as scenes with a large number of objects or images with significant occlusion or clutter. Evaluating the model's robustness and adaptability to diverse real-world scenarios would help understand its strengths and limitations.

Additionally, you could investigate ways to further optimize the `yolos-tiny` model for deployment on resource-constrained devices, such as by exploring model quantization or distillation techniques.