Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.

## Overview

- Transformers have emerged as a powerful tool for learning visual representations
- Researchers identified and characterized artifacts in feature maps of both supervised and self-supervised Vision Transformer (ViT) networks
- These artifacts correspond to high-norm tokens appearing in low-informative background areas during inference, which are repurposed for internal computations
- The researchers propose a simple solution to fix this issue by providing additional tokens to the input sequence of the Vision Transformer

## Plain English Explanation

Transformers are a type of machine learning model that has proven very effective at learning visual representations from images. In this research paper, the authors identify and describe "artifacts" that they found in the internal feature maps of both supervised and self-supervised [Vision Transformer](https://aimodels.fyi/papers/arxiv/vision-transformers-domain-adaptation-generalization-study-robustness) (ViT) networks.

These artifacts appear as high-norm tokens, that is, token vectors with an unusually large magnitude. They show up mostly in low-information background areas of the image rather than on the main objects, and the model appears to repurpose them for its own internal computations. The researchers propose a simple fix: by adding extra "filler" tokens to the input sequence of the Vision Transformer, they give the model a dedicated place for those computations, which eliminates the artifacts for both supervised and self-supervised models.

This solution not only fixes the issue, but also [sets a new state of the art](https://aimodels.fyi/papers/arxiv/enhancing-efficiency-vision-transformer-networks-design-techniques) for self-supervised visual models on dense visual prediction tasks. It also enables [object discovery methods](https://aimodels.fyi/papers/arxiv/understanding-video-transformers-via-universal-concept-discovery) to work with larger models, and leads to smoother feature maps and attention maps for downstream visual processing.

## Technical Explanation

The researchers conducted an in-depth analysis of the feature maps and internal representations learned by both supervised and self-supervised Vision Transformer (ViT) models. They identified the presence of high-norm tokens in the feature maps, particularly in low-informative background areas of the input images, which were being repurposed by the models for internal computations.
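To make "high-norm tokens" concrete, the following is a minimal sketch of how such tokens could be flagged in a feature map. It is not the authors' procedure: the function names, the toy data, and the `ratio` cutoff (a multiple of the median norm) are all assumptions made for illustration.

```python
import math
import statistics


def token_norms(tokens):
    """Return the L2 norm of each token (each token is a list of floats)."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def find_outlier_tokens(tokens, ratio=3.0):
    """Return indices of tokens whose norm exceeds `ratio` times the median norm.

    `ratio` is a hypothetical cutoff chosen for illustration only.
    """
    norms = token_norms(tokens)
    median = statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > ratio * median]


# Toy feature map: four ordinary patch tokens and one artifact-like token.
patch_tokens = [
    [0.5, -0.2, 0.1],
    [0.3, 0.4, -0.1],
    [9.0, -7.5, 8.2],  # much larger norm than its neighbours
    [-0.2, 0.1, 0.6],
    [0.4, 0.3, 0.2],
]

outliers = find_outlier_tokens(patch_tokens)
print(outliers)
```

In a real model the tokens would be the patch outputs of a ViT layer, and the flagged indices could be mapped back to image positions to check whether they fall on background regions.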

To address this issue, the researchers proposed a simple yet effective solution: they added extra "filler" tokens to the input sequence of the ViT models. These tokens give the model dedicated slots for its internal computations, so it no longer needs to repurpose patch tokens from background regions. This approach was shown to fix the artifact problem entirely for both supervised and self-supervised ViT models.
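The idea of extending the input sequence with extra tokens can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's implementation: the helper names are hypothetical, the encoder is replaced by an identity stand-in, the placement of the extra tokens right after the class token is arbitrary, and dropping the extra tokens from the output is assumed here so that downstream code sees the usual class and patch outputs.

```python
def build_input_sequence(cls_token, patch_tokens, extra_tokens):
    """Concatenate the class token, the extra tokens, and the patch tokens."""
    return [cls_token] + list(extra_tokens) + list(patch_tokens)


def split_output_sequence(output_tokens, num_extra):
    """Return (cls_output, patch_outputs), dropping the extra tokens."""
    cls_output = output_tokens[0]
    patch_outputs = output_tokens[1 + num_extra:]
    return cls_output, patch_outputs


def identity_encoder(tokens):
    """Hypothetical stand-in for the transformer blocks; returns a copy."""
    return [list(tok) for tok in tokens]


dim = 4
cls_token = [0.0] * dim
patch_tokens = [[float(i)] * dim for i in range(1, 7)]  # 6 toy patch embeddings
extra_tokens = [[-1.0] * dim, [-2.0] * dim]  # 2 extra tokens (learned in a real model)

sequence = build_input_sequence(cls_token, patch_tokens, extra_tokens)
outputs = identity_encoder(sequence)
cls_output, patch_outputs = split_output_sequence(outputs, num_extra=len(extra_tokens))

print(len(sequence), len(patch_outputs))
```

In a trained model the extra tokens would be learnable parameters shared across images, and every attention layer would let patch tokens read from and write to them.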

Furthermore, the researchers demonstrated that this solution [improves ViT models](https://aimodels.fyi/papers/arxiv/fastervit-fast-vision-transformers-hierarchical-attention) beyond removing the artifacts: it sets a new state of the art for self-supervised visual models on dense visual prediction tasks. It also enables [object discovery methods](https://aimodels.fyi/papers/arxiv/understanding-video-transformers-via-universal-concept-discovery) to be used with larger ViT models, and leads to smoother feature maps and attention maps for downstream visual processing tasks.

## Critical Analysis

The researchers acknowledge that while their solution is simple and effective, it does not address the underlying cause of the artifact issue in ViT models. They suggest that further research is needed to understand the [learning dynamics and correlation structures](https://aimodels.fyi/papers/arxiv/learning-correlation-structures-vision-transformers) that lead to these artifacts in the first place.

Additionally, while the proposed solution fixes the artifact problem, it does not necessarily guarantee that the internal representations learned by the models will be optimal for all downstream tasks. There may be other limitations or tradeoffs that were not explored in this study, and further investigation into the impact of the solution on a wider range of applications would be valuable.
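One tradeoff that can be estimated directly is compute: self-attention compares every token with every other token, so a longer sequence means more pairwise comparisons. The sketch below works through that arithmetic. The numbers are assumptions for illustration only (196 patches, as for a 224x224 image with 16x16 patches, one class token, and 4 extra tokens), and the function names are hypothetical.

```python
def attention_pairs(num_tokens):
    """Number of token-to-token comparisons in one self-attention map."""
    return num_tokens * num_tokens


def relative_overhead(num_patches, num_extra, num_cls=1):
    """Fractional increase in attention comparisons from adding extra tokens."""
    base = num_patches + num_cls
    extended = base + num_extra
    return attention_pairs(extended) / attention_pairs(base) - 1.0


# 197 tokens without extra tokens, 201 tokens with 4 extra tokens.
overhead = relative_overhead(num_patches=196, num_extra=4)
print(f"{overhead:.2%}")
```

Under these assumptions the attention maps grow by roughly 4%, which covers only the attention comparisons and not the other layers of the network.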

Overall, this research provides an important contribution to the understanding and improvement of Vision Transformer models, but there is still room for further exploration and refinement of the techniques used to enhance the efficiency and effectiveness of these powerful visual learning models.

## Conclusion

This research paper identifies and addresses a significant issue in the internal representations learned by Vision Transformer models, both supervised and self-supervised. By adding extra "filler" tokens to the input sequence, the researchers were able to eliminate the artifacts in the feature maps, leading to improved performance on dense visual prediction tasks and enabling object discovery methods with larger models.

The implications of this work extend beyond the specific models and tasks studied, as it highlights the importance of carefully examining the internal workings of complex machine learning models to identify and address potential issues. As the field of computer vision continues to advance, solutions like the one presented in this paper will be crucial for developing more robust and reliable visual AI systems.