SegFormer (b0-sized) model fine-tuned on ADE20k
===============================================

SegFormer model fine-tuned on ADE20k at resolution 512x512. It was introduced in the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Xie et al. and first released in [this repository](https://github.com/NVlabs/SegFormer).

Disclaimer: The team releasing SegFormer did not write a model card for this model, so this model card was written by the Hugging Face team.

Model description
-----------------

SegFormer consists of a hierarchical Transformer encoder and a lightweight all-MLP decode head, and it achieves strong results on semantic segmentation benchmarks such as ADE20K and Cityscapes. The hierarchical Transformer encoder is first pre-trained on ImageNet-1k, after which a decode head is added and the whole model is fine-tuned on a downstream dataset.

Intended uses & limitations
---------------------------

You can use the raw model for semantic segmentation. See the [model hub](https://huggingface.co/models?other=segformer) to look for fine-tuned versions on a task that interests you.

### How to use

Here is how to use this model to segment an image from the COCO 2017 dataset, assigning each pixel one of the 150 ADE20K classes:

    from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
    from PIL import Image
    import requests
    
    processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
    model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
    
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits  # shape (batch_size, num_labels, height/4, width/4)
    

For more code examples, we refer to the [documentation](https://huggingface.co/transformers/model_doc/segformer.html#).
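
The logits above are at a quarter of the input resolution, so a common next step is to upsample them to the image size and take the per-pixel argmax. The following is a minimal sketch of that step; it uses a random tensor as a stand-in for `outputs.logits` and assumes a 512x512 image, so swap in the real tensor and `image.size[::-1]` when running it after the snippet above.

```python
import torch
import torch.nn.functional as F

# Stand-in for `outputs.logits`: random scores for 150 classes at 1/4 of a
# 512x512 input. Replace with the real tensor from the model call.
logits = torch.randn(1, 150, 128, 128)

# (height, width) of the original image; with PIL this is image.size[::-1].
target_size = (512, 512)

# Upsample the class scores to the image resolution, then pick the
# highest-scoring class for every pixel.
upsampled = F.interpolate(logits, size=target_size, mode="bilinear", align_corners=False)
segmentation = upsampled.argmax(dim=1)[0]  # shape (height, width), integer class ids

print(segmentation.shape)  # torch.Size([512, 512])
```

Recent versions of Transformers also expose `processor.post_process_semantic_segmentation(outputs, target_sizes=[image.size[::-1]])`, which should perform the same resizing and argmax if your installed version provides it.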

### License

The license for this model can be found [here](https://github.com/NVlabs/SegFormer/blob/master/LICENSE).

### BibTeX entry and citation info

    @article{DBLP:journals/corr/abs-2105-15203,
      author    = {Enze Xie and
                   Wenhai Wang and
                   Zhiding Yu and
                   Anima Anandkumar and
                   Jose M. Alvarez and
                   Ping Luo},
      title     = {SegFormer: Simple and Efficient Design for Semantic Segmentation with
                   Transformers},
      journal   = {CoRR},
      volume    = {abs/2105.15203},
      year      = {2021},
      url       = {https://arxiv.org/abs/2105.15203},
      eprinttype = {arXiv},
      eprint    = {2105.15203},
      timestamp = {Wed, 02 Jun 2021 11:46:42 +0200},
      biburl    = {https://dblp.org/rec/journals/corr/abs-2105-15203.bib},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }

## Model overview

The `segformer-b0-finetuned-ade-512-512` model is a version of the SegFormer model fine-tuned on the ADE20k dataset for semantic segmentation. SegFormer is a Transformer-based architecture that combines a hierarchical Transformer encoder with a lightweight all-MLP decode head to achieve strong results on semantic segmentation benchmarks. For this particular model, the encoder was pre-trained on ImageNet-1k, and the full model was then fine-tuned on ADE20k at a resolution of 512x512.

The SegFormer architecture is similar to the Vision Transformer (ViT) in that it treats an image as a sequence of patches and uses a Transformer encoder to process them. Unlike ViT, which produces features at a single resolution, SegFormer's hierarchical encoder produces feature maps at several scales, and its decode head consists only of MLP layers, which keeps the model simple and fast compared with many traditional semantic segmentation models. The [segformer-b2-clothes](https://aimodels.fyi/models/huggingFace/segformerb2clothes-mattmdjaga) model is another example of a SegFormer variant fine-tuned for a specific task, in this case clothes segmentation.

## Model inputs and outputs

### Inputs
- **Images**: The model takes RGB images as input. The image processor resizes and normalizes them (this checkpoint was fine-tuned at 512x512), and the encoder then embeds each image as a sequence of overlapping patches that the Transformer blocks process.
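
To make the patch embedding step concrete, the sketch below works out the feature-map size after each of the four encoder stages. It assumes the kernel sizes and strides that are the defaults of `SegformerConfig` in Transformers (7/3/3/3 and 4/2/2/2, with padding of half the kernel size); the helper names are hypothetical and exist only for this illustration.

```python
from typing import List


def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a strided convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed per-stage patch embedding settings (SegformerConfig defaults).
PATCH_SIZES = [7, 3, 3, 3]
STRIDES = [4, 2, 2, 2]


def stage_resolutions(input_size: int) -> List[int]:
    """Side length of the feature map after each encoder stage."""
    sizes = []
    size = input_size
    for kernel, stride in zip(PATCH_SIZES, STRIDES):
        # The kernel is larger than the stride, so neighbouring patches overlap.
        size = conv_output_size(size, kernel, stride, padding=kernel // 2)
        sizes.append(size)
    return sizes


print(stage_resolutions(512))  # [128, 64, 32, 16]
```

Under these assumptions a 512x512 input is reduced by factors of 4, 8, 16, and 32 across the stages.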

### Outputs
- **Segmentation maps**: The model outputs per-class logits from which a segmentation map is derived, where each pixel is assigned a class label corresponding to the semantic category it belongs to (e.g., person, car, building, etc.). The raw logits are lower in resolution than the model input by a factor of 4 in each spatial dimension, so they are usually upsampled to the image size before taking the per-pixel argmax.
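
As an illustration of working with such a map, the sketch below summarizes how much of an image each class covers. The tiny 4x4 map and the four-entry label dictionary are made-up stand-ins; in practice the map would come from the model and the names from `model.config.id2label`.

```python
import numpy as np

# Hypothetical 4x4 segmentation map of integer class ids.
segmentation = np.array([
    [2, 2, 2, 2],
    [2, 2, 4, 4],
    [1, 1, 4, 4],
    [3, 3, 3, 3],
])

# Illustrative subset of an id-to-label mapping (stand-in for model.config.id2label).
id2label = {1: "building", 2: "sky", 3: "floor", 4: "tree"}

# Count pixels per class id, then convert the counts to fractions of the image.
counts = np.bincount(segmentation.ravel(), minlength=max(id2label) + 1)
fractions = {
    id2label[i]: float(counts[i]) / segmentation.size
    for i in id2label
    if counts[i] > 0
}

print(fractions)
```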

## Capabilities

The `segformer-b0-finetuned-ade-512-512` model performs semantic segmentation, which is the task of assigning a semantic label to each pixel in an image. It identifies and delineates the objects and regions in an image that belong to the ADE20K categories. As the smallest (b0) SegFormer variant, it favors speed and a small footprint over peak accuracy. This makes it a useful starting point for applications like autonomous driving, scene understanding, and image editing.

## What can I use it for?

This SegFormer model can be used for a variety of semantic segmentation tasks, such as:

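- **Autonomous Driving**: Identify and segment different objects on the road (cars, pedestrians, traffic signs, etc.) as one component of a perception pipeline. For driving scenes specifically, a SegFormer checkpoint fine-tuned on Cityscapes may be a better fit than this ADE20k one.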
- **Scene Understanding**: Understand the composition of a scene by segmenting it into different semantic regions (sky, buildings, vegetation, etc.), which can be useful for applications like robotics and augmented reality.
- **Image Editing**: Perform precise segmentation of objects in an image, allowing for selective editing, masking, and manipulation of specific elements.
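
For the image editing case, a segmentation map can be turned into a boolean mask that selects the pixels of one class. The following is a minimal sketch under made-up data: `blank_out_class` is a hypothetical helper, and the 2x3 image and class ids are placeholders for a real image and the model's predictions.

```python
import numpy as np


def blank_out_class(image: np.ndarray, segmentation: np.ndarray,
                    class_id: int, fill=(0, 0, 0)) -> np.ndarray:
    """Return a copy of `image` with every pixel of `class_id` set to `fill`."""
    mask = segmentation == class_id  # boolean array of shape (H, W)
    edited = image.copy()            # leave the caller's image untouched
    edited[mask] = fill              # fill colour is broadcast to all masked pixels
    return edited


# Hypothetical data: a uniform grey 2x3 RGB image and a matching class-id map.
image = np.full((2, 3, 3), 200, dtype=np.uint8)
segmentation = np.array([[2, 2, 1],
                         [1, 1, 1]])

edited = blank_out_class(image, segmentation, class_id=2)
print(edited[0, 0], edited[1, 2])
```

The same mask could instead be used to keep only the selected class, blur it, or composite it onto another background.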

The [model hub](https://huggingface.co/models?other=segformer) provides access to a range of SegFormer models fine-tuned on different datasets, so you can explore options that best suit your specific use case.

## Things to try

One interesting aspect of the SegFormer architecture is its hierarchical Transformer encoder, which allows it to capture features at multiple scales. This enables the model to understand the context and relationships between different semantic elements in an image, leading to more accurate and detailed segmentation.

To see this in action, you could try using the `segformer-b0-finetuned-ade-512-512` model on a diverse set of images, ranging from indoor scenes to outdoor landscapes. Observe how the model segments the various objects, textures, and regions in the images. The encoder stages do not produce segmentation maps themselves, but you can inspect how their feature maps become spatially coarser and gain channels from stage to stage, for example by passing `output_hidden_states=True` to the model.
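
One way to look at the encoder stages is sketched below. To run without downloading weights, it builds a randomly initialized model from the default `SegformerConfig`, which the Transformers documentation describes as similar to the b0 architecture, and feeds it a random tensor. The predictions are therefore meaningless; only the shapes are of interest. For real images, load the checkpoint with `from_pretrained` and use the processor output instead.

```python
import torch
from transformers import SegformerConfig, SegformerForSemanticSegmentation

# Randomly initialised model with the library's default (b0-like) configuration.
# Assumption: 150 labels, matching the ADE20k fine-tuned checkpoint.
config = SegformerConfig(num_labels=150)
model = SegformerForSemanticSegmentation(config).eval()

# Stand-in for a preprocessed 512x512 RGB image.
pixel_values = torch.randn(1, 3, 512, 512)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, output_hidden_states=True)

# One feature map per encoder stage, each of shape (batch, channels, height, width).
stage_shapes = [tuple(h.shape) for h in outputs.hidden_states]
for stage, shape in enumerate(stage_shapes, start=1):
    print(f"stage {stage}: {shape}")
```

Each stage should report a smaller spatial size and more channels than the one before it, which is the multi-scale behavior described above.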