Face Parsing
=============================

[![example image and output](/jonathandinu/face-parsing/resolve/main/demo.png)](/jonathandinu/face-parsing/blob/main/demo.png)

[Semantic segmentation](https://huggingface.co/docs/transformers/tasks/semantic_segmentation) model fine-tuned from [nvidia/mit-b5](https://huggingface.co/nvidia/mit-b5) with [CelebAMask-HQ](https://github.com/switchablenorms/CelebAMask-HQ) for face parsing. For additional options, see the Transformers [Segformer docs](https://huggingface.co/docs/transformers/model_doc/segformer).

> ONNX model for web inference contributed by [Xenova](https://huggingface.co/Xenova).

Usage in Python
-----------------------------------

The full list of labels is defined in the model's [config.json](https://huggingface.co/jonathandinu/face-parsing/blob/65972ac96180b397f86fda0980bbe68e6ee01b8f/config.json#L30) and is reproduced below.

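| id | label | note |
| --- | --- | --- |
| 0 | `background` | |
| 1 | `skin` | |
| 2 | `nose` | |
| 3 | `eye_g` | eyeglasses |
| 4 | `l_eye` | left eye |
| 5 | `r_eye` | right eye |
| 6 | `l_brow` | left eyebrow |
| 7 | `r_brow` | right eyebrow |
| 8 | `l_ear` | left ear |
| 9 | `r_ear` | right ear |
| 10 | `mouth` | area between lips |
| 11 | `u_lip` | upper lip |
| 12 | `l_lip` | lower lip |
| 13 | `hair` | |
| 14 | `hat` | |
| 15 | `ear_r` | earring |
| 16 | `neck_l` | necklace |
| 17 | `neck` | |
| 18 | `cloth` | clothing |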

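For post-processing in Python it can be handy to have the table as a dictionary. The sketch below simply transcribes the table above and assumes it matches the checkpoint's `config.json`; once the model is loaded, `model.config.id2label` gives the same mapping straight from the config and is the safer source.

```python
# Label ids transcribed from the table above. Assumption: this matches the
# checkpoint's config.json; prefer model.config.id2label when the model is loaded.
ID2LABEL = {
    0: "background", 1: "skin", 2: "nose", 3: "eye_g", 4: "l_eye",
    5: "r_eye", 6: "l_brow", 7: "r_brow", 8: "l_ear", 9: "r_ear",
    10: "mouth", 11: "u_lip", 12: "l_lip", 13: "hair", 14: "hat",
    15: "ear_r", 16: "neck_l", 17: "neck", 18: "cloth",
}

# Reverse lookup from label name to class id
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

print(len(ID2LABEL), LABEL2ID["hair"])  # 19 13
```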
    import torch
    from torch import nn
    from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
    
    from PIL import Image
    import matplotlib.pyplot as plt
    import requests
    
    # convenience expression for automatically determining device
    device = (
        "cuda"
        # Device for NVIDIA or AMD GPUs
        if torch.cuda.is_available()
        else "mps"
        # Device for Apple Silicon (Metal Performance Shaders)
        if torch.backends.mps.is_available()
        else "cpu"
    )
    
    # load models
    image_processor = SegformerImageProcessor.from_pretrained("jonathandinu/face-parsing")
    model = SegformerForSemanticSegmentation.from_pretrained("jonathandinu/face-parsing")
    model.to(device)
    
    # expects a PIL.Image or torch.Tensor
    url = "https://images.unsplash.com/photo-1539571696357-5a69c17a67c6"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    
    # run inference on image (gradients are not needed, which saves memory)
    inputs = image_processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits  # shape (batch_size, num_labels, ~height/4, ~width/4)
    
    # resize output to match input image dimensions
    upsampled_logits = nn.functional.interpolate(logits,
                    size=image.size[::-1], # H x W
                    mode='bilinear',
                    align_corners=False)
    
    # get label masks
    labels = upsampled_logits.argmax(dim=1)[0]
    
    # move to CPU to visualize in matplotlib
    labels_viz = labels.cpu().numpy()
    plt.imshow(labels_viz)
    plt.show()
    

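The `labels_viz` array above holds one class id per pixel. To work with individual parts, similar to the one-mask-per-label output of the Transformers.js pipeline, you can split it into boolean masks. This is a minimal NumPy sketch: the small hand-written `label_map` stands in for `labels_viz`, and `split_masks` is a hypothetical helper, not a Transformers API.

```python
import numpy as np

# Subset of the label table, enough for this demo (assumed to match config.json)
ID2LABEL = {0: "background", 1: "skin", 13: "hair"}


def split_masks(label_map: np.ndarray, id2label: dict) -> list:
    """Split an (H, W) array of class ids into one boolean mask per label present."""
    results = []
    for class_id in np.unique(label_map):  # sorted ids that actually occur
        class_id = int(class_id)
        mask = label_map == class_id  # True where the pixel belongs to this class
        results.append({
            "label": id2label.get(class_id, str(class_id)),
            "mask": mask,
            "coverage": float(mask.mean()),  # fraction of image pixels
        })
    return results


# Stand-in for labels_viz: a tiny 2x4 label map
label_map = np.array([[0, 13, 13, 0],
                      [0, 1, 1, 0]])
for part in split_masks(label_map, ID2LABEL):
    print(part["label"], part["coverage"])
```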
Usage in the browser (Transformers.js)
------------------------------------------------------------------------------

    // run inside a <script type="module"> so that import and top-level await work
    import {
      pipeline,
      env,
    } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.14.0";
    
    // important to prevent errors since the model files are likely remote on HF hub
    env.allowLocalModels = false;
    
    // instantiate image segmentation pipeline with pretrained face parsing model
    const model = await pipeline("image-segmentation", "jonathandinu/face-parsing");
    
    // async inference since it could take a few seconds
    const url = "https://images.unsplash.com/photo-1539571696357-5a69c17a67c6";
    const output = await model(url);
    
    // each label is a separate mask object
    // [
    //   { score: null, label: 'background', mask: transformers.js RawImage { ... }}
    //   { score: null, label: 'hair', mask: transformers.js RawImage { ... }}
    //    ...
    // ]
    for (const m of output) {
      console.log(`Found ${m.label}`);
      m.mask.save(`${m.label}.png`);
    }
    

### p5.js

Since [p5.js](https://p5js.org/) uses an animation loop abstraction, we need to take care when loading the model and making predictions.

    // ...
    
    // asynchronously load transformers.js and instantiate model
    async function preload() {
      // load transformers.js library with a dynamic import
      const { pipeline, env } = await import(
        "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.14.0"
      );
    
      // important to prevent errors since the model files are remote on HF hub
      env.allowLocalModels = false;
    
      // instantiate image segmentation pipeline with pretrained face parsing model
      model = await pipeline("image-segmentation", "jonathandinu/face-parsing");
    
      print("face-parsing model loaded");
    }
    
    // ...
    

[full p5.js example](https://editor.p5js.org/jonathan.ai/sketches/wZn15Dvgh)

### Model Description

*   **Developed by:** [Jonathan Dinu](https://twitter.com/jonathandinu)
*   **Model type:** Transformer-based semantic segmentation image model
*   **License:** non-commercial research and educational purposes
*   **Resources for more information:** Transformers docs on [Segformer](https://huggingface.co/docs/transformers/model_doc/segformer) and/or the [original research paper](https://arxiv.org/abs/2105.15203).

Limitations and Bias
---------------------------------------------

### Bias

While the capabilities of computer vision models are impressive, they can also reinforce or exacerbate social biases. The [CelebAMask-HQ](https://github.com/switchablenorms/CelebAMask-HQ) dataset used for fine-tuning is large but not necessarily diverse or representative. Moreover, its images are exclusively of celebrities.

## Model overview

The `face-parsing` model is a semantic segmentation model fine-tuned from `nvidia/mit-b5` on the [CelebAMask-HQ](https://github.com/switchablenorms/CelebAMask-HQ) dataset for face parsing. It assigns every pixel one of 19 labels: background plus 18 face and accessory regions, including skin, nose, eyes, eyeglasses, eyebrows, ears, mouth, lips, hair, hat, earring, necklace, neck, and clothing. This makes it useful for applications such as virtual makeup, face editing, and facial analysis.

Similar models include the [segformer_b2_clothes](https://aimodels.fyi/models/huggingFace/segformer-b2-clothes-mattmdjaga) model, which is fine-tuned for clothes segmentation, and the [segformer-b0-finetuned-ade-512-512](https://aimodels.fyi/models/huggingFace/segformer-b0-finetuned-ade-512-512-nvidia) model, which is a SegFormer model fine-tuned on the ADE20k dataset for general semantic segmentation.

## Model inputs and outputs

### Inputs
- **Image**: The model takes a single RGB image as input. In Python, `SegformerImageProcessor` accepts a `PIL.Image`, NumPy array, or `torch.Tensor`; the Transformers.js pipeline also accepts a URL pointing to an image.

### Outputs
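- **Logits**: The model outputs per-class logits of shape `(batch_size, num_labels, height/4, width/4)`, where `num_labels` is 19 (18 parts plus background) and `height`/`width` are the dimensions of the preprocessed image. Upsampling the logits to the original image size and taking the argmax over the label dimension yields the segmentation mask, as in the Python example above.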

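As a worked example of the output shape: SegFormer's decode head predicts at the resolution of the first MiT encoder stage, whose overlapping patch embedding is a 7×7 convolution with stride 4 and padding 3, so each spatial side becomes `floor((side + 2*3 - 7) / 4) + 1`, which equals `ceil(side / 4)`. The sketch below computes the expected logits shape; the 512×512 input is an assumption based on `SegformerImageProcessor`'s default resize and may differ if the checkpoint's preprocessor config overrides it.

```python
def segformer_logits_shape(batch_size: int, height: int, width: int,
                           num_labels: int = 19) -> tuple:
    """Expected shape of outputs.logits for a preprocessed input of height x width."""

    def stage1(side: int) -> int:
        # Conv output size with kernel 7, stride 4, padding 3 (== ceil(side / 4))
        return (side + 2 * 3 - 7) // 4 + 1

    return (batch_size, num_labels, stage1(height), stage1(width))


# Assumption: the image processor resizes inputs to 512x512 (its default)
print(segformer_logits_shape(1, 512, 512))  # (1, 19, 128, 128)
```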
## Capabilities

Because every pixel receives one of 19 labels, the model separates fine-grained regions such as the left and right eyes, the upper and lower lips, and eyeglasses as a distinct class. CelebAMask-HQ contains high-quality face images with a range of poses, expressions, and occlusions (for example, from hair, hats, and eyeglasses), so the model should handle some of this variation, although it is likely to work best on portrait-style photos similar to its training data.

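Because the labels are fine-grained (left vs. right eye, upper vs. lower lip), many applications merge them into coarser groups. A minimal sketch using a NumPy lookup table; the group names and their ids (`eyes`, `brows`, `ears`, `lips`) are hypothetical choices for illustration, not part of the model.

```python
import numpy as np

# Hypothetical coarse groups built from the fine-grained class ids in the label table
GROUPS = {
    "eyes": [4, 5],    # l_eye, r_eye
    "brows": [6, 7],   # l_brow, r_brow
    "ears": [8, 9],    # l_ear, r_ear
    "lips": [11, 12],  # u_lip, l_lip
}
NUM_LABELS = 19  # ids 0..18, including background


def build_lut(groups: dict) -> np.ndarray:
    """Lookup table mapping each fine class id to a coarse group id (0 = other)."""
    lut = np.zeros(NUM_LABELS, dtype=np.int64)
    for group_id, ids in enumerate(groups.values(), start=1):
        lut[ids] = group_id
    return lut


lut = build_lut(GROUPS)
label_map = np.array([[4, 5, 1],
                      [11, 12, 13]])
coarse = lut[label_map]  # fancy indexing remaps every pixel at once
print(coarse)
```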
## What can I use it for?

The `face-parsing` model can be used for a variety of applications, such as:

- **Virtual makeup**: By segmenting the face into different parts, the model can be used to apply virtual makeup or other cosmetic effects to specific regions of the face.

- **Face editing**: The segmentation masks can be used to selectively edit or manipulate different parts of the face, such as changing the hair color or adding accessories.

- **Facial analysis**: The segmentation masks can be used to extract detailed information about the structure and appearance of the face, which can be useful for applications such as facial recognition, emotion analysis, or age estimation.

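As an example of region-specific editing, the sketch below alpha-blends a color into the lip pixels (`u_lip` = 11 and `l_lip` = 12 in the label table). It assumes an RGB image as a `uint8` NumPy array and a label map with the same height and width (for instance, `labels_viz` from the Python example); `tint_regions` is a hypothetical helper, and plain alpha blending is only one of many ways to apply such an effect.

```python
import numpy as np


def tint_regions(image: np.ndarray, label_map: np.ndarray, class_ids,
                 color=(200, 30, 80), alpha: float = 0.5) -> np.ndarray:
    """Alpha-blend `color` into pixels whose class id is in `class_ids`.

    image: (H, W, 3) uint8 RGB array; label_map: (H, W) array of class ids.
    """
    mask = np.isin(label_map, class_ids)  # (H, W) boolean region mask
    out = image.astype(np.float32)        # work in float to avoid uint8 overflow
    tint = np.array(color, dtype=np.float32)
    out[mask] = (1.0 - alpha) * out[mask] + alpha * tint
    return np.clip(np.round(out), 0, 255).astype(np.uint8)


# Tiny synthetic example: a 2x2 gray image whose bottom row is labeled as lips
image = np.full((2, 2, 3), 100, dtype=np.uint8)
label_map = np.array([[1, 1],
                      [11, 12]])
result = tint_regions(image, label_map, class_ids=[11, 12])
print(result[1, 0], result[0, 0])
```

Swapping `class_ids` for `[13]` (hair) gives a crude hair-recoloring effect along the same lines.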
## Things to try

One interesting thing to try with the `face-parsing` model is to use it in combination with other computer vision models for more advanced facial analysis or manipulation tasks. For example, you could use the segmentation masks to guide the application of facial landmarks or facial expression recognition, or to selectively apply style transfer or image synthesis techniques to different parts of the face.

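One simple way to use the masks to guide another model is to crop the face region before passing it on, for example to a landmark or expression model. The sketch below computes a bounding box over a chosen set of class ids; which ids count as "face" (skin, nose, eyes, brows, mouth, lips) is an assumption, and `region_bbox` is a hypothetical helper.

```python
import numpy as np

# Assumed definition of the "face" region: skin, nose, eyes, brows, mouth, lips
FACE_IDS = [1, 2, 4, 5, 6, 7, 10, 11, 12]


def region_bbox(label_map: np.ndarray, class_ids, margin: int = 0):
    """Return (top, left, bottom, right) with bottom/right exclusive, covering the
    pixels whose class id is in class_ids, or None if no such pixel exists."""
    rows, cols = np.nonzero(np.isin(label_map, class_ids))
    if rows.size == 0:
        return None
    h, w = label_map.shape
    top = max(int(rows.min()) - margin, 0)
    left = max(int(cols.min()) - margin, 0)
    bottom = min(int(rows.max()) + 1 + margin, h)
    right = min(int(cols.max()) + 1 + margin, w)
    return top, left, bottom, right


label_map = np.array([[0, 13, 13, 0],
                      [0, 1, 2, 0],
                      [0, 11, 12, 0],
                      [17, 17, 17, 17]])
top, left, bottom, right = region_bbox(label_map, FACE_IDS)
crop = label_map[top:bottom, left:right]  # the same slice works on the RGB image
print((top, left, bottom, right), crop.shape)
```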
Another interesting direction to explore would be to fine-tune the model on different datasets or tasks, such as parsing faces in different cultural or demographic contexts, or extending the model to segment additional facial features or attributes.
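
If you extend the model with additional labels before fine-tuning, each new class needs an id in both `id2label` and `label2id`. A sketch assuming new classes are appended after the existing 19 ids; `beard` is a purely hypothetical extra label. The resulting dictionaries could then be passed to `SegformerForSemanticSegmentation.from_pretrained(...)` together with `ignore_mismatched_sizes=True`, so that the classification layer is re-initialized for the new label count; the fine-tuning loop itself is out of scope here.

```python
# Label names in id order (0..18), from the label table; in practice use
# model.config.id2label from the loaded checkpoint instead.
BASE_LABELS = ["background", "skin", "nose", "eye_g", "l_eye", "r_eye", "l_brow",
               "r_brow", "l_ear", "r_ear", "mouth", "u_lip", "l_lip", "hair",
               "hat", "ear_r", "neck_l", "neck", "cloth"]


def extend_labels(id2label: dict, new_labels: list) -> tuple:
    """Append new class names after the existing ids; return (id2label, label2id)."""
    extended = dict(id2label)
    next_id = max(extended) + 1
    for name in new_labels:
        if name in extended.values():
            raise ValueError(f"label {name!r} already exists")
        extended[next_id] = name
        next_id += 1
    label2id = {name: idx for idx, name in extended.items()}
    return extended, label2id


# "beard" is a hypothetical new label used only for illustration
id2label, label2id = extend_labels(dict(enumerate(BASE_LABELS)), ["beard"])
print(len(id2label), label2id["beard"])  # 20 19
```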