1\. Introduction
-----------------------------------

DeepSeek-VL is an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. It has general multimodal understanding capabilities and can process logical diagrams, web pages, formulas, scientific literature, natural images, and embodied-intelligence tasks in complex scenarios.

[DeepSeek-VL: Towards Real-World Vision-Language Understanding](https://arxiv.org/abs/2403.05525)

[**Github Repository**](https://github.com/deepseek-ai/DeepSeek-VL)

Haoyu Lu\*, Wen Liu\*, Bo Zhang\*\*, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan (\*Equal Contribution, \*\*Project Lead)

[![](https://github.com/deepseek-ai/DeepSeek-VL/blob/main/images/sample.jpg)](https://github.com/deepseek-ai/DeepSeek-VL/blob/main/images/sample.jpg)

2\. Model Summary
-----------------

DeepSeek-VL-7b-base uses [SigLIP-L](https://huggingface.co/timm/ViT-L-16-SigLIP-384) and [SAM-B](https://huggingface.co/facebook/sam-vit-base) as a hybrid vision encoder that supports 1024 x 1024 image input. It is built on DeepSeek-LLM-7b-base, which was trained on a corpus of approximately 2T text tokens. The full DeepSeek-VL-7b-base model was then trained on around 400B vision-language tokens. DeepSeek-VL-7b-chat is an instruction-tuned version of [DeepSeek-VL-7b-base](https://huggingface.co/deepseek-ai/deepseek-vl-7b-base).

3\. Quick Start
---------------------------------

### Installation

In a `Python >= 3.8` environment, install the necessary dependencies by running the following commands:

    git clone https://github.com/deepseek-ai/DeepSeek-VL
    cd DeepSeek-VL
    
    pip install -e .
    

### Simple Inference Example

    import torch
    from transformers import AutoModelForCausalLM
    
    from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
    from deepseek_vl.utils.io import load_pil_images
    
    
    # specify the path to the model
    model_path = "deepseek-ai/deepseek-vl-7b-chat"
    vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
    tokenizer = vl_chat_processor.tokenizer
    
    vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
    
    conversation = [
        {
            "role": "User",
            "content": "<image_placeholder>Describe each stage of this image.",
            "images": ["./images/training_pipelines.png"]
        },
        {
            "role": "Assistant",
            "content": ""
        }
    ]
    
    # load images and prepare for inputs
    pil_images = load_pil_images(conversation)
    prepare_inputs = vl_chat_processor(
        conversations=conversation,
        images=pil_images,
        force_batchify=True
    ).to(vl_gpt.device)
    
    # run image encoder to get the image embeddings
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
    
    # run the model to get the response
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=prepare_inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        do_sample=False,
        use_cache=True
    )
    
    answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
    print(f"{prepare_inputs['sft_format'][0]}", answer)
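
The `conversation` list in the example above is plain Python data, so it can be assembled programmatically. The following is a minimal sketch, not part of `deepseek_vl`: `build_conversation` is a hypothetical helper, and it assumes that each image is referenced by one `<image_placeholder>` tag placed before the prompt text (as in the single-image example above) and that the trailing empty `Assistant` turn marks where generation begins. How several placeholders should be positioned for multi-image prompts is an assumption here.

```python
from typing import Dict, List

# Tag used in the example above to mark where an image is inserted.
IMAGE_PLACEHOLDER = "<image_placeholder>"


def build_conversation(prompt: str, image_paths: List[str]) -> List[Dict]:
    """Build a single-turn conversation in the format shown above.

    Hypothetical helper, not part of deepseek_vl. Assumes one placeholder
    tag per image, placed before the prompt text.
    """
    user_turn = {
        "role": "User",
        "content": IMAGE_PLACEHOLDER * len(image_paths) + prompt,
        "images": list(image_paths),
    }
    # The empty Assistant turn is left for the model to fill in.
    assistant_turn = {"role": "Assistant", "content": ""}
    return [user_turn, assistant_turn]


conversation = build_conversation(
    "Describe each stage of this image.",
    ["./images/training_pipelines.png"],
)
```

For a single image, the result has the same structure as the hand-written list and can be passed to `load_pil_images` and `vl_chat_processor` in the same way.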
    

### CLI Chat

    
    python cli_chat.py --model_path "deepseek-ai/deepseek-vl-7b-chat"
    
    # or local path
    python cli_chat.py --model_path "local model path"
    

4\. License
-------------------------

This code repository is licensed under [the MIT License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-CODE). Use of the DeepSeek-VL Base/Chat models is subject to the [DeepSeek Model License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-MODEL). The DeepSeek-VL series (including Base and Chat) supports commercial use.

5\. Citation
---------------------------

    @misc{lu2024deepseekvl,
          title={DeepSeek-VL: Towards Real-World Vision-Language Understanding}, 
          author={Haoyu Lu and Wen Liu and Bo Zhang and Bingxuan Wang and Kai Dong and Bo Liu and Jingxiang Sun and Tongzheng Ren and Zhuoshu Li and Hao Yang and Yaofeng Sun and Chengqi Deng and Hanwei Xu and Zhenda Xie and Chong Ruan},
          year={2024},
          eprint={2403.05525},
          archivePrefix={arXiv},
          primaryClass={cs.AI}
    }
    

6\. Contact
-------------------------

If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).

## Model overview

`deepseek-vl-7b-chat` is an instruction-tuned version of `deepseek-vl-7b-base`, an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. The base model uses [SigLIP-L](https://huggingface.co/timm/ViT-L-16-SigLIP-384) and [SAM-B](https://huggingface.co/facebook/sam-vit-base) as a hybrid vision encoder and is built on `deepseek-llm-7b-base`, which was trained on a corpus of approximately 2T text tokens. The full `deepseek-vl-7b-base` model was then trained on around 400B vision-language tokens.

Instruction tuning makes the chat model suited to interactive use across the same range of inputs: logical diagrams, web pages, formulas, scientific literature, natural images, and embodied-intelligence tasks in complex scenarios.

## Model inputs and outputs

### Inputs
- **Image**: The model can take images as input, supporting a resolution of up to 1024 x 1024.
- **Text**: The model can also take text as input, allowing for multimodal understanding and interaction.

### Outputs
- **Text**: The model can generate relevant and coherent text responses based on the provided image and/or text inputs.
- **Bounding Boxes**: The model can also output bounding boxes, enabling it to localize and identify objects or regions of interest within the input image.

## Capabilities

`deepseek-vl-7b-chat` handles tasks such as visual question answering, image captioning, and general multimodal understanding. For example, the model can describe the content of an image, answer questions about it, and output bounding boxes for relevant objects or regions.

## What can I use it for?

The `deepseek-vl-7b-chat` model can be used in a variety of real-world applications that require vision and language understanding, such as:

- **Content Moderation**: The model can be used to analyze images and text for inappropriate or harmful content.
- **Visual Assistance**: The model can help visually impaired users by describing images and answering questions about their contents.
- **Multimodal Search**: The model can be used to develop search engines that can understand and retrieve relevant information from both text and visual sources.
- **Education and Training**: The model can be used to create interactive educational materials that combine text and visuals to enhance learning.

## Things to try

One interesting thing to try with `deepseek-vl-7b-chat` is its ability to engage in multi-round conversations about images. By providing the model with an image and a series of follow-up questions or prompts, you can explore its understanding of the visual content and its ability to reason about it over time. This can be particularly useful for tasks like visual task planning, where the model needs to comprehend the scene and take multiple steps to achieve a goal.
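
A multi-round exchange can be kept as a growing list of turns. The sketch below is an illustration under stated assumptions, not the repository's API: `add_round` and `record_answer` are hypothetical helpers, and it assumes that later rounds reuse the same `User`/`Assistant` turn format as the single-turn quick-start example, with the final empty `Assistant` turn filled in after each generation.

```python
from typing import Dict, List, Optional


def add_round(history: List[Dict], question: str,
              image_paths: Optional[List[str]] = None) -> List[Dict]:
    """Return a new history with a User question and an empty Assistant turn.

    Hypothetical helper; assumes the turn format of the quick-start example.
    """
    user_turn: Dict = {"role": "User", "content": question}
    if image_paths:
        # Assumption: one placeholder tag per image, before the question text.
        user_turn["content"] = "<image_placeholder>" * len(image_paths) + question
        user_turn["images"] = list(image_paths)
    return history + [user_turn, {"role": "Assistant", "content": ""}]


def record_answer(history: List[Dict], answer: str) -> List[Dict]:
    """Return a new history whose pending Assistant turn holds the answer."""
    if not history or history[-1]["role"] != "Assistant":
        raise ValueError("history must end with a pending Assistant turn")
    return history[:-1] + [{"role": "Assistant", "content": answer}]


# Round 1: ask about the image.
history = add_round([], "What does this diagram show?",
                    ["./images/training_pipelines.png"])
# ... run the model on `history`, then store its reply (placeholder text here).
history = record_answer(history, "<model reply to round 1>")
# Round 2: a text-only follow-up that relies on the earlier turns.
history = add_round(history, "Which stage comes last?")
```

After each call to `add_round`, the history ends with an empty `Assistant` turn, so it would be fed to the processor and model in the same way as the single-turn `conversation` list, assuming the processor accepts multi-turn histories in this form.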

Another interesting aspect to explore is the model's performance on specialized tasks like formula recognition or scientific literature understanding. By providing it with relevant inputs, you can assess its capabilities in these domains and see how it compares to more specialized models.