Multi-crop LLaVA-3b
===================

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1W7JQrFXwFunAY1XvS31mwC7mrXBgGD_M)

Model details
-------------

Usually, in LLaVA models, we generate N embeddings for the image, which we then combine with text embeddings and send to the LLM. But what if instead of creating N tokens for one image, we create K<<N tokens for M<N parts of the image (crops)? It would allow us to get visual information from small parts of the image and not inflate the number of image "tokens" too much. I called this method multi-crop LLaVA (MC-LLaVA).
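
The arithmetic behind this idea can be sketched in a few lines. This is only an illustration: the uniform grid, the helper names `grid_crops` and `total_image_tokens`, and the numbers are assumptions made for the example, and the way MC-LLaVA actually selects crops may differ (see the blog post linked below).

```python
from typing import List, Tuple

# (left, upper, right, lower), the box convention used by PIL's Image.crop
Box = Tuple[int, int, int, int]


def grid_crops(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height image into rows * cols non-overlapping boxes."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            upper = r * height // rows
            right = (c + 1) * width // cols
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


def total_image_tokens(num_crops: int, tokens_per_crop: int) -> int:
    """Image tokens sent to the LLM when each of M crops yields K tokens."""
    return num_crops * tokens_per_crop


# Hypothetical setting: M = 12 crops, K = 16 tokens per crop.
boxes = grid_crops(width=1024, height=768, rows=3, cols=4)
print(len(boxes), boxes[0], boxes[-1])
print(total_image_tokens(num_crops=len(boxes), tokens_per_crop=16))
```

The total token count grows as M times K, so keeping K small is what lets the number of crops grow without inflating the sequence passed to the LLM.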

You can read more about the model in the [blog post](https://huggingface.co/blog/visheratin/vlm-resolution-curse).

MC-LLaVA-3b was fine-tuned from a [Phi-2 merge](https://huggingface.co/vince62s/phi-2-psy) using the vision tower from [SigLIP 400M](https://huggingface.co/google/siglip-so400m-patch14-384).

Like Dolphin 2.6 Phi, MC-LLaVA-3b uses the ChatML prompt format:

    <|im_start|>user
    {prompt}<|im_end|>
    <|im_start|>assistant
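
A small helper can fill in this template for a single user turn. This is a minimal sketch: the function name `build_chatml_prompt` is hypothetical, it covers only one turn, and it does not add any image placeholder the processor might expect inside the prompt.

```python
def build_chatml_prompt(user_message: str) -> str:
    """Wrap one user message in the ChatML template shown above."""
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_chatml_prompt("Describe the image.")
print(prompt)
```

The string ends with the opening of the assistant turn, so the model's generated text is the assistant's reply.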
    

How to use
----------

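    from PIL import Image
    from transformers import AutoModel, AutoProcessor
    import torch
    
    # trust_remote_code=True lets transformers run the custom code shipped in the model repo.
    model = AutoModel.from_pretrained("visheratin/MC-LLaVA-3b", torch_dtype=torch.float16, trust_remote_code=True).to("cuda")
    
    processor = AutoProcessor.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)
    
    # Placeholder inputs (assumptions, not part of the original snippet): use your own image and question.
    # The prompt follows the ChatML format above; check the linked Colab for the exact prompt the processor expects.
    raw_image = Image.open("image.jpg")
    prompt = "<|im_start|>user\nDescribe the image.<|im_end|>\n<|im_start|>assistant\n"
    
    with torch.inference_mode():
        # The processor takes the prompt, a list of images, and the model itself.
        inputs = processor(prompt, [raw_image], model, max_crops=100, num_tokens=728)
        # Greedy decoding (do_sample=False), stopping at the tokenizer's EOS token.
        output = model.generate(**inputs, max_new_tokens=200, use_cache=True, do_sample=False,
            eos_token_id=processor.tokenizer.eos_token_id, pad_token_id=processor.tokenizer.eos_token_id)
    
    # Strip the prompt and the ChatML end marker from the decoded text.
    result = processor.tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", "")
    print(result)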
    

Benchmarks
----------

*   TextVQA - 50.9%
*   GQA - 59.5%
*   VQAv2 - 76.72%
*   VizWiz - 32.68%
*   V\*-bench - OCR - 56.66%, GPT4V-hard - 52.94%, direct attributes - 40.86%, relative position - 56.57%

Examples
--------

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sXDvVl5s9fTcE0N2bQGOlXhnNlKEdeun)

License
-------

The model is released under the MIT license. However, because the data used for training is largely synthetic, you should also follow the OpenAI and Google Gemini terms of service, which means you should not use it to create models that compete with theirs.

Acknowledgments
---------------

Thanks to [Lambda](https://lambdalabs.com/) for providing a machine to train the model.

Thanks to [ML Collective](https://mlcollective.org/) for continuous support and providing compute resources for testing the model.

## Model overview

`MC-LLaVA-3b` is a multimodal model developed by visheratin that combines a large language model (LLM) with a vision tower for tasks involving both text and images. It is based on the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which uses a Vision Transformer (ViT) to encode image information and aligns it with a large language model. Instead of generating one large set of image "tokens" for the whole image, as typical LLaVA models do, `MC-LLaVA-3b` generates a much smaller number of tokens for each of several image crops. This lets it pick up visual information from small parts of the image without greatly inflating the total number of image tokens.

The model was fine-tuned from a [Phi-2 merge](https://huggingface.co/vince62s/phi-2-psy) using the vision tower from the [SigLIP 400M](https://huggingface.co/google/siglip-so400m-patch14-384) model. It uses the ChatML prompt format, which is a common format for chatbot-style interactions.

## Model inputs and outputs

### Inputs
- **Prompt**: A text prompt that the model will use to generate a response.
- **Image**: One or more images that the model will use to inform its response.

### Outputs
- **Generated text**: The model's response to the input prompt, which may incorporate information from the provided image(s).

## Capabilities

The `MC-LLaVA-3b` model has been evaluated on several multimodal benchmarks: TextVQA (50.9%), GQA (59.5%), VQAv2 (76.72%), VizWiz (32.68%), and V*-bench. Its scores therefore range from 32.68% on VizWiz to 76.72% on VQAv2. Because the model encodes image crops separately, it can draw on detail from small regions of an image, which helps on tasks that require understanding the image contents.

## What can I use it for?

The `MC-LLaVA-3b` model can be used for a variety of multimodal tasks, such as:

- **Image captioning**: Generating descriptive text to summarize the contents of an image.
- **Visual question answering**: Answering questions about the contents of an image.
- **Multimodal chatbots**: Building conversational agents that can understand and respond to both text and visual inputs.

The model's performance on benchmarks suggests that it could be a useful tool for applications that involve analyzing and understanding visual information, such as in the fields of education, e-commerce, or customer service.

## Things to try

One interesting aspect of the `MC-LLaVA-3b` model is its "multi-crop" approach to image encoding, which encodes several crops with a small number of tokens each instead of encoding the whole image once. You could experiment with this approach by generating responses to prompts that depend on fine detail in an image, and comparing the results to a model that encodes the whole image in a single pass. This could help you see the tradeoffs and benefits of the multi-crop approach.
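
One way to run such an experiment is to vary the `max_crops` and `num_tokens` arguments that the processor call in "How to use" accepts, then compare the answers. The harness below is a sketch under assumptions: `sweep_crop_settings` and `answer_fn` are hypothetical names, the values swept are illustrative rather than known-valid settings, and a stand-in function replaces the real model call so the example runs without a GPU.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep_crop_settings(
    answer_fn: Callable[[int, int], str],
    max_crops_values: Iterable[int],
    num_tokens_values: Iterable[int],
) -> Dict[Tuple[int, int], str]:
    """Call answer_fn for every (max_crops, num_tokens) pair and collect the answers."""
    num_tokens_list = list(num_tokens_values)  # allow reuse across outer iterations
    results = {}
    for max_crops in max_crops_values:
        for num_tokens in num_tokens_list:
            results[(max_crops, num_tokens)] = answer_fn(max_crops, num_tokens)
    return results


# Stand-in for a real model call (hypothetical). In practice this function would
# run the processor with the given settings, call model.generate, and decode.
def fake_answer(max_crops: int, num_tokens: int) -> str:
    return f"answer with max_crops={max_crops}, num_tokens={num_tokens}"


results = sweep_crop_settings(fake_answer, max_crops_values=[1, 10, 100], num_tokens_values=[256, 728])
for setting, answer in results.items():
    print(setting, answer)
```

Keeping the prompt and image fixed while only these two settings change makes it easier to attribute differences in the answers to the crop configuration.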

Another area to explore could be the model's performance on different types of multimodal tasks, such as visual question answering, image captioning, or even multimodal language generation. By testing the model on a variety of tasks, you may uncover its strengths and limitations, and identify areas where further improvements could be made.