[](#nanollava---sub-1b-vision-language-model)nanoLLaVA - Sub 1B Vision-Language Model
=====================================================================================

![Logo](https://i.postimg.cc/d15k3YNG/nanollava.webp)

[](#description)Description
---------------------------

nanoLLaVA is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices.

*   **Base LLM**: [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1) (Qwen1.5-0.5B)
*   **Vision Encoder**: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)

Model

**VQA v2**

**TextVQA**

**ScienceQA**

**POPE**

**MMMU (Test)**

**MMMU (Eval)**

**GQA**

**MM-VET**

Score

70.84

46.71

58.97

84.1

28.6

30.4

54.79

23.9

[](#training-data)Training Data
-------------------------------

Training Data will be released later as I am still writing a paper on this. Expect the final final to be much more powerful than the current one.

[](#finetuning-code)Finetuning Code
-----------------------------------

Coming Soon!!!

[](#usage)Usage
---------------

You can use with `transformers` with the following script:

    pip install -U transformers accelerate flash_attn
    

    import torch
    import transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from PIL import Image
    import warnings
    
    # disable some warnings
    transformers.logging.set_verbosity_error()
    transformers.logging.disable_progress_bar()
    warnings.filterwarnings('ignore')
    
    # set device
    torch.set_default_device('cuda')  # or 'cpu'
    
    # create model
    model = AutoModelForCausalLM.from_pretrained(
        'qnguyen3/nanoLLaVA',
        torch_dtype=torch.float16,
        device_map='auto',
        trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(
        'qnguyen3/nanoLLaVA',
        trust_remote_code=True)
    
    # text prompt
    prompt = 'Describe this image in detail'
    
    messages = [
        {"role": "user", "content": f'<image>\n{prompt}'}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    print(text)
    
    text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
    input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)
    
    # image, sample images can be found in images folder
    image = Image.open('/path/to/image.png')
    image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)
    
    # generate
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        max_new_tokens=2048,
        use_cache=True)[0]
    
    print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
    

[](#prompt-format)Prompt Format
-------------------------------

The model follow the ChatML standard, however, without `\n` at the end of `<|im_end|>`:

    <|im_start|>system
    Answer the question<|im_end|><|im_start|>user
    <image>
    What is the picture about?<|im_end|><|im_start|>assistant
    

* * *

Image

Example

[![small](/qnguyen3/nanoLLaVA/resolve/main/example_1.png)](/qnguyen3/nanoLLaVA/blob/main/example_1.png)

**What is the text saying?**  
"Small but mighty".  
**How does the text correlate to the context of the image?**  
The text seems to be a playful or humorous representation of a small but mighty figure, possibly a mouse or a mouse toy, holding a weightlifting bar.

* * *

## Model overview

`nanoLLaVA` is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices. It is based on the [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1) base language model and the [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder. Similar models include the [Qwen-VL](https://aimodels.fyi/models/huggingFace/qwen-vl-qwen) series from Alibaba Cloud, which are large vision-language models with a range of capabilities.

## Model inputs and outputs

### Inputs
- **Text prompt**: A text prompt describing the image to be processed
- **Image**: An image to be analyzed and described

### Outputs
- **Multimodal description**: A detailed description of the image, grounding relevant objects and their relationships

## Capabilities

The `nanoLLaVA` model has demonstrated strong performance on a variety of vision-language tasks, including visual question answering, text-based VQA, science QA, and referring expression comprehension. It achieves SOTA results on several benchmarks while maintaining a compact model size suitable for edge deployment.

## What can I use it for?

The `nanoLLaVA` model can be used for a variety of applications that require efficiently integrating vision and language understanding, such as:

- **Intelligent assistants**: Providing detailed descriptions and answering questions about visual content
- **Accessibility tools**: Generating alt text and captions for images to improve accessibility
- **Automated reporting**: Summarizing visual observations and insights from images or documents
- **Visual search and retrieval**: Enabling multimodal search and browsing of image databases

## Things to try

Experiment with the `nanoLLaVA` model on a range of visual and multimodal tasks beyond the standard benchmarks. Explore its few-shot or zero-shot capabilities to see how it can adapt to novel scenarios without extensive fine-tuning. Additionally, investigate ways to optimize its performance and efficiency for your specific use cases.

[](#nanollava-15---improved-sub-1b-vision-language-model)nanoLLaVA-1.5 - Improved sub 1B Vision-Language Model
==============================================================================================================

![Logo](https://i.postimg.cc/d15k3YNG/nanollava.webp)

[](#description)Description
---------------------------

nanoLLaVA-1.5 is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices. This is an update from the v1.0 version [qnguyen3/nanoLLaVA](https://huggingface.co/qnguyen3/nanoLLaVA)

*   **Base LLM**: [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1) (Qwen1.5-0.5B)
*   **Vision Encoder**: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)

Model

**VQA v2**

**TextVQA**

**ScienceQA**

**POPE**

**MMMU (Test)**

**MMMU (Eval)**

**GQA**

**MM-VET**

nanoLLavA-1.0

70.84

46.71

58.97

84.1

28.6

30.4

54.79

23.9

nanoLLavA-1.5

TBD

TBD

TBD

TBD

TBD

TBD

TBD

TBD

[](#training-data)Training Data
-------------------------------

Training Data will be released later as I am still writing a paper on this. Expect the final final to be much more powerful than the current one.

[](#finetuning-code)Finetuning Code
-----------------------------------

Coming Soon!!!

[](#usage)Usage
---------------

You can use with `transformers` with the following script:

    pip install -U transformers accelerate flash_attn
    

    import torch
    import transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from PIL import Image
    import warnings
    
    # disable some warnings
    transformers.logging.set_verbosity_error()
    transformers.logging.disable_progress_bar()
    warnings.filterwarnings('ignore')
    
    # set device
    torch.set_default_device('cuda')  # or 'cpu'
    
    model_name = 'qnguyen3/nanoLLaVA-1.5'
    
    # create model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map='auto',
        trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True)
    
    # text prompt
    prompt = 'Describe this image in detail'
    
    messages = [
        {"role": "user", "content": f'<image>\n{prompt}'}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    print(text)
    
    text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
    input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)
    
    # image, sample images can be found in images folder
    image = Image.open('/path/to/image.png')
    image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)
    
    # generate
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        max_new_tokens=2048,
        use_cache=True)[0]
    
    print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
    

[](#prompt-format)Prompt Format
-------------------------------

The model follow the ChatML standard, however, without `\n` at the end of `<|im_end|>`:

    <|im_start|>system
    Answer the question<|im_end|><|im_start|>user
    <image>
    What is the picture about?<|im_end|><|im_start|>assistant

## Model overview

`nanoLLaVA-1.5` is an improved sub-1 billion parameter vision-language model created by [qnguyen3](https://aimodels.fyi/creators/huggingFace/qnguyen3). It builds upon the previous `nanoLLaVA` model by utilizing a more powerful base language model, [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1), and a high-quality vision encoder, [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384). This allows `nanoLLaVA-1.5` to achieve improved performance across a variety of multimodal benchmarks compared to its predecessor, while still maintaining a compact model size suitable for edge device deployment.

## Model inputs and outputs

### Inputs
- **Text prompt**: A text prompt describing an image, typically in a conversational format.
- **Image**: An image that the model will use to generate a description.

### Outputs
- **Image description**: A detailed textual description of the provided image, generated by the model.

## Capabilities

`nanoLLaVA-1.5` is capable of generating detailed, coherent descriptions of images across a wide range of subject matter. It has demonstrated strong performance on benchmarks such as VQA v2, TextVQA, ScienceQA, POPE, MMMU, GQA, and MM-VET, surpassing the previous `nanoLLaVA` model in many areas.

## What can I use it for?

`nanoLLaVA-1.5` can be used in a variety of applications that involve understanding and describing visual content, such as:

- **Image captioning**: Automatically generating captions for images in applications like social media, e-commerce, or content management.
- **Visual question answering**: Answering questions about the contents of an image in a conversational interface.
- **Multimodal chatbots**: Building intelligent chatbots that can understand and respond to both text and visual inputs.

## Things to try

One interesting aspect of `nanoLLaVA-1.5` is its compact size and ability to run efficiently on edge devices. This makes it well-suited for applications where low-latency, on-device inference is important, such as in mobile apps or embedded systems. Developers can explore ways to integrate `nanoLLaVA-1.5` into their projects and leverage its multimodal capabilities to create innovative user experiences.