[](#llava-model-card)LLaVA Model Card
=====================================

[](#model-details)Model details
-------------------------------

**Model type:** LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.

**Model date:** LLaVA-v1.5-7B was trained in September 2023.

**Paper or resources for more information:** [https://llava-vl.github.io/](https://llava-vl.github.io/)

[](#license)License
-------------------

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

**Where to send questions or comments about the model:** [https://github.com/haotian-liu/LLaVA/issues](https://github.com/haotian-liu/LLaVA/issues)

[](#intended-use)Intended use
-----------------------------

**Primary intended uses:** The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

[](#training-dataset)Training dataset
-------------------------------------

*   558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
*   158K GPT-generated multimodal instruction-following data.
*   450K academic-task-oriented VQA data mixture.
*   40K ShareGPT data.

[](#evaluation-dataset)Evaluation dataset
-----------------------------------------

A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.

## Model overview

The `llava-v1.5-7B-GGUF` model is an open-source chatbot trained by fine-tuning the LLaMA/Vicuna language model on a diverse dataset of GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture, developed by the researcher [jartine](https://aimodels.fyi/creators/huggingFace/jartine). The model was trained in September 2023 and is licensed under the LLAMA 2 Community License.

Similar models include the [LLaVA-13b-delta-v0](https://aimodels.fyi/models/huggingFace/llava-13b-delta-v0-liuhaotian), [llava-v1.6-mistral-7b](https://aimodels.fyi/models/huggingFace/llava-v16-mistral-7b-liuhaotian), [llava-1.5-7b-hf](https://aimodels.fyi/models/huggingFace/llava-15-7b-hf-llava-hf), and [ShareGPT4V-7B](https://aimodels.fyi/models/huggingFace/sharegpt4v-7b-lin-chen), all of which are multimodal chatbot models based on the LLaVA architecture.

## Model inputs and outputs

### Inputs
- **Image:** The model can process and generate responses based on provided images.
- **Text prompt:** The model takes in a text-based prompt, typically following a specific template, to generate a response.

### Outputs
- **Text response:** The model generates a text-based response based on the provided image and prompt.

## Capabilities

The `llava-v1.5-7B-GGUF` model is capable of performing a variety of multimodal tasks, such as image captioning, visual question answering, and instruction-following. It can generate coherent and relevant responses to prompts that involve both text and images, drawing on its training on a diverse dataset of multimodal instruction-following data.

## What can I use it for?

The primary use of the `llava-v1.5-7B-GGUF` model is for research on large multimodal models and chatbots. It can be utilized by researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence to explore the capabilities and limitations of such models. Additionally, the model's ability to process and respond to multimodal prompts could be leveraged in various applications, such as chatbots, virtual assistants, and educational tools.

## Things to try

One interesting aspect of the `llava-v1.5-7B-GGUF` model is its potential to combine visual and textual information in novel ways. Experimenters could try providing the model with prompts that involve both images and text, and observe how it synthesizes the information to generate relevant and coherent responses. Additionally, users could explore the model's capabilities in handling complex or ambiguous prompts, or prompts that require reasoning about the content of the image.