LLaVA Model Card
=====================================

Model details
-------------------------------

**Model type:** LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.

**Model date:** LLaVA-v1.5-13B was trained in September 2023.

**Paper or resources for more information:** [https://llava-vl.github.io/](https://llava-vl.github.io/)

License
-------------------

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

**Where to send questions or comments about the model:** [https://github.com/haotian-liu/LLaVA/issues](https://github.com/haotian-liu/LLaVA/issues)

Intended use
-----------------------------

**Primary intended uses:** The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset
-------------------------------------

*   558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
*   158K GPT-generated multimodal instruction-following data.
*   450K academic-task-oriented VQA data mixture.
*   40K ShareGPT data.

Evaluation dataset
-----------------------------------------

A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.

## Model overview

`llava-v1.5-13b` is an open-source chatbot trained by fine-tuning [LLaMA](https://github.com/facebookresearch/llama) and [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. The model was trained and released by [liuhaotian](https://aimodels.fyi/creators/huggingFace/liuhaotian). Related models include the smaller `llava-v1.5-7b`, the GGUF-format conversion `llava-v1.5-7B-GGUF`, and the earlier `LLaVA-13b-delta-v0` delta model.

## Model inputs and outputs

`llava-v1.5-13b` is a multimodal language model that can process both text and images. It takes in a prompt containing both text and the `<image>` tag, and generates relevant text output in response.

### Inputs
- Text prompt containing the `<image>` tag
- One or more images

### Outputs
- Relevant text output generated in response to the input prompt and image(s)
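
A prompt that mixes text with `<image>` tags is easy to get out of sync with the images actually supplied. The sketch below is a minimal pre-flight check, assuming one `<image>` tag per image; the helper name `check_image_tags` is hypothetical and is not part of any LLaVA API.

```python
IMAGE_TAG = "<image>"


def check_image_tags(prompt: str, num_images: int) -> None:
    """Raise ValueError unless the prompt has one <image> tag per image.

    Assumption: tags and images are matched one-to-one. The model card
    only says the prompt contains the tag, so treat this as a sketch.
    """
    num_tags = prompt.count(IMAGE_TAG)
    if num_tags != num_images:
        raise ValueError(
            f"prompt has {num_tags} {IMAGE_TAG} tag(s) "
            f"but {num_images} image(s) were supplied"
        )


prompt = "<image>\nWhat is unusual about this picture?"
check_image_tags(prompt, num_images=1)  # passes silently
```

Running the check before tokenization surfaces a mismatch as a clear error rather than as a confusing model response.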

## Capabilities

`llava-v1.5-13b` is built for multimodal understanding and instruction following. It can answer questions about images, generate image captions, and reason over combined text and visual inputs. The model has been evaluated on 12 benchmarks: 5 academic VQA benchmarks and 7 recent benchmarks proposed specifically for instruction-following LMMs.

## What can I use it for?

The primary intended uses of `llava-v1.5-13b` are research on large multimodal models and chatbots. Researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence can use the model to explore and develop new techniques in these domains. The model's capabilities in multimodal understanding and instruction-following make it a valuable tool for applications such as visual question answering, image captioning, and interactive AI assistants.

## Things to try

One thing to explore with `llava-v1.5-13b` is how it responds to prompts that reference several images. Users can provide a prompt with more than one image and check whether the response integrates information from the different visual inputs. The model's training on instruction-following data also suggests opportunities for interactive, task-oriented applications that combine natural language and visual cues.

  
  

LLaVA Model Card
=====================================

Model details
-------------------------------

**Model type:** LLaVA is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture. Base LLM: [NousResearch/Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)

**Model date:** LLaVA-v1.6-34B was trained in December 2023.

**Paper or resources for more information:** [https://llava-vl.github.io/](https://llava-vl.github.io/)

License
-------------------

[NousResearch/Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) license.

**Where to send questions or comments about the model:** [https://github.com/haotian-liu/LLaVA/issues](https://github.com/haotian-liu/LLaVA/issues)

Intended use
-----------------------------

**Primary intended uses:** The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset
-------------------------------------

*   558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
*   158K GPT-generated multimodal instruction-following data.
*   500K academic-task-oriented VQA data mixture.
*   50K GPT-4V data mixture.
*   40K ShareGPT data.
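
To make the difference between the v1.5 and v1.6 data mixtures concrete, the short sketch below sums the figures listed in the two training-dataset sections and reports which entries changed. The totals are a plain sum of the listed counts (in thousands), not a claim about unique samples, and the dictionary keys are shorthand labels chosen for this example.

```python
# Counts in thousands, copied from the training-dataset lists in the cards.
V1_5 = {
    "image-text pairs": 558,
    "GPT-generated instruction data": 158,
    "academic VQA mixture": 450,
    "ShareGPT": 40,
}
V1_6 = {
    "image-text pairs": 558,
    "GPT-generated instruction data": 158,
    "academic VQA mixture": 500,
    "GPT-4V data mixture": 50,
    "ShareGPT": 40,
}


def total_k(mixture: dict) -> int:
    """Sum of all listed entries, in thousands."""
    return sum(mixture.values())


def added_k(old: dict, new: dict) -> dict:
    """Entries whose count differs between mixtures, with the difference."""
    return {
        name: count - old.get(name, 0)
        for name, count in new.items()
        if count != old.get(name, 0)
    }


print(total_k(V1_5), total_k(V1_6))  # 1206 1306
print(added_k(V1_5, V1_6))
```

The comparison shows that the v1.6 list differs from the v1.5 list in two places: a larger academic VQA mixture and a new GPT-4V data mixture.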

Evaluation dataset
-----------------------------------------

A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.

## Model overview

`llava-v1.6-34b` is an open-source chatbot developed by [liuhaotian](https://aimodels.fyi/creators/huggingFace/liuhaotian), trained by fine-tuning a large language model (LLM) on multimodal instruction-following data. It is based on the transformer architecture and uses [NousResearch/Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as its base LLM.

The model is part of the LLaVA family, which includes similar versions like [llava-v1.5-13b](https://aimodels.fyi/models/huggingFace/llava-v15-13b-liuhaotian), [llava-v1.5-7b](https://aimodels.fyi/models/huggingFace/llava-v15-7b-liuhaotian), [llava-v1.6-mistral-7b](https://aimodels.fyi/models/huggingFace/llava-v16-mistral-7b-liuhaotian), and [LLaVA-13b-delta-v0](https://aimodels.fyi/models/huggingFace/llava-13b-delta-v0-liuhaotian). These models differ in their base LLM, training dataset, and model size.

## Model inputs and outputs

### Inputs
- The model accepts natural language instructions and prompts as input.
- It can also accept image data as input for multimodal tasks.

### Outputs
- The model generates natural-language text responses.
- For multimodal tasks, the output is still text, such as answers, captions, or descriptions grounded in the input image. The model does not generate images.

## Capabilities

The `llava-v1.6-34b` model has been trained on tasks that combine natural language processing, computer vision, and multimodal reasoning. Typical uses include answering complex questions about images, following detailed instructions, and describing image content in text.

## What can I use it for?

The primary use of the `llava-v1.6-34b` model is for research on large multimodal models and chatbots. It can be particularly useful for researchers and hobbyists working in computer vision, natural language processing, machine learning, and artificial intelligence. 

Some potential use cases for the model include:
- Building chatbots and virtual assistants with multimodal capabilities
- Developing visual question answering systems
- Exploring new techniques for instruction-following in language models
- Advancing research on multimodal reasoning and understanding

## Things to try

One interesting aspect of the `llava-v1.6-34b` model is its ability to combine text and image data to perform complex tasks. Researchers could experiment with asking the model for detailed textual descriptions of images, or with questions that require both visual and linguistic understanding.

Another area to explore is the model's performance on tasks that require strong reasoning and problem-solving skills, such as scientific question answering or task-oriented dialogue. By probing the model's capabilities in these areas, researchers can gain valuable insights into the strengths and limitations of large multimodal language models.

  
  

LLaVA Model Card
=====================================

Model details
-------------------------------

**Model type:** LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.

**Model date:** LLaVA-v1.5-7B was trained in September 2023.

**Paper or resources for more information:** [https://llava-vl.github.io/](https://llava-vl.github.io/)

License
-------------------

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

**Where to send questions or comments about the model:** [https://github.com/haotian-liu/LLaVA/issues](https://github.com/haotian-liu/LLaVA/issues)

Intended use
-----------------------------

**Primary intended uses:** The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset
-------------------------------------

*   558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
*   158K GPT-generated multimodal instruction-following data.
*   450K academic-task-oriented VQA data mixture.
*   40K ShareGPT data.

Evaluation dataset
-----------------------------------------

A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.

## Model overview

`llava-v1.5-7b` is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. The model was created by [liuhaotian](https://aimodels.fyi/creators/huggingFace/liuhaotian), and similar models include [llava-v1.5-7B-GGUF](https://aimodels.fyi/models/huggingFace/llava-v15-7b-gguf-jartine), [LLaVA-13b-delta-v0](https://aimodels.fyi/models/huggingFace/llava-13b-delta-v0-liuhaotian), [llava-v1.6-mistral-7b](https://aimodels.fyi/models/huggingFace/llava-v16-mistral-7b-liuhaotian), and [llava-1.5-7b-hf](https://aimodels.fyi/models/huggingFace/llava-15-7b-hf-llava-hf).

## Model inputs and outputs

`llava-v1.5-7b` is a large language model that can take in textual prompts and generate relevant responses. The model is particularly designed for multimodal tasks, allowing it to process and generate text based on provided images.

### Inputs
- Textual prompts in the format "USER: <prompt>\nASSISTANT:"
- Optional image data, indicated by the `<image>` token in the prompt

### Outputs
- Generated text responses relevant to the given prompt and image (if provided)
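
The `"USER: <prompt>\nASSISTANT:"` layout described above can be produced with a small formatter. This is a minimal sketch: the function name `build_prompt` is hypothetical, `\n` is read as a newline character, and placing the `<image>` token on its own line before the question is an assumption rather than something the card specifies.

```python
def build_prompt(question: str, with_image: bool = True) -> str:
    # Format a single user turn, leaving the assistant turn open
    # so the model continues from "ASSISTANT:".
    # Assumption: the <image> token goes before the question text.
    user_text = f"<image>\n{question}" if with_image else question
    return f"USER: {user_text}\nASSISTANT:"


print(build_prompt("What is shown in this image?"))
print(build_prompt("Tell me a joke.", with_image=False))
```

Keeping the formatting in one function makes it easier to stay consistent if the prompt template needs to change for a different checkpoint.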

## Capabilities

`llava-v1.5-7b` can perform a variety of tasks, including:
- Open-ended conversation
- Answering questions about images
- Generating captions for images
- Providing detailed descriptions of scenes and objects
- Assisting with creative writing and ideation

The model's multimodal capabilities allow it to understand and generate text based on both textual and visual inputs.

## What can I use it for?

`llava-v1.5-7b` can be a powerful tool for researchers and hobbyists working on projects related to computer vision, natural language processing, and artificial intelligence. Some potential use cases include:
- Building interactive chatbots and virtual assistants
- Developing image captioning and visual question answering systems
- Enhancing text generation models with multimodal understanding
- Exploring the intersection of language and vision in AI

By leveraging the model's capabilities, you can create innovative applications that combine language and visual understanding.

## Things to try

One interesting thing to try with `llava-v1.5-7b` is its ability to handle multi-image and multi-prompt generation. This means you can provide multiple images in a single prompt and the model will generate a response that considers all the visual inputs. This can be particularly useful for tasks like visual reasoning or complex scene descriptions.

Another intriguing direction is synergy with other large language models, such as GPT-4. The [LLaVA-13b-delta-v0](https://aimodels.fyi/models/huggingFace/llava-13b-delta-v0-liuhaotian) model card reports that the original LLaVA model's synergy with GPT-4 set a new state of the art on the ScienceQA dataset. That result was reported for the earlier model, not for `llava-v1.5-7b`, so it is worth testing whether similar combinations help here.

  
  

**NOTE: This "delta model" cannot be used directly.**  
Users have to apply it on top of the original LLaMA weights to get actual LLaVA weights.  
See [https://github.com/haotian-liu/LLaVA#llava-weights](https://github.com/haotian-liu/LLaVA#llava-weights) for instructions.

LLaVA Model Card
=====================================

Model details
-------------------------------

**Model type:** LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.

**Model date:** LLaVA was trained in April 2023.

**Paper or resources for more information:** [https://llava-vl.github.io/](https://llava-vl.github.io/)

**License:** Apache License 2.0

**Where to send questions or comments about the model:** [https://github.com/haotian-liu/LLaVA/issues](https://github.com/haotian-liu/LLaVA/issues)

Intended use
-----------------------------

**Primary intended uses:** The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset
-------------------------------------

*   595K filtered image-text pairs from CC3M.
*   150K GPT-generated multimodal instruction-following data.

Evaluation dataset
-----------------------------------------

A preliminary evaluation of model quality uses a set of 90 visual reasoning questions built from 30 unique images randomly sampled from COCO val 2014. Each image is associated with three types of questions: conversational, detailed description, and complex reasoning. We utilize GPT-4 to judge the model outputs. We also evaluate our model on the ScienceQA dataset. Our synergy with GPT-4 sets a new state-of-the-art on the dataset. See [https://llava-vl.github.io/](https://llava-vl.github.io/) for more details.

## Model Overview

The `LLaVA-13b-delta-v0` model is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an autoregressive language model based on the transformer architecture. The model was developed by [liuhaotian](https://aimodels.fyi/creators/huggingFace/liuhaotian), who has also released later models such as [llava-v1.6-mistral-7b](https://aimodels.fyi/models/huggingFace/llava-v16-mistral-7b-liuhaotian). A related delta model from a different publisher is Microsoft's [llava-med-7b-delta](https://aimodels.fyi/models/huggingFace/llava-med-7b-delta-microsoft).
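
Because this checkpoint is distributed as a delta, usable weights are obtained by combining it with the original LLaMA weights. The toy sketch below illustrates the general idea of an additive delta on flattened tensors stored as Python lists. The function name `apply_delta`, the parameter names, and the handling of delta-only parameters are assumptions for illustration; the actual conversion should follow the instructions linked from the LLaVA repository.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def apply_delta(base: Weights, delta: Weights) -> Weights:
    """Return base + delta, tensor by tensor (tensors flattened to lists)."""
    target: Weights = {}
    for name, delta_values in delta.items():
        if name not in base:
            # Assumption: parameters present only in the delta
            # (for example newly added layers) are copied unchanged.
            target[name] = list(delta_values)
            continue
        base_values = base[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for {name}")
        target[name] = [b + d for b, d in zip(base_values, delta_values)]
    return target


# Hypothetical parameter names and values, for illustration only.
base = {"layer.weight": [1.0, 2.0]}
delta = {"layer.weight": [0.5, -1.0], "projector.weight": [0.25]}
print(apply_delta(base, delta))
```

Distributing only the difference means the delta is not usable by itself, which is why the base weights must be obtained separately before conversion.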

## Model Inputs and Outputs

The `LLaVA-13b-delta-v0` model is a language model that can generate human-like text given a prompt. It also has multimodal capabilities, allowing it to generate text based on both textual and visual inputs. 

### Inputs
- **Text prompts**: The model can accept text prompts to generate relevant responses.
- **Images**: The model can also accept images as part of the input, allowing it to generate text describing or relating to the provided image.

### Outputs
- **Textual responses**: The primary output of the model is human-like textual responses to the provided prompts or image-text combinations.

## Capabilities

The `LLaVA-13b-delta-v0` model has been trained to engage in open-ended conversation, answer questions, and describe images. It demonstrates strong language understanding and generation capabilities, as well as the ability to reason about and describe visual information. The model can be particularly useful for research on large multimodal models and chatbots.

## What Can I Use It For?

The primary intended use of the `LLaVA-13b-delta-v0` model is for research on large multimodal models and chatbots. Researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence may find this model useful for exploring various multimodal applications and advancing the state of the art in these fields.

## Things to Try

Some interesting things to try with the `LLaVA-13b-delta-v0` model include:

- Evaluating the model's ability to understand and describe complex visual scenes by providing it with a diverse set of images.
- Exploring the model's language understanding and generation capabilities by engaging it in open-ended conversations on a variety of topics.
- Investigating the model's reasoning abilities by asking it to answer questions that require combining information from both text and visual inputs.
- Experimenting with different prompting strategies to see how the model's responses can be tailored for specific use cases or applications.

  
  

LLaVA Model Card
=====================================

Model details
-------------------------------

**Model type:** LLaVA is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture. Base LLM: [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

**Model date:** LLaVA-v1.6-Mistral-7B was trained in December 2023.

**Paper or resources for more information:** [https://llava-vl.github.io/](https://llava-vl.github.io/)

License
-------------------

[mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) license.

**Where to send questions or comments about the model:** [https://github.com/haotian-liu/LLaVA/issues](https://github.com/haotian-liu/LLaVA/issues)

Intended use
-----------------------------

**Primary intended uses:** The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset
-------------------------------------

*   558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
*   158K GPT-generated multimodal instruction-following data.
*   500K academic-task-oriented VQA data mixture.
*   50K GPT-4V data mixture.
*   40K ShareGPT data.

Evaluation dataset
-----------------------------------------

A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.

## Model overview

The `llava-v1.6-mistral-7b` is an open-source chatbot model developed by [Haotian Liu](https://aimodels.fyi/creators/huggingFace/liuhaotian) that combines a pre-trained large language model with a pre-trained vision encoder for multimodal chatbot use cases. It is an auto-regressive language model based on the transformer architecture, fine-tuned on a diverse dataset of image-text pairs and multimodal instruction-following data.

The model builds upon the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) base model, which provides improved commercial licensing and bilingual support compared to earlier versions. In addition, `llava-v1.6-mistral-7b` was trained on an expanded, more diverse data mixture, and it supports dynamic high-resolution image input.

Similar models include the [llava-v1.6-mistral-7b-hf](https://aimodels.fyi/models/huggingFace/llava-v16-mistral-7b-hf-llava-hf) and [llava-1.5-7b-hf](https://aimodels.fyi/models/huggingFace/llava-15-7b-hf-llava-hf) checkpoints published by llava-hf, which package LLaVA weights for use with the Hugging Face Transformers library.

## Model inputs and outputs

### Inputs
- **Text prompt**: The model takes a text prompt as input, which can include instructions, questions, or other natural language text.
- **Image**: The model can also take an image as input, which is integrated into the text prompt using the `<image>` token.

### Outputs
- **Text response**: The model generates a relevant text response to the input prompt, in an auto-regressive manner.

## Capabilities

The `llava-v1.6-mistral-7b` model is capable of handling a variety of multimodal tasks, such as image captioning, visual question answering, and open-ended dialogue. It can understand and reason about the content of images, and generate coherent and contextually appropriate responses.

## What can I use it for?

You can use the `llava-v1.6-mistral-7b` model for research on large multimodal models and chatbots, or for building practical applications that require visual understanding and language generation, such as intelligent virtual assistants, image-based search, or interactive educational tools.

## Things to try

One interesting aspect of the `llava-v1.6-mistral-7b` model is its ability to handle dynamic high-resolution image input. You could experiment with providing higher-quality images to the model and observe how it affects the quality and level of detail in the generated responses.
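
One way to reason about dynamic high-resolution input is as a choice among several supported canvas sizes: pick the canvas that keeps the most image pixels after an aspect-preserving resize, and break ties by the least padding. The sketch below illustrates that selection rule only. The 336-pixel tile size, the candidate grid list, and the function name `pick_canvas` are assumptions made for this example and are not taken from the model card.

```python
from typing import List, Tuple

Size = Tuple[int, int]  # (width, height)


def pick_canvas(image_size: Size, candidates: List[Size]) -> Size:
    """Choose the candidate canvas that keeps the most pixels, then wastes least."""
    img_w, img_h = image_size
    best, best_key = None, None
    for can_w, can_h in candidates:
        # Aspect-preserving fit, done in integer arithmetic.
        if can_w * img_h <= can_h * img_w:  # width is the limiting side
            fit_w, fit_h = can_w, img_h * can_w // img_w
        else:  # height is the limiting side
            fit_w, fit_h = img_w * can_h // img_h, can_h
        kept = min(fit_w * fit_h, img_w * img_h)  # upscaling adds no detail
        wasted = can_w * can_h - kept  # padding or unused area
        key = (kept, -wasted)
        if best_key is None or key > best_key:
            best, best_key = (can_w, can_h), key
    return best


TILE = 336  # assumed tile size
GRIDS = [(1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]  # assumed (cols, rows)
CANDIDATES = [(TILE * cols, TILE * rows) for cols, rows in GRIDS]

print(pick_canvas((1000, 500), CANDIDATES))
print(pick_canvas((400, 1200), CANDIDATES))
```

Under these assumptions a wide image maps to a wide canvas and a tall image to a tall one, which is a useful mental model when comparing responses across input resolutions.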

Additionally, you could explore the model's performance on specialized benchmarks for instruction-following language models, such as the collection of 12 benchmarks mentioned in the model description, to better understand its strengths and limitations in this domain.

  
  

LLaVA Model Card
=====================================

Model details
-------------------------------

**Model type:** LLaVA is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture. Base LLM: [lmsys/vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5)

**Model date:** LLaVA-v1.6-Vicuna-7B was trained in December 2023.

**Paper or resources for more information:** [https://llava-vl.github.io/](https://llava-vl.github.io/)

License
-------------------

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

**Where to send questions or comments about the model:** [https://github.com/haotian-liu/LLaVA/issues](https://github.com/haotian-liu/LLaVA/issues)

Intended use
-----------------------------

**Primary intended uses:** The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset
-------------------------------------

*   558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
*   158K GPT-generated multimodal instruction-following data.
*   500K academic-task-oriented VQA data mixture.
*   50K GPT-4V data mixture.
*   40K ShareGPT data.

Evaluation dataset
-----------------------------------------

A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.

## Model overview

`llava-v1.6-vicuna-7b` is an open-source chatbot model developed by [liuhaotian](https://aimodels.fyi/creators/huggingFace/liuhaotian). It is a multimodal language model based on the transformer architecture, trained by fine-tuning the [lmsys/vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) model on a diverse multimodal dataset. Similar models include `llava-v1.5-7b`, `llava-v1.5-13b`, `llava-v1.6-34b`, and `llava-v1.6-mistral-7b`, also released by liuhaotian, as well as the GGUF-format conversion `llava-v1.5-7B-GGUF` from a different publisher.

## Model inputs and outputs

`llava-v1.6-vicuna-7b` takes natural language prompts, optionally paired with images, and generates text responses. The model is trained on a variety of datasets, including image-text pairs, multimodal instruction-following data, academic VQA tasks, and conversational data. This gives the model broad capabilities to engage in open-ended dialogue, answer questions, and follow instructions across different domains.

### Inputs
- Natural language text prompts
- Multimodal inputs like images (when combined with text)

### Outputs
- Coherent text responses
- Answers to questions
- Completion of instructions

## Capabilities

`llava-v1.6-vicuna-7b` handles a range of language tasks, including open-ended conversation, question answering, and task completion. The model can engage in fluent dialogue, provide informative responses, and follow multi-step instructions. As a multimodal model, it can also interpret and reason about images supplied alongside text.

## What can I use it for?

The primary intended uses of `llava-v1.6-vicuna-7b` are research on large multimodal models and chatbots. The model can be used by researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence to explore the capabilities and limitations of such systems. Potential applications include virtual assistants, content generation, and task automation, though the model card lists research as the primary use and any use is subject to the Llama 2 Community License.

## Things to try

Experiment with `llava-v1.6-vicuna-7b` to see how it handles open-ended dialogue, question answering, and instruction following across different domains. Try providing the model with multimodal inputs, such as images paired with text, to see how it can leverage visual information. Explore the model's strengths and weaknesses, and compare its performance to similar models like the `llava-v1.5-7b` or `llava-v1.6-mistral-7b`.

  
  

**NOTE: This is a research preview of the LLaVA-Lightning based on MPT-7B-chat checkpoint. The usage of the model should comply with MPT-7B-chat license and agreements.**

**NOTE: Unlike other LLaVA models, this model can (should) be used directly without delta weights conversion!**

LLaVA Model Card
=====================================

Model details
-------------------------------

**Model type:** LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna/MPT on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.

**Model date:** LLaVA-Lightning-MPT was trained in May 2023.

**Paper or resources for more information:** [https://llava-vl.github.io/](https://llava-vl.github.io/)

**License:** CC-BY-NC-SA 4.0

**Where to send questions or comments about the model:** [https://github.com/haotian-liu/LLaVA/issues](https://github.com/haotian-liu/LLaVA/issues)

Intended use
-----------------------------

**Primary intended uses:** The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset
-------------------------------------

*   558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
*   80K GPT-generated multimodal instruction-following data.

Evaluation dataset
-----------------------------------------

A preliminary evaluation of model quality uses a set of 90 visual reasoning questions built from 30 unique images randomly sampled from COCO val 2014. Each image is associated with three types of questions: conversational, detailed description, and complex reasoning. We utilize GPT-4 to judge the model outputs. We also evaluate our model on the ScienceQA dataset. Our synergy with GPT-4 sets a new state-of-the-art on the dataset. See [https://llava-vl.github.io/](https://llava-vl.github.io/) for more details.

## Model overview

`LLaVA-Lightning-MPT-7B-preview` is a research preview of LLaVA-Lightning, an open-source chatbot trained by fine-tuning the LLaMA/Vicuna/MPT language models on GPT-generated multimodal instruction-following data. It is based on the MPT-7B-chat checkpoint and, unlike other LLaVA models, can be used directly without a delta-weights conversion step. The primary use of LLaVA is research on large multimodal models and chatbots, with the target audience being researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## Model inputs and outputs

`LLaVA-Lightning-MPT-7B-preview` is an auto-regressive language model that can engage in multimodal tasks. It takes in a combination of text and visual inputs and generates relevant text outputs.

### Inputs
- Text prompts for conversational, detailed description, and complex reasoning tasks
- Images associated with the prompts

### Outputs
- Textual responses that demonstrate the model's understanding and reasoning about the provided inputs

## Capabilities

The model card for `LLaVA-Lightning-MPT-7B-preview` describes a preliminary evaluation on a set of 90 visual reasoning questions covering conversational, detailed description, and complex reasoning tasks, with GPT-4 judging the outputs. It also reports evaluation on the ScienceQA dataset, where the synergy with GPT-4 set a new state of the art.

## What can I use it for?

The primary intended use of `LLaVA-Lightning-MPT-7B-preview` is for research on large multimodal models and chatbots. Researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence can explore the model's capabilities and use it as a testbed for further advancements in these fields.

## Things to try

Researchers can experiment with fine-tuning the `LLaVA-Lightning-MPT-7B-preview` model on specific datasets or tasks to explore its adaptability and performance. Additionally, users can compare the model's behavior and outputs with other similar models, such as [LLaVA-13b-delta-v0](https://aimodels.fyi/models/huggingFace/llava-13b-delta-v0-liuhaotian) and [llava-v1.5-7b](https://aimodels.fyi/models/huggingFace/llava-v15-7b-liuhaotian), to gain a deeper understanding of the model's strengths and limitations.