[](#llava-model-card)LLaVA Model Card
=====================================

[](#model-details)Model details
-------------------------------

**Model type:** LLaVA is an open-source chatbot trained by fine-tuning LLM on multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture. Base LLM: [NousResearch/Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)

**Model date:** LLaVA-v1.6-34B was trained in December 2023.

**Paper or resources for more information:** [https://llava-vl.github.io/](https://llava-vl.github.io/)

[](#license)License
-------------------

[NousResearch/Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) license.

**Where to send questions or comments about the model:** [https://github.com/haotian-liu/LLaVA/issues](https://github.com/haotian-liu/LLaVA/issues)

[](#intended-use)Intended use
-----------------------------

**Primary intended uses:** The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

[](#training-dataset)Training dataset
-------------------------------------

*   558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
*   158K GPT-generated multimodal instruction-following data.
*   500K academic-task-oriented VQA data mixture.
*   50K GPT-4V data mixture.
*   40K ShareGPT data.

[](#evaluation-dataset)Evaluation dataset
-----------------------------------------

A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.

## Model overview

The `llava-v1.6-34b` is an open-source chatbot developed by [liuhaotian](https://aimodels.fyi/creators/huggingFace/liuhaotian) that is trained by fine-tuning a large language model (LLM) on multimodal instruction-following data. It is based on the transformer architecture and uses the [NousResearch/Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as its base LLM. 

The model is part of the LLaVA family, which includes similar versions like [llava-v1.5-13b](https://aimodels.fyi/models/huggingFace/llava-v15-13b-liuhaotian), [llava-v1.5-7b](https://aimodels.fyi/models/huggingFace/llava-v15-7b-liuhaotian), [llava-v1.6-mistral-7b](https://aimodels.fyi/models/huggingFace/llava-v16-mistral-7b-liuhaotian), and [LLaVA-13b-delta-v0](https://aimodels.fyi/models/huggingFace/llava-13b-delta-v0-liuhaotian). These models differ in their base LLM, training dataset, and model size.

## Model inputs and outputs

### Inputs
- The model accepts natural language instructions and prompts as input.
- It can also accept image data as input for multimodal tasks.

### Outputs
- The model generates human-like responses in natural language.
- For multimodal tasks, the model can generate relevant images as output.

## Capabilities

The `llava-v1.6-34b` model has been trained to engage in a wide range of tasks, including natural language processing, computer vision, and multimodal reasoning. It has shown strong performance on tasks such as answering complex questions, following detailed instructions, and generating relevant images.

## What can I use it for?

The primary use of the `llava-v1.6-34b` model is for research on large multimodal models and chatbots. It can be particularly useful for researchers and hobbyists working in computer vision, natural language processing, machine learning, and artificial intelligence. 

Some potential use cases for the model include:
- Building chatbots and virtual assistants with multimodal capabilities
- Developing visual question answering systems
- Exploring new techniques for instruction-following in language models
- Advancing research on multimodal reasoning and understanding

## Things to try

One interesting aspect of the `llava-v1.6-34b` model is its ability to combine text and image data to perform complex tasks. Researchers could experiment with using the model to generate images based on textual descriptions, or to answer questions that require both visual and linguistic understanding.

Another area to explore is the model's performance on tasks that require strong reasoning and problem-solving skills, such as scientific question answering or task-oriented dialogue. By probing the model's capabilities in these areas, researchers can gain valuable insights into the strengths and limitations of large multimodal language models.