_This model was added by Hugging Face staff._

**NOTE: This "delta model" cannot be used directly.** Users have to apply it on top of the original LLaMA weights to get actual LLaVA weights.

[](#llava-med-large-language-and-vision-assistant-for-biomedicine)LLaVA-Med: Large Language and Vision Assistant for BioMedicine
================================================================================================================================

_Visual instruction tuning towards buiding large language and vision models with GPT-4 level capabilities in the biomedicine space._

\[[Paper, NeurIPS 2023 Datasets and Benchmarks Track (Spotlight)](https://arxiv.org/abs/2306.00890)\] | \[[LLaVA-Med Github Repository](https://github.com/microsoft/LLaVA-Med)\]

[Chunyuan Li\*](https://chunyuan.li/), [Cliff Wong\*](https://scholar.google.com/citations?user=Sl05ifcAAAAJ&hl=en), [Sheng Zhang\*](https://scholar.google.com/citations?user=-LVEXQ8AAAAJ&hl=en), [Naoto Usuyama](https://www.microsoft.com/en-us/research/people/naotous/), [Haotian Liu](https://hliu.cc), [Jianwei Yang](https://jwyang.github.io/), [Tristan Naumann](https://scholar.google.com/citations?user=cjlSeqwAAAAJ&hl=en), [Hoifung Poon](https://scholar.google.com/citations?user=yqqmVbkAAAAJ&hl=en), [Jianfeng Gao](https://scholar.google.com/citations?user=CQ1cqKkAAAAJ&hl=en) (\*Equal Contribution)

![](https://github.com/microsoft/LLaVA-Med/blob/main/images/llava_med_logo.png?raw=true)  

_Generated by [GLIGEN](https://gligen.github.io/) using the grounded inpainting mode, with three boxes: `white doctor coat`, `stethoscope`, `white doctor hat with a red cross sign`._

![](https://github.com/microsoft/LLaVA-Med/blob/main/images/llava_med_pipeline.png?raw=true)  

_LLaVA-Med was initialized with the general-domain LLaVA and then continuously trained in a curriculum learning fashion (first biomedical concept alignment then full-blown instruction-tuning). We evaluated LLaVA-Med on standard visual conversation and question answering tasks._

[![Code License](https://img.shields.io/badge/Code%20License-Microsoft%20Research-red)](/microsoft/llava-med-7b-delta/blob/main/Research%20License.docx) [![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)](https://creativecommons.org/licenses/by-nc/4.0/deed.en) **Usage and License Notices**: The data, code, and model checkpoints are intended and licensed for research use only. They are also subject to additional restrictions dictated by the Terms of Use: LLaMA, Vicuna and GPT-4 respectively. The data is made available under CC BY NC 4.0. The data, code, and model checkpoints may be used for non-commercial purposes and any models trained using the dataset should be used only for research purposes. It is expressly prohibited for models trained on this data to be used in clinical care or for any clinical decision making purposes.

[](#model-description)Model Description
---------------------------------------

Large Language and Vision Assistant for bioMedicine (i.e., LLaVA-Med) is a large language and vision model trained using a curriculum learning method for adapting LLaVA to the biomedical domain. It is an open-source release intended for research use only to facilitate reproducibility of the corresponding paper which claims improved performance for open-ended biomedical questions answering tasks, including common visual question answering (VQA) benchmark datasets such as PathVQA and VQA-RAD.

### [](#model-uses)Model Uses

#### [](#intended-use)Intended Use

The data, code, and model checkpoints are intended to be used solely for (I) future research on visual-language processing and (II) reproducibility of the experimental results reported in the reference paper. The data, code, and model checkpoints are not intended to be used in clinical care or for any clinical decision making purposes.

#### [](#primary-intended-use)Primary Intended Use

The primary intended use is to support AI researchers reproducing and building on top of this work. LLaVA-Med and its associated models should be helpful for exploring various biomedical vision-language processing (VLP ) and vision question answering (VQA) research questions.

#### [](#out-of-scope-use)Out-of-Scope Use

**Any** deployed use case of the model --- commercial or otherwise --- is out of scope. Although we evaluated the models using a broad set of publicly-available research benchmarks, the models and evaluations are intended _for research use only_ and not intended for deployed use cases. Please refer to [the associated paper](https://aka.ms/llava-med) for more details.

### [](#data)Data

This model builds upon [PMC-15M dataset](https://aka.ms/biomedclip-paper), which is a large-scale parallel image-text dataset for biomedical vision-language processing. It contains 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. It covers a diverse range of biomedical image types, such as microscopy, radiography, histology, and more.

### [](#limitations)Limitations

This model was developed using English corpora, and thus may be considered English-only. This model is evaluated on a narrow set of biomedical benchmark tasks, described in [LLaVA-Med paper](https://aka.ms/llava-med). As such, it is not suitable for use in any clinical setting. Under some conditions, the model may make inaccurate predictions and display limitations, which may require additional mitigation strategies. In particular, this model is likely to carry many of the limitations of the model from which it is derived, [LLaVA](https://llava-vl.github.io/).

Further, this model was developed in part using the [PMC-15M](https://aka.ms/biomedclip-paper) dataset. The figure-caption pairs that make up this dataset may contain biases reflecting the current practice of academic publication. For example, the corresponding papers may be enriched for positive findings, contain examples of extreme cases, and otherwise reflect distributions that are not representative of other sources of biomedical data.

[](#install)Install
-------------------

1.  Clone the [LLaVA-Med Github repository](https://github.com/microsoft/LLaVA-Med) and navigate to LLaVA-Med folder

    https://github.com/microsoft/LLaVA-Med.git
    cd LLaVA-Med
    

2.  Install Package: Create conda environment

    conda create -n llava-med python=3.10 -y
    conda activate llava-med
    pip install --upgrade pip  # enable PEP 660 support
    

3.  Install additional packages for training cases

    pip uninstall torch torchvision -y
    pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
    pip install openai==0.27.8
    pip uninstall transformers -y
    pip install git+https://github.com/huggingface/transformers@cae78c46
    pip install -e .
    

    pip install einops ninja open-clip-torch
    pip install flash-attn --no-build-isolation
    

[](#serving)Serving
-------------------

The model weights above are _delta_ weights. The usage of LLaVA-Med checkpoints should comply with the base LLM's model license: [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).

Instructions:

1.  Download the delta weights.
2.  Get the original LLaMA weights in the huggingface format by following the instructions [here](https://huggingface.co/docs/transformers/main/model_doc/llama).
3.  Use the following scripts to get LLaVA-Med weights by applying our delta. In the script below, set the --delta argument to the path of the unzipped `llava_med_in_text_60k_delta` directory. It can be adapted for other delta weights by changing the `--delta` argument (and base/target accordingly).

    python3 -m llava.model.apply_delta \
        --base /path/to/llama-7b \
        --target /output/path/to/llava_med_in_text_60k \
        --delta path/to/llava_med_in_text_60k_delta
    

[](#evaluation)Evaluation
-------------------------

### [](#medical-visual-chat-gpt-assisted-evaluation)Medical Visual Chat (GPT-assisted Evaluation)

Our GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.

1.  Generate LLaVA-Med responses

    python model_vqa.py \
        --model-name ./checkpoints/LLaVA-7B-v0 \
        --question-file data/eval/llava_med_eval_qa50_qa.jsonl \
        --image-folder data/images/ \
        --answers-file /path/to/answer-file.jsonl
    

2.  Evaluate the generated responses. In our case, [`llava_med_eval_qa50_qa.jsonl`](/data/eval/llava_med_eval_qa50_qa.jsonl) contains the questions, context (captions and inline-mentions) and responses generated by text-only GPT-4 (0314), which we treat as ground truth.

    python llava/eval/eval_multimodal_chat_gpt_score.py \
        --question_input_path data/eval/llava_med_eval_qa50_qa.jsonl \
        --input_path /path/to/answer-file.jsonl \
        --output_path /path/to/save/gpt4-eval-for-individual-answers.jsonl
    

3.  Summarize the evaluation results

    python summarize_gpt_review.py
    

### [](#medical-vqa)Medical VQA

Three Medical VQA datasets are considered in our experiments, including VQA-Rad, SLAKE, Pathology-VQA. We use VQA-Rad as the running example to illustrate how LLaVA-Med is applied to a downstream scenario.

#### [](#--prepare-data)\- Prepare Data

1.  Please see VQA-Rad [repo](https://paperswithcode.com/dataset/vqa-rad) for setting up the dataset.
2.  Generate VQA-Rad dataset for LLaVA-Med conversation-style format (the same format with instruct tuning). For each dataset, we process it into three components: `train.json`, `test.json`, `images`.

#### [](#--fine-tuning)\- Fine-tuning

To achieve the higher performance for given a downstream dataset, the same full-model tuning script with instruct tuning is used to continue train LLaVA-Med.

Detailed script to fine-tune to downstream datasets: LLaVA-Med-7B, 8x A100 (40G). Time: ~1 hour.

    torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
        llava/train/train_mem.py \
        --model_name_or_path /path/to/checkpoint_llava_med_instruct_60k_inline_mention \
        --data_path /path/to/eval/vqa_rad/train.json \
        --image_folder /path/to/eval/vqa_rad/images \
        --vision_tower openai/clip-vit-large-patch14 \
        --mm_vision_select_layer -2 \
        --mm_use_im_start_end True \
        --bf16 True \
        --output_dir /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad \
        --num_train_epochs 3 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 4 \
        --gradient_accumulation_steps 8 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 5000 \
        --save_total_limit 3 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --tf32 True \
        --fsdp "full_shard auto_wrap" \
        --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
        --model_max_length 2048 \
        --gradient_checkpointing True \
        --lazy_preprocess True \
        --report_to wandb

#### [](#--evaluation)\- Evaluation

Depending on which checkpoint is employed in evaluation, zero-shot performance is reported on medical instruct tuned checkpoint (eg, [LLaVA-Med-7B](/path/to/checkpoint_llava_med_instruct_60k_inline_mention)), and fine-tuned performance is reported on checkpoint that has been further tuned on training set of the downstream datasets (eg, [LLaVA-Med-7B-VQA-Rad](/path/to/checkpoint_llava_med_instruct_60k_inline_mention/fine_tuned/vqa_rad) ).

(a) Generate LLaVA responses on ScienceQA dataset

(a.1). \[Option 1\] Multiple-GPU inference You may evaluate this with multiple GPUs, and concatenate the generated jsonl files. Please refer to our script for [batch evaluation](/microsoft/llava-med-7b-delta/blob/main/scripts/chunyl/finetune_on_benchmarks/eval_med_dataset_batch.sh).

    python llava/eval/run_med_datasets_eval_batch.py --num-chunks 8  --model-name /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad \
        --question-file path/to/eval/vqa_rad/test.json \
        --image-folder path/to/eval/vqa_rad/images \
        --answers-file /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad/test-answer-file.jsonl
    

(a.2). \[Option 2\] Single-GPU inference

    python llava/eval/model_vqa_med.py --model-name /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad \
        --question-file path/to/eval/vqa_rad/test.json \
        --image-folder path/to/eval/vqa_rad/images \
        --answers-file /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad/test-answer-file.jsonl
    

(b) Evaluate the generated responses

(b.1). \[Option 1\] Evaluation for all three VQA datasets

    
    python llava/eval/run_eval_batch.py \
        --pred_file_parent_path /path/to/llava-med \
        --target_test_type test-answer-file
    

It collects the decoding results of all predictions files under the project path, computes the corresponding evaluation metrics, and outputs the results in "`eval_results_med_datasets.jsonl`". To analyze the score, we provdie ipython notebook [run\_eval\_metrics.ipynb](/microsoft/llava-med-7b-delta/blob/main/llava/notebook/run_eval_metrics.ipynb).

(b.2). \[Option 2\] Evaluation for on one specific VQA dataset

    python llava/eval/run_eval.py \
        --gt /path/to/eval/vqa_rad/test.json \
        --pred /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad/test-answer-file.jsonl
    

Please find the LLaVA-Med performance in [llava\_med\_performance.md](/microsoft/llava-med-7b-delta/blob/main/docs/llava_med_performance.md) or in the paper.

[](#acknowledgement)Acknowledgement
-----------------------------------

*   Our project is built upon [LLaVA](https://github.com/lm-sys/FastChat) and [Vicuna](https://github.com/lm-sys/FastChat): They provide our base models with the amazing multimodal and langauge capabilities, respectively!

If you find LLaVA-Med useful for your your research and applications, please cite using this BibTeX:

    @article{li2023llavamed,
      title={Llava-med: Training a large language-and-vision assistant for biomedicine in one day},
      author={Li, Chunyuan and Wong, Cliff and Zhang, Sheng and Usuyama, Naoto and Liu, Haotian and Yang, Jianwei and Naumann, Tristan and Poon, Hoifung and Gao, Jianfeng},
      journal={arXiv preprint arXiv:2306.00890},
      year={2023}
    }
    

[](#related-projects)Related Projects
-------------------------------------

*   [LLaVA](https://llava-vl.github.io/)
*   [BioMed CLIP](https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224)
*   [Instruction Tuning with GPT-4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)

## Model overview

`llava-med-7b-delta` is a large language and vision assistant model focused on the biomedical domain. It was developed by researchers at Microsoft and is based on the LLaMA model. The model was initialized with the general-domain LLaVA and then continuously trained in a curriculum learning fashion, first on biomedical concept alignment and then on full-blown instruction tuning.

This model is similar to other medical-focused language models like [MedAlpaca 13b](https://aimodels.fyi/models/huggingFace/medalpaca-13b-medalpaca) and [MedAlpaca 7b](https://aimodels.fyi/models/huggingFace/medalpaca-7b-medalpaca), which are also fine-tuned on medical datasets to improve performance on tasks like question answering and medical dialogue. However, `llava-med-7b-delta` goes beyond text-only capabilities by incorporating visual understanding through its connection to the general-domain LLaVA model.

The model was also trained on the PMC-15M dataset, a large-scale parallel image-text dataset for biomedical vision-language processing, which is the same dataset used to train the [BiomedCLIP-PubMedBERT_256-vit_base_patch16_224](https://aimodels.fyi/models/huggingFace/biomedclip-pubmedbert256-vitbasepatch16224-microsoft) model.

## Model inputs and outputs

### Inputs
- **Images**: The model can accept images as input, enabling it to perform visual reasoning and understanding tasks in the biomedical domain.
- **Text**: The model can also accept text input, allowing it to engage in language-based interactions and tasks.

### Outputs
- **Text generation**: The model can generate relevant and coherent text in response to prompts, leveraging its biomedical knowledge.
- **Multimodal understanding**: The model can combine its understanding of both images and text to perform tasks like visual question answering or image captioning.

## Capabilities

`llava-med-7b-delta` exhibits strong performance on a variety of biomedical tasks, particularly those that require both language and visual understanding. For example, the model can accurately describe the contents of a medical image, answer questions about a radiological scan, or provide step-by-step instructions for a medical procedure.

The model's visual understanding capabilities are a key strength, allowing it to excel at tasks like interpreting medical images and diagrams. This sets it apart from language-only models that may struggle with visual inputs.

## What can I use it for?

Researchers and developers working on biomedical applications could use `llava-med-7b-delta` for a variety of projects, such as:

- **Medical image analysis**: The model could be used to build tools that analyze medical images, such as X-rays or MRI scans, and provide insights or recommendations.
- **Biomedical question answering**: The model could be integrated into chatbots or virtual assistants to answer questions about medical conditions, treatments, or procedures.
- **Multimodal medical education**: The model could be used to create interactive learning experiences that combine text, images, and video to teach medical concepts.

However, it's important to note that the model should only be used for research purposes and not for any clinical or deployed applications, as it has not been thoroughly tested for real-world use.

## Things to try

One interesting aspect of `llava-med-7b-delta` is its ability to combine visual and language understanding to tackle complex biomedical tasks. For example, you could try prompting the model with a medical image and asking it to provide a step-by-step explanation of the procedure or condition depicted. This would showcase the model's capacity to integrate its knowledge of both visual and textual information.

Another avenue to explore would be using the model for creative or exploratory tasks, such as generating medical illustrations or diagrams based on textual descriptions. This could inspire new ways of visualizing and communicating biomedical concepts.

Ultimately, the versatility of `llava-med-7b-delta` makes it a valuable tool for researchers and developers working to advance the state of the art in biomedical artificial intelligence.