OpenFlamingo

OpenFlamingo-9B-deprecated

The OpenFlamingo-9B-deprecated model is an open-source implementation of DeepMind's Flamingo models, built on top of the CLIP ViT-L/14 vision encoder and the LLaMA-7B language model. As an early checkpoint in the ongoing development of OpenFlamingo, it was trained on a mixture of the LAION-2B and Multimodal C4 datasets. Although this checkpoint has been deprecated in favor of newer ones, it can still be used within the OpenFlamingo codebase.

Model inputs and outputs

Inputs
- Images and text prompts

Outputs
- Image-conditioned text generation
- Visual question answering

Capabilities

The OpenFlamingo-9B-deprecated model can perform a range of vision-language tasks, such as image captioning and visual question answering. It has shown promising results on the COCO captioning and VQAv2 benchmarks, with performance improving as the number of few-shot examples increases.

What can I use it for?

The OpenFlamingo-9B-deprecated model is intended for academic research purposes only; commercial use is prohibited, in line with the LLaMA non-commercial license. Potential use cases include exploring multimodal AI systems, testing new vision-language architectures, and developing novel applications that combine computer vision and language understanding.

Things to try

Researchers can experiment with fine-tuning the model on specialized datasets or tasks, or use it as a starting point for developing new vision-language models. Its performance can also be analyzed further and compared against other state-of-the-art approaches in multimodal AI.
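The few-shot behavior mentioned above comes from Flamingo-style interleaved prompting: in-context example images and their texts are concatenated ahead of the query image, using the special `<image>` and `<|endofchunk|>` tokens from the OpenFlamingo codebase. The helper below is a hypothetical sketch of how such a prompt string could be assembled; `build_few_shot_prompt` is not part of the library, and the real pipeline also passes the corresponding images through the vision encoder.

```python
def build_few_shot_prompt(demo_texts, query_text):
    """Sketch of an interleaved few-shot prompt (assumed format).

    demo_texts: texts for the in-context example images, in order;
    query_text: open-ended text prefix for the image to be completed.
    Each demonstration is an <image> token followed by its text and
    terminated by <|endofchunk|>; the query is left unterminated so
    the model continues it.
    """
    parts = [f"<image>{text}<|endofchunk|>" for text in demo_texts]
    parts.append(f"<image>{query_text}")
    return "".join(parts)

# Two captioning demonstrations, then the query image's prefix:
prompt = build_few_shot_prompt(
    ["An image of two cats.", "An image of a bathroom sink."],
    "An image of",
)
print(prompt)
```

Adding more demonstrations simply prepends more `<image>…<|endofchunk|>` chunks, which is why the reported COCO and VQAv2 scores scale with the number of few-shot examples.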

Updated 5/17/2024