
CLIP-Vision-BERT VQA FT 6k



# CLIP-Vision-BERT Multilingual VQA Model

Fine-tuned CLIP-Vision-BERT on translated VQAv2 image-text pairs using a sequence classification objective. We translate the dataset into three languages other than English — French, German, and Spanish — using the MarianMT models. This model is based on VisualBERT, which was introduced in this paper and first released in this repository. The output is 3129 class logits, the same classes used by the VisualBERT authors. The initial weights are loaded from the Conceptual-12M 60k checkpoint. We trained the CLIP-Vision-BERT VQA model during the community week hosted by HuggingFace 🤗 using JAX/Flax.

## Model description

CLIP-Vision-BERT is a modified BERT model which takes in visual embeddings from the CLIP-Vision transformer and concatenates them with BERT's textual embeddings before passing them to BERT's self-attention layers. This is done to enable deep cross-modal interaction between the two modalities.

## Intended uses & limitations ❗️

This model is fine-tuned on a multi-translated version of the visual question answering task, VQAv2. Since VQAv2 is a dataset scraped from the internet, it contains biases which will also affect all fine-tuned versions of this model.

## How to use ❓

You can use this model directly for visual question answering. You will need to clone the model from here. An example of usage is shown below:

## Training data 🏋🏻‍♂️

The CLIP-Vision-BERT model was fine-tuned on the translated version of the VQAv2 dataset in four languages using Marian: English, French, German, and Spanish. Hence, the dataset is four times the size of the original English questions. The dataset questions and image URLs/paths can be downloaded from flax-community/multilingual-vqa.

### Data Cleaning 🧹

The original dataset contains 443,757 train and 214,354 validation image-question pairs. We only use the `multiple_choice_answer` field. Answers which are not present in the 3129 classes are mapped to the `<unk>` label.
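The cross-modal concatenation described in the model description section can be sketched with plain NumPy. The shapes below (50 image-patch embeddings, 20 token embeddings, hidden size 768) are illustrative assumptions, not the model's actual configuration:

```python
import numpy as np

# Illustrative shapes (assumed, not taken from the real model config):
# 50 CLIP-Vision patch embeddings and 20 BERT token embeddings,
# both in a shared hidden size of 768.
visual_embeds = np.random.randn(1, 50, 768)   # from the CLIP-Vision transformer
textual_embeds = np.random.randn(1, 20, 768)  # from the BERT embedding layer

# CLIP-Vision-BERT concatenates the two along the sequence axis, so the
# self-attention layers attend jointly over image patches and text tokens.
fused = np.concatenate([visual_embeds, textual_embeds], axis=1)
print(fused.shape)  # (1, 70, 768)
```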
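The answer-cleaning step above can be sketched as follows. The toy four-entry vocabulary and the `<unk>` entry's position are placeholders; only the fallback-mapping logic mirrors the description:

```python
# Toy stand-in for the 3129-class answer vocabulary; in the real setup
# the label list is the one used by the VisualBERT authors.
answer2label = {"yes": 0, "no": 1, "2": 2, "<unk>": 3}

def clean_answer(multiple_choice_answer: str) -> int:
    """Map an answer string to its class id, falling back to <unk>."""
    return answer2label.get(multiple_choice_answer, answer2label["<unk>"])

print(clean_answer("yes"))     # 0
print(clean_answer("banana"))  # not in the vocabulary -> 3
```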
### Splits

We use the original train-val splits from the VQAv2 dataset. After translation, we get 1,775,028 train image-text pairs and 857,416 validation image-text pairs.

## Training procedure 👨🏻‍💻

### Preprocessing

The texts are lowercased and tokenized using WordPiece with a shared vocabulary size of approximately 110,000. The beginning of a new document is marked with [CLS] and the end of one with [SEP].

### Fine-tuning

The model was fine-tuned on a Google Cloud Engine TPUv3-8 machine (335 GB of RAM, 1000 GB of disk, 96 CPU cores) with 8 v3 TPU cores for 6k steps, using a per-device batch size of 128 and a maximum sequence length of 128. The optimizer used is AdamW with a learning rate of 5e-5, learning-rate warmup for 1600 steps, and linear decay of the learning rate thereafter. We tracked experiments using TensorBoard. Here is a link to the main dashboard: CLIP Vision BERT VQAv2 Fine-tuning Dashboard.

The model at this checkpoint reached an eval accuracy of 0.49 on our multilingual VQAv2 dataset.

## Team Members

- Gunjan Chhablani @gchhablani
- Bhavitvya Malik @bhavitvyamalik

## Acknowledgements

We thank Nilakshan Kunananthaseelan for helping us whenever he could get a chance. We also thank Abheesht Sharma for helping in discussions during the initial phases. Luke Melas helped us get the CC-12M data onto our TPU-VMs, and we are very grateful to him. This project would not have been possible without the help of Patrick and Suraj, who met with us frequently, helped review our approach, and guided us throughout the project. Huge thanks to the HuggingFace 🤗 and Google JAX/Flax teams for such a wonderful community week, for answering our queries on the Slack channel, and for providing us with the TPU-VMs.
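The preprocessing convention from the training procedure (lowercasing, then wrapping with [CLS] and [SEP]) can be illustrated with a whitespace tokenizer standing in for the actual WordPiece tokenizer:

```python
def preprocess(text: str, max_len: int = 128) -> list[str]:
    """Lowercase, tokenize (whitespace split as a stand-in for WordPiece),
    and add the special tokens described in the Preprocessing section."""
    tokens = text.lower().split()
    return (["[CLS]"] + tokens + ["[SEP]"])[:max_len]

# A German question, as in the translated VQAv2 data:
print(preprocess("Was ist auf dem Bild?"))
# ['[CLS]', 'was', 'ist', 'auf', 'dem', 'bild?', '[SEP]']
```

A real WordPiece tokenizer would further split words into subword units, but the [CLS]/[SEP] framing and lowercasing are the same.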
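The warmup-then-linear-decay learning-rate schedule from the fine-tuning section can be written out directly. One assumption is baked in: the decay is taken to reach zero at the final (6k) step, which the card does not state explicitly:

```python
PEAK_LR = 5e-5       # AdamW learning rate from the fine-tuning section
WARMUP_STEPS = 1600  # linear warmup steps
TOTAL_STEPS = 6000   # total fine-tuning steps (6k)

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR over WARMUP_STEPS, then linear decay.
    Decaying to zero exactly at TOTAL_STEPS is an assumption."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

print(learning_rate(800))   # halfway through warmup -> 2.5e-05
print(learning_rate(1600))  # peak -> 5e-05
```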



