OmniFusion

Maintainer: AIRI-Institute - Last updated 9/6/2024

Model Overview

OmniFusion is an advanced multimodal AI model designed to extend the capabilities of traditional language processing systems. It integrates additional data modalities such as images, with support for audio, 3D, and video content planned. The model was developed by the AIRI-Institute and is built on the open-source Mistral-7B core.

OmniFusion comes in two versions: the first uses a single visual encoder (CLIP-ViT-L), while the second uses two encoders (CLIP-ViT-L and DINOv2). The key component is an adapter mechanism that lets the language model interpret and incorporate information from different modalities.
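To make the adapter idea concrete, here is a minimal PyTorch sketch of how visual features from the two encoders might be projected into the language model's embedding space. The encoder output widths, the hidden size, and the `VisualAdapter` module are illustrative assumptions, not the actual OmniFusion implementation.

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Hypothetical adapter: maps concatenated visual features into the LLM embedding space."""
    def __init__(self, clip_dim=1024, dino_dim=1024, llm_dim=4096, hidden_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim + dino_dim, hidden_dim),  # fuse the two encoder streams
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),              # match the Mistral-7B hidden size
        )

    def forward(self, clip_feats, dino_feats):
        # clip_feats, dino_feats: (batch, n_patches, encoder_dim)
        fused = torch.cat([clip_feats, dino_feats], dim=-1)
        return self.proj(fused)  # (batch, n_patches, llm_dim) "visual tokens"

# Dummy example: 576 patch features per image (an assumed patch count)
adapter = VisualAdapter()
visual_tokens = adapter(torch.randn(1, 576, 1024), torch.randn(1, 576, 1024))
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```

The resulting visual tokens can then be spliced into the sequence of text-token embeddings so the Mistral core attends to them like ordinary tokens.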

The model was trained on a diverse dataset covering tasks such as image captioning, VQA, WebQA, OCRQA, and conversational QA. Training proceeds in two stages: the adapter is first pretrained on image captioning while the language model stays frozen, then the Mistral core is unfrozen to improve its handling of dialog formats and complex queries.
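A rough PyTorch sketch of such a two-stage schedule is shown below: only the adapter receives gradients in stage one, and the language model is unfrozen in stage two. The module names and learning rates are placeholders, not the project's actual training configuration.

```python
import torch

def configure_stage(llm, adapter, stage: int):
    """Illustrative freeze/unfreeze logic for a two-stage multimodal training schedule."""
    for p in adapter.parameters():
        p.requires_grad = True              # the adapter is trained in both stages
    for p in llm.parameters():
        p.requires_grad = (stage == 2)      # the LLM stays frozen in stage 1

    trainable = [p for p in list(adapter.parameters()) + list(llm.parameters())
                 if p.requires_grad]
    lr = 1e-3 if stage == 1 else 2e-5       # smaller learning rate once the LLM is trainable
    return torch.optim.AdamW(trainable, lr=lr)

# Stage 1: pretrain the adapter on image captioning with the LLM frozen
# optimizer = configure_stage(llm, adapter, stage=1)
# Stage 2: unfreeze the Mistral core and continue on dialog-style multimodal data
# optimizer = configure_stage(llm, adapter, stage=2)
```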

Model Inputs and Outputs

Inputs

  • Text prompts
  • Images
  • Potentially audio, 3D and video content in the future

Outputs

  • Multimodal responses that synthesize information from various input modalities
  • Text responses that draw on richer context than traditional text-only models can provide

Capabilities

OmniFusion extends the capabilities of language models by enabling them to understand and generate responses that integrate information from multiple modalities. For example, the model can answer questions about the contents of an image, generate image captions, or engage in multimodal dialog that references both text and visual elements.
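At inference time, a question about an image can be answered by prepending the adapter's visual tokens to the embedded text prompt and letting the language model generate from the combined sequence. The sketch below uses the base Mistral-7B checkpoint and a placeholder adapter; it illustrates the general pattern, not OmniFusion's published inference code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder identifiers: substitute the actual OmniFusion checkpoint and adapter weights.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
llm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16)

def answer(visual_tokens, question, max_new_tokens=64):
    """VQA sketch: condition text generation on projected visual tokens."""
    text_ids = tokenizer(question, return_tensors="pt").input_ids
    text_embs = llm.get_input_embeddings()(text_ids)                        # (1, T, 4096)
    inputs_embeds = torch.cat([visual_tokens.to(text_embs.dtype), text_embs], dim=1)
    out = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# answer(adapter(clip_feats, dino_feats), "Question: what is shown in this image? Answer:")
```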

What Can I Use It For?

OmniFusion opens up new possibilities for multimodal applications, such as:

  • Intelligent image-based assistants that can answer questions and describe the contents of images
  • Multimodal chatbots that can engage in dialog referencing both text and visual information
  • Automated image captioning and description generation
  • Multimodal question answering systems that can reason about both text and visual input

Things to Try

Some interesting things to explore with OmniFusion include:

  • Providing the model with a diverse set of multimodal prompts (e.g. an image plus a text question) and observing how it integrates the information to generate a response
  • Evaluating the model's performance on specialized multimodal benchmarks or datasets to better understand its strengths and limitations
  • Experimenting with different ways of structuring the input (e.g. using custom tokens to mark visual data) to see how it impacts the model's multimodal reasoning capabilities (one such scheme is sketched after this list)
  • Investigating how OmniFusion compares to other multimodal models in terms of performance, flexibility, and ease of use for specific applications
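For the input-structuring experiment mentioned above, one common pattern (an assumption here, not documented OmniFusion behavior) is to register special marker tokens and wrap the visual span with them, so the model can tell where image content begins and ends:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Hypothetical marker tokens; they are not part of the original Mistral vocabulary,
# so the embedding table would need resizing (model.resize_token_embeddings) before training.
tokenizer.add_special_tokens({"additional_special_tokens": ["<img_start>", "<img_end>"]})

prompt = "<img_start><img_end> Question: what objects are on the table? Answer:"
ids = tokenizer(prompt, return_tensors="pt").input_ids
print(tokenizer.convert_ids_to_tokens(ids[0]))

# At inference time, the positions between <img_start> and <img_end> would be filled
# with the adapter's visual tokens before the sequence is passed to the language model.
```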


This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

omnivision-968M

NexaAIDev

Omnivision is a compact, sub-billion-parameter (968M) multimodal model developed by NexaAIDev. It is designed to process both visual and text inputs, making it well-suited for on-device applications. Compared to the previous LLaVA architecture, Omnivision achieves a 9x token reduction, cutting image tokens from 729 to 81, which lowers latency and computational cost (the arithmetic is sketched at the end of this entry). The model also incorporates DPO (Direct Preference Optimization) training to reduce hallucinations and make its outputs more trustworthy.

Model Inputs and Outputs

Omnivision processes both visual and text inputs and handles tasks such as visual question answering and image captioning.

Inputs

  • Images
  • Text-based queries or prompts

Outputs

  • Answers to questions about images
  • Captions describing the contents of images

Capabilities

Omnivision outperforms the previous nanoLLAVA model across benchmark datasets including MM-VET, ChartQA, MMMU, ScienceQA, and POPE. It is particularly strong at visual question answering and image captioning, which makes it well-suited for on-device applications that require these capabilities.

What Can I Use It For?

Omnivision is designed for use cases that involve processing visual and textual information, such as:

  • Visual question answering: answering questions about the contents of images
  • Image captioning: generating natural language descriptions of images
  • On-device applications: its small size and low computational requirements make it suitable for deployment on edge devices, enabling real-time visual-language processing

Things to Try

With Omnivision, you can experiment with a variety of visual-language tasks, such as:

  • Answering questions about the contents of images
  • Generating captions for images
  • Exploring the model's behavior on diverse datasets, including those related to charts, science, and more

You can get started by trying the interactive demo or setting up the model locally with the Nexa-SDK framework.
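As noted above, the 9x token reduction works out exactly: 729 image tokens correspond to a 27x27 patch grid and 81 tokens to a 9x9 grid. One plausible way to get there (an illustration, not the documented Omnivision mechanism) is to pool each 3x3 neighborhood of patch embeddings into a single token:

```python
import torch

def pool_image_tokens(patch_embs: torch.Tensor, factor: int = 3) -> torch.Tensor:
    """Average-pool a (batch, 27*27, dim) patch sequence down to (batch, 9*9, dim)."""
    b, n, d = patch_embs.shape
    side = int(n ** 0.5)                                    # 27
    grid = patch_embs.view(b, side, side, d)
    blocks = grid.view(b, side // factor, factor, side // factor, factor, d)
    pooled = blocks.mean(dim=(2, 4))                        # average each 3x3 spatial block
    return pooled.reshape(b, (side // factor) ** 2, d)      # (batch, 81, dim)

tokens = torch.randn(1, 729, 1024)  # dummy patch embeddings
print(pool_image_tokens(tokens).shape)  # torch.Size([1, 81, 1024])
```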

Updated 12/8/2024

Image-to-Text

Ovis1.6-Gemma2-9B

AIDC-AI

Ovis1.6-Gemma2-9B is a novel Multimodal Large Language Model (MLLM) architecture developed by AIDC-AI. It is designed to structurally align visual and textual embeddings, building on the previous Ovis1.5 model. It aims to enhance high-resolution image processing, train on larger and more diverse datasets, and refine the training process with DPO training following instruction tuning.

Model Inputs and Outputs

Ovis1.6-Gemma2-9B is a multimodal model that takes a combination of text prompts and images and generates relevant text responses.

Inputs

  • Text prompt: a text-based prompt describing the desired output
  • Image: an image that provides visual context for the text prompt

Outputs

  • Generated text: the model's response, based on the provided text prompt and image

Capabilities

With just 10 billion parameters, Ovis1.6-Gemma2-9B leads the OpenCompass benchmark among open-source MLLMs under 30 billion parameters, demonstrating strong performance for its size.

What Can I Use It For?

Ovis1.6-Gemma2-9B can be used for a variety of multimodal tasks, such as image captioning, visual question answering, and image-grounded text generation. Its ability to align visual and textual embeddings makes it a useful tool for applications that require understanding the relationship between images and text.

Things to Try

Developers and researchers can explore the model's capabilities by experimenting with different text prompts and images. The pretrained weights are available on the Hugging Face platform, making the model accessible for further fine-tuning and customization.
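To experiment with it, the weights can presumably be loaded through the Hugging Face transformers library with remote code enabled. The repository id below is inferred from the maintainer and model names, and the two tokenizer helpers are assumptions; check the model card for the exact loading and preprocessing steps.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed repository id (maintainer/model name); verify against the actual model card.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis1.6-Gemma2-9B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # Ovis ships custom modeling code for its visual tokenizer
).eval()

# Assumed helpers exposed by the custom code; the model card documents the real API.
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()
```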

Updated 10/18/2024

Image-to-Text

OmniVLM-968M

NexaAIDev

OmniVLM-968M is a compact multimodal model that processes visual and text inputs, built on LLaVA's architecture with key improvements. The model reduces image tokens from 729 to 81, cutting computational cost while maintaining strong performance. Compared to SmolVLM Instruct and NVLM-D-72B, this model prioritizes edge-device optimization and speed.

Model Inputs and Outputs

The model processes images and text through a streamlined architecture that combines Qwen2.5-0.5B-Instruct as the base language model with SigLIP-400M for vision encoding at 384 resolution.

Inputs

  • Images: visual content at 384 resolution
  • Text: questions and prompts about images
  • Combined: sequences of images and text for multimodal tasks

Outputs

  • Text responses: natural language answers and descriptions
  • Image captions: detailed scene descriptions
  • Visual analysis: answers to specific questions about image content

Capabilities

The architecture excels at visual question answering and image captioning. The model achieves strong benchmark scores, with 71.0 on ScienceQA and 93.3 on POPE. Performance testing shows caption generation in under 2 seconds on an M4 Pro MacBook while using minimal resources (988 MB RAM, 948 MB storage).

What Can I Use It For?

The model serves edge-computing applications in visual understanding. Like OmniLMM-12B, it targets practical applications, but with a focus on lightweight deployment. Integration options include mobile apps, embedded systems, and real-time image-analysis tools where processing speed and resource efficiency matter.

Things to Try

Test the model's visual analysis capabilities with diverse image types, from simple objects to complex scenes. The DPO training makes it more reliable at producing accurate descriptions without fabricating details. Try sequential image-analysis tasks to test its consistency, and explore edge cases in lighting conditions or partial object visibility to understand its limitations.

Updated 12/8/2024

Image-to-Text
