OmniFusion Technical Report
- URL: http://arxiv.org/abs/2404.06212v1
- Date: Tue, 9 Apr 2024 11:00:19 GMT
- Title: OmniFusion Technical Report
- Authors: Elizaveta Goncharova, Anton Razzhigaev, Matvey Mikhalchuk, Maxim Kurkin, Irina Abdullaeva, Matvey Skripkin, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov
- Abstract summary: We propose the OmniFusion model, built on a pretrained large language model (LLM) with adapters for the visual modality.
We evaluate and compare several architecture design principles for better coupling of text and visual data.
Experiments on 8 visual-language benchmarks show that the best OmniFusion setup achieves top scores compared with open-source LLaVA-like solutions.
- Score: 7.332426123896801
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose the OmniFusion model, based on a pretrained LLM and adapters for the visual modality. We evaluated and compared several architecture design principles for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternVIT, etc.) and their fusion approaches, the image encoding method (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on 8 visual-language benchmarks show that the best OmniFusion setup achieves the top score across different VQA tasks in comparison with open-source LLaVA-like solutions: VizWiz, Pope, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU. We also demonstrate a variety of situations where OmniFusion provides highly detailed answers in different domains: housekeeping, sightseeing, culture, medicine, handwritten and scanned equation recognition, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
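To illustrate the adapter-based coupling described in the abstract, below is a minimal, hypothetical PyTorch sketch of an MLP adapter that projects frozen visual-encoder features (e.g., CLIP ViT or SigLIP patch embeddings) into the LLM's token-embedding space. Class names, dimensions, and the patch count are illustrative assumptions, not the released OmniFusion implementation (see the GitHub repository for the actual code).

```python
# Hypothetical sketch: an MLP adapter mapping frozen visual-encoder features
# into the LLM embedding space so they can be prepended to text embeddings.
# Dimensions (1152, 4096) and the patch count (729) are assumptions for a
# SigLIP-style encoder and a 7B LLM, not values from the OmniFusion release.
import torch
import torch.nn as nn


class VisualMLPAdapter(nn.Module):
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen encoder
        # returns:        (batch, num_patches, llm_dim) pseudo-token embeddings
        return self.proj(patch_features)


# Usage: concatenate projected visual tokens with text embeddings before the LLM.
adapter = VisualMLPAdapter()
visual_tokens = adapter(torch.randn(1, 729, 1152))          # SigLIP-style patch grid
text_embeds = torch.randn(1, 32, 4096)                       # embeddings of the text prompt
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the LLM

```
The same interface would apply to a transformer adapter or to tile-based encoding, where each tile's patch grid is projected separately and the resulting pseudo-tokens are concatenated.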
Related papers
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Ovis: Structural Embedding Alignment for Multimodal Large Language Model [41.32013722697081]
Ovis is a novel MLLM architecture designed to structurally align visual and textual embeddings.
Ovis integrates an additional learnable visual embedding table into the visual encoder's process.
Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs.
arXiv Detail & Related papers (2024-05-31T13:59:18Z)
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models [97.40590590880144]
We develop an extensive Multimodality Large Language Model (MLLM) series.
We assemble a comprehensive dataset covering publicly available resources in language, vision, and vision-language tasks.
We obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities.
arXiv Detail & Related papers (2024-02-08T18:59:48Z)
- OneLLM: One Framework to Align All Modalities with Language [90.14915575477197]
We present OneLLM, an MLLM that aligns eight modalities to language using a unified framework.
OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning.
arXiv Detail & Related papers (2023-12-06T18:59:19Z)
- SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Building on the proposed joint mixing, we introduce an efficient strategy to better capture fine-grained appearances of high-resolution images.
We hope our work sheds light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)
- VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [46.55920956687346]
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Mixture-of-Modality-Experts (MoME) Transformer network.
Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks.
We propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs.
arXiv Detail & Related papers (2021-11-03T17:20:36Z)
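For readers unfamiliar with the MoME design summarized in the VLMo entry above, the following is an illustrative sketch (not the official VLMo code) of a Transformer block that shares self-attention across modalities but routes tokens to a modality-specific feed-forward expert. Dimensions and the three expert names are assumptions chosen for clarity.

```python
# Hypothetical MoME-style block: shared self-attention, modality-specific FFN experts.
import torch
import torch.nn as nn


class MoMEBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, ffn_dim: int = 3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality: vision, language, and fused vision-language.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
            for name in ("vision", "language", "vision_language")
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.experts[modality](self.norm2(x))
        return x


# Usage: the same block can serve in a dual encoder (modality="vision" or "language")
# or as a fusion encoder over concatenated tokens (modality="vision_language").
block = MoMEBlock()
image_tokens = torch.randn(1, 197, 768)
text_tokens = torch.randn(1, 32, 768)
fused = block(torch.cat([image_tokens, text_tokens], dim=1), modality="vision_language")
```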