mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
- URL: http://arxiv.org/abs/2408.04840v2
- Date: Tue, 13 Aug 2024 08:10:32 GMT
- Title: mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
- Authors: Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
- Abstract summary: We introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding.
Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space.
- Score: 71.40705814904898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
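The abstract does not spell out the internals of the hyper attention blocks, so the following is only a minimal, hedged sketch of what a language-guided cross-attention layer of this general kind could look like; the class name, dimensions, and gating scheme are illustrative assumptions, not the released mPLUG-Owl3 implementation.

```python
import torch
import torch.nn as nn

class HyperAttentionBlockSketch(nn.Module):
    """Illustrative sketch only: a transformer layer that combines text self-attention
    with gated cross-attention over vision tokens projected into the language space.
    Names, sizes, and the gating scheme are assumptions, not mPLUG-Owl3's code."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_proj = nn.Linear(d_model, d_model)    # map vision features into the language space
        self.gate = nn.Parameter(torch.zeros(d_model)) # learned gate, starts closed
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # text:   (batch, text_len, d_model) hidden states from the language model
        # vision: (batch, num_image_tokens, d_model) features from the visual encoder
        h, _ = self.self_attn(text, text, text)
        text = text + h
        v = self.norm_v(self.vis_proj(vision))
        c, _ = self.cross_attn(self.norm_q(text), v, v)
        # gated residual: visual information is blended into the language stream
        return text + torch.tanh(self.gate) * c

# usage sketch
block = HyperAttentionBlockSketch()
txt = torch.randn(2, 32, 1024)
img = torch.randn(2, 4 * 256, 1024)  # e.g. four images' worth of visual tokens
print(block(txt, img).shape)  # torch.Size([2, 32, 1024])
```

The zero-initialized gate means the cross-attention branch contributes nothing at the start of training, a common choice when adding visual cross-attention to a pretrained language model; whether mPLUG-Owl3 uses this exact scheme is not stated in the abstract.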
Related papers
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z)
- SEED-Story: Multimodal Long Story Generation with Large Language Model [66.37077224696242]
SEED-Story is a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories.
We propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 for training) in a highly efficient autoregressive manner (a generic sketch of the attention-sink idea follows this list).
We present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
arXiv Detail & Related papers (2024-07-11T17:21:03Z)
- MammothModa: Multi-Modal Large Language Model [17.98445238232718]
We introduce MammothModa, yet another multi-modal large language model (MLLM).
MammothModa consistently outperforms the state-of-the-art models, e.g., LLaVA-series, across main real-world visual language benchmarks without bells and whistles.
arXiv Detail & Related papers (2024-06-26T09:17:27Z)
- From Text to Pixel: Advancing Long-Context Understanding in MLLMs [70.78454154014989]
We introduce SEEKER, a multimodal large language model designed to tackle this issue.
SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images.
Our experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach.
arXiv Detail & Related papers (2024-05-23T06:17:23Z)
- Veagle: Advancements in Multimodal Representation Learning [0.0]
This paper introduces a novel approach to enhance the multimodal capabilities of existing models.
Our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works.
Our results indicate an improvement of 5-6% in performance, with Veagle outperforming existing models by a notable margin.
arXiv Detail & Related papers (2024-01-18T12:45:25Z)
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [74.31268379055201]
mPLUG-Owl2 is a versatile multi-modal large language model.
It effectively leverages modality collaboration to improve performance in both text and multi-modal tasks.
arXiv Detail & Related papers (2023-11-07T14:21:29Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
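As referenced in the SEED-Story entry above, the attention-sink idea keeps a few initial tokens plus a sliding window of recent tokens in the key/value cache so that autoregressive generation can run well past the trained context length without the cache growing unboundedly. The sketch below shows only that generic eviction policy with hypothetical names; SEED-Story's multimodal variant is more involved and is not reproduced here.

```python
from typing import List, Tuple

def evict_kv_cache(cache: List[Tuple[int, object]], num_sink: int = 4,
                   window: int = 1024) -> List[Tuple[int, object]]:
    """Generic attention-sink eviction policy (illustrative, not SEED-Story's code).

    cache   : list of (position, key/value entry) pairs in generation order
    num_sink: how many initial tokens are always retained as attention sinks
    window  : how many of the most recent tokens are retained
    """
    if len(cache) <= num_sink + window:
        return cache  # nothing to evict yet
    # keep the first `num_sink` entries and the last `window` entries, drop the middle
    return cache[:num_sink] + cache[-window:]

# usage sketch: simulate a cache growing past the window
cache = [(i, f"kv_{i}") for i in range(2000)]
cache = evict_kv_cache(cache, num_sink=4, window=1024)
print(len(cache))          # 1028
print(cache[0], cache[4])  # (0, 'kv_0') (976, 'kv_976')
```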