Related papers: Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

URL: http://arxiv.org/abs/2412.05939v1
Date: Sun, 08 Dec 2024 13:45:44 GMT
Title: Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
Authors: Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan,
Abstract summary: We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs.<n>Our analyses reveal that multi-grained concept annotations integrate and complement each other, under our structured template and a general MLLM framework.<n>We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image--caption data on 12 multimodal comprehension and generation benchmarks.
Score: 55.25892137362187
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) excel in vision--language tasks by pre-training solely on coarse-grained concept annotations (e.g., image captions). We hypothesize that integrating fine-grained concept annotations (e.g., object labels and object regions) will further improve performance, as both data granularities complement each other in terms of breadth and depth in concept representation. We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we explore the impact of different data recipes on multimodal comprehension and generation. Our analyses reveal that multi-grained concept annotations integrate and complement each other, under our structured template and a general MLLM framework. We clearly explore and demonstrate the potential of MMGiC to help MLLMs better locate and learn concepts, aligning vision and language at multiple granularities. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image--caption data on 12 multimodal comprehension and generation benchmarks, e.g., their appropriate combination achieve 3.95% and 2.34% absolute improvements over image--caption data alone on POPE and SEED-Bench. Code, data and models will be available at https://github.com/LooperXX/MMGiC.

Related papers

Graph-MLLM: Harnessing Multimodal Large Language Models for Multimodal Graph Learning [23.089644598166885]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in representing and understanding diverse modalities.<n>Integrating multimodality with structured graph information (i.e., multimodal graphs, MMGs) is essential for real-world applications such as social networks, healthcare, and recommendation systems.<n>Existing MMG learning methods fall into three paradigms based on how they leverage MLLMs.
arXiv Detail & Related papers (2025-06-12T01:44:46Z)
Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos [53.723410664944566]
We present Perceive Anything Model (PAM), a framework for comprehensive region-level visual understanding in images and videos.<n>Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation.<n>A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features into multi-modal tokens.
arXiv Detail & Related papers (2025-06-05T17:51:39Z)
TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models [23.916205754112774]
Multimodal Large Language Models (MLLMs) have shown remarkable versatility in understanding diverse multimodal data and tasks. We propose TAMP, a simple yet effective pruning framework tailored for MLLMs. We validate our method on two state-of-the-art MLLMs: LLaVA-NeXT, designed for vision-language tasks, and VideoLLaMA2, capable of processing audio, visual, and language modalities.
arXiv Detail & Related papers (2025-04-14T05:44:38Z)
GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs [34.076036577516895]
Texts and images are usually interconnected, forming a multimodal attributed graph (MMAG)<n>It is underexplored how MLLMs can incorporate the relational information (textiti.e., graph structure) and semantic information (textiti.e. texts and images) on such graphs for multimodal comprehension and generation.<n>We propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs.
arXiv Detail & Related papers (2025-02-17T15:35:36Z)
NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks. Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored. We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models [97.40590590880144]
We develop an extensive Multimodality Large Language Model (MLLM) series. We assemble a comprehensive dataset covering publicly available resources in language, vision, and vision-language tasks. We obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities.
arXiv Detail & Related papers (2024-02-08T18:59:48Z)
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. Based on our proposed joint mixing, we propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. We hope our work may cast a light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z)
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
Muffin framework employs pre-trained vision-language models to act as providers of visual signals. UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs) This integration promotes a more detailed comprehension of images for the MLLM. We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.