Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques
- URL: http://arxiv.org/abs/2506.04788v1
- Date: Thu, 05 Jun 2025 09:14:41 GMT
- Title: Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques
- Authors: Jisu An, Junseok Lee, Jeoungeun Lee, Yongseok Son
- Abstract summary: Multimodal Large Language Models (MLLMs) combine pre-trained LLMs with various modality encoders. This integration requires a systematic understanding of how different modalities connect to the language backbone. We examine methods for transforming and aligning diverse modal inputs into the language embedding space.
- Score: 2.9061423802698565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) has transformed the AI landscape. These models combine pre-trained LLMs with various modality encoders. This integration requires a systematic understanding of how different modalities connect to the language backbone. Our survey presents an LLM-centric analysis of current approaches. We examine methods for transforming and aligning diverse modal inputs into the language embedding space. This addresses a significant gap in existing literature. We propose a classification framework for MLLMs based on three key dimensions. First, we examine architectural strategies for modality integration. This includes both the specific integration mechanisms and the fusion level. Second, we categorize representation learning techniques as either joint or coordinate representations. Third, we analyze training paradigms, including training strategies and objective functions. By examining 125 MLLMs developed between 2021 and 2025, we identify emerging patterns in the field. Our taxonomy provides researchers with a structured overview of current integration techniques. These insights aim to guide the development of more robust multimodal integration strategies for future models built on pre-trained foundations.
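The abstract's central operation, transforming modality encoder outputs into the language embedding space, is commonly realized with a learned projector between a frozen vision encoder and the LLM, fused at the input level. A minimal numpy sketch of that pattern follows; all dimensions, names, and the linear-projector choice are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper).
d_vision, d_llm, n_patches, n_text = 768, 4096, 16, 8

# Frozen vision-encoder output: one feature vector per image patch.
patch_features = rng.standard_normal((n_patches, d_vision))

# Learned linear projector mapping vision features into the LLM embedding space.
W_proj = rng.standard_normal((d_vision, d_llm)) * 0.02
b_proj = np.zeros(d_llm)
visual_tokens = patch_features @ W_proj + b_proj   # (n_patches, d_llm)

# Text token embeddings from the LLM's own embedding table (random stand-ins here).
text_tokens = rng.standard_normal((n_text, d_llm))

# Input-level (early) fusion: prepend projected visual tokens to the text
# sequence before it enters the transformer stack.
fused_sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(fused_sequence.shape)  # (24, 4096)
```

In practice the projector may be a single linear layer, an MLP, or a cross-attention module; the survey's taxonomy distinguishes such integration mechanisms and the level at which fusion occurs.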
Related papers
- Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs). We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
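The VQ variants this survey categorizes share one core step: each continuous feature is replaced by the index of its nearest codebook entry. A small numpy sketch of that lookup, with codebook size and dimensions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

codebook_size, d = 32, 8                   # illustrative sizes
codebook = rng.standard_normal((codebook_size, d))
features = rng.standard_normal((100, d))   # continuous encoder outputs

# Nearest-codebook lookup: squared Euclidean distance to every entry.
dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)              # discrete token ids, shape (100,)
quantized = codebook[tokens]               # quantized vectors fed onward

# "Codebook collapse" manifests as only a small fraction of entries
# ever being selected; counting distinct ids is a crude diagnostic.
usage = np.unique(tokens).size
print(tokens.shape, quantized.shape, usage)
```

The argmin is non-differentiable, which is why training relies on tricks such as straight-through gradient estimation, one of the instability sources the survey highlights.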
arXiv Detail & Related papers (2025-07-21T10:52:14Z) - From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems [6.284317913684068]
Compound AI Systems (CAIS) is an emerging paradigm that integrates large language models (LLMs) with external components, such as retrievers, agents, tools, and orchestrators. Despite growing adoption in both academia and industry, the CAIS landscape remains fragmented, lacking a unified framework for analysis, taxonomy, and evaluation. This survey aims to provide researchers and practitioners with a comprehensive foundation for understanding, developing, and advancing the next generation of system-level artificial intelligence.
arXiv Detail & Related papers (2025-06-05T02:34:43Z) - Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
arXiv Detail & Related papers (2025-03-28T14:08:40Z) - Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
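The discriminative paradigm contrasted here (CLIP-style) scores image–text pairs by the similarity of their normalized embeddings, which is what enables zero-shot classification and retrieval. A minimal numpy sketch of that scoring step; the embeddings are random stand-ins and the temperature value is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64  # illustrative embedding dimension

def normalize(x):
    """L2-normalize rows so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for encoder outputs: one image, three candidate captions.
image_emb = normalize(rng.standard_normal((1, d)))
text_embs = normalize(rng.standard_normal((3, d)))

# CLIP-style scoring: cosine similarity scaled by a temperature,
# then softmax over the candidate captions.
temperature = 0.07
logits = (image_emb @ text_embs.T) / temperature   # (1, 3)
probs = np.exp(logits) / np.exp(logits).sum()
predicted = int(probs.argmax())
print(predicted, probs.shape)
```

Generative training, by contrast, optimizes next-token prediction conditioned on the image; the unified approach this paper proposes combines both objectives rather than choosing one.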
arXiv Detail & Related papers (2024-11-01T01:51:31Z) - Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities [89.40778301238642]
Model merging is an efficient empowerment technique in the machine learning community.
There is a significant gap in the literature regarding a systematic and thorough review of these techniques.
arXiv Detail & Related papers (2024-08-14T16:58:48Z) - LLMs Meet Multimodal Generation and Editing: A Survey [89.76691959033323]
This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio.
We summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods.
We dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction.
arXiv Detail & Related papers (2024-05-29T17:59:20Z) - Advancing Graph Representation Learning with Large Language Models: A Comprehensive Survey of Techniques [37.60727548905253]
The integration of Large Language Models (LLMs) with Graph Representation Learning (GRL) marks a significant evolution in analyzing complex data structures.
This collaboration harnesses the sophisticated linguistic capabilities of LLMs to improve the contextual understanding and adaptability of graph models.
Despite a growing body of research dedicated to integrating LLMs into the graph domain, a comprehensive review that deeply analyzes the core components and operations is notably lacking.
arXiv Detail & Related papers (2024-02-04T05:51:14Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z) - Self-Supervised Multimodal Learning: A Survey [23.526389924804207]
Multimodal learning aims to understand and analyze information from multiple modalities.
The heavy dependence on data paired with expensive human annotations impedes scaling up models.
Given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck.
arXiv Detail & Related papers (2023-03-31T16:11:56Z) - Multimodality in Meta-Learning: A Comprehensive Survey [34.69292359136745]
This survey provides a comprehensive overview of the multimodality-based meta-learning landscape.
We first formalize the definition of meta-learning and multimodality, along with the research challenges in this growing field.
We then propose a new taxonomy to systematically discuss typical meta-learning algorithms combined with multimodal tasks.
arXiv Detail & Related papers (2021-09-28T09:16:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.