Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement
- URL: http://arxiv.org/abs/2602.10138v1
- Date: Sun, 08 Feb 2026 12:59:50 GMT
- Title: Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement
- Authors: Zhihang Yi, Jian Zhao, Jiancheng Lv, Tao Wang
- Abstract summary: Multimodal Large Language Models (MLLMs) are transforming chart information fusion. This survey aims to equip researchers and practitioners with a structured understanding of that transformation.
- Score: 25.08967298618286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chart understanding is a quintessential information fusion task, requiring the seamless integration of graphical and textual data to extract meaning. The advent of Multimodal Large Language Models (MLLMs) has revolutionized this domain, yet the landscape of MLLM-based chart analysis remains fragmented and lacks systematic organization. This survey provides a comprehensive roadmap of this nascent frontier by structuring the domain's core components. We begin by analyzing the fundamental challenges of fusing visual and linguistic information in charts. We then categorize downstream tasks and datasets, introducing a novel taxonomy of canonical and non-canonical benchmarks to highlight the field's expanding scope. Subsequently, we present a comprehensive evolution of methodologies, tracing the progression from classic deep learning techniques to state-of-the-art MLLM paradigms that leverage sophisticated fusion strategies. By critically examining the limitations of current models, particularly their perceptual and reasoning deficits, we identify promising future directions, including advanced alignment techniques and reinforcement learning for cognitive enhancement. This survey aims to equip researchers and practitioners with a structured understanding of how MLLMs are transforming chart information fusion and to catalyze progress toward more robust and reliable systems.
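To make the fusion setting concrete, below is a minimal, hypothetical sketch of the projector-style early fusion popularized by LLaVA-like MLLMs, which chart-understanding models commonly adopt: patch features from a vision encoder are mapped into the LLM's embedding space and concatenated with the text tokens. The class name, dimensions, and random inputs are illustrative assumptions, not details from the survey.

```python
import torch
import torch.nn as nn

class ChartFusionSketch(nn.Module):
    """Illustrative projector-style fusion: chart-image patch features are
    projected into the LLM embedding space and prepended to text tokens."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Two-layer MLP projector, as in LLaVA-style architectures (illustrative sizes).
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, chart_patches, text_embeddings):
        # chart_patches: (batch, num_patches, vision_dim) from a vision encoder
        # text_embeddings: (batch, num_tokens, llm_dim) from the LLM embedding table
        visual_tokens = self.projector(chart_patches)
        # Early fusion: the LLM then attends jointly over visual and textual tokens.
        return torch.cat([visual_tokens, text_embeddings], dim=1)

fused = ChartFusionSketch()(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 608, 4096])
```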
Related papers
- The Paradigm Shift: A Comprehensive Survey on Large Vision Language Models for Multimodal Fake News Detection [35.503099074709006]
In recent years, the rapid evolution of large vision-language models (LVLMs) has driven a paradigm shift in multimodal fake news detection (MFND). We present a historical perspective, mapping the field's evolution to foundation model paradigms, and discuss the remaining technical challenges, including interpretability, temporal reasoning, and domain generalization. We outline future research directions to guide the next stage of this paradigm shift.
arXiv Detail & Related papers (2026-01-16T02:40:16Z)
- Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention [7.511262066889113]
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding. We perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. We introduce a training-free contrastive attention framework that models the transformation between early fusion and final layers to highlight meaningful attention shifts.
arXiv Detail & Related papers (2026-01-13T02:26:21Z)
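The abstract does not give the exact formulation, but the idea of contrasting early-fusion and final-layer attention can be sketched as a simple log-ratio over text-to-visual attention maps. The function below is a hypothetical illustration under that assumption, not the paper's method.

```python
import torch

def contrastive_attention_shift(attn_early, attn_late, eps=1e-8):
    """Hypothetical sketch: contrast text-to-visual attention between an
    early-fusion layer and a final layer to surface meaningful shifts.
    Inputs: (num_text_tokens, num_visual_tokens), rows summing to 1."""
    # Normalize defensively, then take the log-ratio of late vs. early attention.
    early = attn_early / (attn_early.sum(-1, keepdim=True) + eps)
    late = attn_late / (attn_late.sum(-1, keepdim=True) + eps)
    # Positive entries mark visual tokens attended to more as fusion deepens.
    return torch.log((late + eps) / (early + eps))

early = torch.softmax(torch.randn(8, 576), dim=-1)  # stand-in: an early layer
late = torch.softmax(torch.randn(8, 576), dim=-1)   # stand-in: the final layer
print(contrastive_attention_shift(early, late).shape)  # torch.Size([8, 576])
```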
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
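As a rough illustration of the Multimodal RAG pattern described above, the sketch below retrieves the document elements (text chunks, tables, chart crops) most similar to a query from a shared embedding index before handing them to an MLLM. The embeddings are random stand-ins and `cosine_top_k` is a hypothetical helper, not an API from the paper.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=3):
    """Retrieve the k most similar document elements by cosine similarity."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    return np.argsort(-sims)[:k]

# Hypothetical index: each row embeds one document element (text chunk, table,
# or chart crop) with a shared multimodal encoder, e.g. a CLIP-style model.
elements = ["text: Q3 revenue rose 12%", "table: revenue by region", "chart: revenue trend"]
doc_vecs = np.random.randn(len(elements), 512)  # stand-in for real embeddings
query_vec = np.random.randn(512)                # stand-in for the encoded question

for idx in cosine_top_k(query_vec, doc_vecs, k=2):
    # Retrieved elements (including chart images) would be passed to an MLLM
    # alongside the question, grounding the generated answer in the document.
    print(elements[idx])
```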
- The Evolution of Video Anomaly Detection: A Unified Framework from DNN to MLLM [27.800308082023285]
Video anomaly detection (VAD) aims to identify and ground anomalous behaviors or events in videos. The continuous evolution of deep model architectures has driven innovation in VAD methodologies. The rapid development of multi-modal large language models (MLLMs) and large language models (LLMs) has introduced new opportunities and challenges to the VAD field.
arXiv Detail & Related papers (2025-07-29T10:07:24Z)
- Abstractive Visual Understanding of Multi-modal Structured Knowledge: A New Perspective for MLLM Evaluation [48.462734327375536]
Multi-modal large language models (MLLMs) incorporate heterogeneous modalities into LLMs, enabling a comprehensive understanding of diverse scenarios and objects. Despite the proliferation of evaluation benchmarks and leaderboards for MLLMs, they predominantly overlook the critical capacity of MLLMs to comprehend world knowledge with structured abstractions that appear in visual form. We propose M3STR, an innovative benchmark grounded in the Multi-Modal Map for STRuctured understanding. Our findings reveal persistent deficiencies in processing abstractive visual information with structured knowledge, thereby charting a pivotal trajectory for advancing MLLMs' holistic reasoning capacities.
arXiv Detail & Related papers (2025-06-02T04:00:35Z)
- How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective [64.00022624183781]
Large language models (LLMs) can assess relevance and support information retrieval (IR) tasks. We investigate how different LLM modules contribute to relevance judgment through the lens of mechanistic interpretability.
arXiv Detail & Related papers (2025-04-10T16:14:55Z)
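One standard mechanistic-interpretability tool that the paper's question suggests is zero-ablation: remove one module's contribution to the residual stream and observe how the model's relevance score moves. The toy network below is purely illustrative, far simpler than a real LLM, and makes no claim about the paper's actual procedure.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-block network standing in for an LLM; each block writes into a
# residual stream, mirroring how ablation studies localize module contributions.
blocks = nn.ModuleList([nn.Linear(16, 16) for _ in range(2)])
head = nn.Linear(16, 1)  # stand-in "relevance" readout

def score(x, ablate=None):
    for i, block in enumerate(blocks):
        if i != ablate:       # zero-ablate the chosen module's contribution
            x = x + block(x)  # residual update, as in transformer blocks
    return head(x).mean().item()

x = torch.randn(4, 16)
baseline = score(x)
for i in range(len(blocks)):
    # A large drop suggests the module matters for the relevance judgment.
    print(f"block {i}: delta score = {score(x, ablate=i) - baseline:+.4f}")
```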
- Graph Foundation Models for Recommendation: A Comprehensive Survey [55.70529188101446]
Graph neural networks (GNNs) model graph-structured data, while large language models (LLMs) are designed to process and comprehend natural language, making both approaches highly effective and widely adopted. Recent research has focused on graph foundation models (GFMs), which integrate the strengths of GNNs and LLMs to model complex recommender system (RS) problems more efficiently by leveraging the graph-based structure of user-item relationships alongside textual understanding.
arXiv Detail & Related papers (2025-02-12T12:13:51Z)
- From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models [56.9134620424985]
Cross-modal reasoning (CMR) is increasingly recognized as a crucial capability in the progression toward more sophisticated artificial intelligence systems.
The recent trend of deploying Large Language Models (LLMs) to tackle CMR tasks has marked a new mainstream approach to enhancing CMR effectiveness.
This survey offers a nuanced exposition of current methodologies applied in CMR using LLMs, classifying these into a detailed three-tiered taxonomy.
arXiv Detail & Related papers (2024-09-19T02:51:54Z)
- LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
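The roofline model itself is simple enough to state in a few lines: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The sketch below uses roughly A100-class numbers purely for illustration; the survey's own framework and figures are not reproduced here.

```python
def roofline_attainable_tflops(peak_tflops, mem_bw_gbs, arithmetic_intensity):
    """Roofline model: throughput is capped by either peak compute or by
    memory bandwidth x arithmetic intensity (FLOPs per byte)."""
    return min(peak_tflops, mem_bw_gbs * arithmetic_intensity / 1000.0)

# Illustrative, roughly A100-class numbers: 312 TFLOPS peak, 1555 GB/s HBM.
peak, bw = 312.0, 1555.0
# Decode-time GEMV in LLM inference is memory-bound (~1 FLOP per byte loaded),
# while large-batch prefill GEMMs reach far higher arithmetic intensity.
for name, intensity in [("decode (GEMV)", 1.0), ("prefill (GEMM)", 300.0)]:
    print(f"{name}: {roofline_attainable_tflops(peak, bw, intensity):.1f} TFLOPS attainable")
```

Run as written, the decode case lands at about 1.6 TFLOPS (bandwidth-bound), while the prefill case hits the 312 TFLOPS compute roof, which is exactly the kind of bottleneck diagnosis the survey's framework targets.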
- Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [31.71954519657729]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs. We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, and its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z)