On the Perception Bottleneck of VLMs for Chart Understanding
- URL: http://arxiv.org/abs/2503.18435v1
- Date: Mon, 24 Mar 2025 08:33:58 GMT
- Title: On the Perception Bottleneck of VLMs for Chart Understanding
- Authors: Junteng Liu, Weihao Zeng, Xiwen Zhang, Yijun Wang, Zifei Shan, Junxian He
- Abstract summary: Chart understanding requires models to analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, and the extraction bottleneck.
- Score: 17.70892579781301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chart understanding requires models to effectively analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the correct information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. Through comprehensive experiments, we find that (1) the information embedded within visual representations is substantially richer than what is typically captured by linear extractors, such as the widely used retrieval accuracy metric; (2) while instruction tuning effectively enhances the extraction capability of LVLMs, the vision encoder remains a critical bottleneck, demanding focused attention and improvement. Therefore, we further enhance the vision encoder to mitigate the vision encoder bottleneck under a contrastive learning framework. Empirical results demonstrate that our approach significantly mitigates the perception bottleneck and improves the ability of LVLMs to comprehend charts. Code is publicly available at https://github.com/hkust-nlp/Vision4Chart.
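The abstract states that the vision encoder is strengthened under a contrastive learning framework. As a rough, non-authoritative sketch (the authors' actual training recipe lives in the linked Vision4Chart repository), a CLIP-style symmetric InfoNCE objective over chart-caption pairs could look like the following; the encoder modules, embedding shapes, and temperature value are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch of contrastive fine-tuning for a chart vision encoder.
# Illustrative only: encoder names, data layout, and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of (chart image, caption) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching chart/caption pairs sit on the diagonal; pull them together
    # and push mismatched pairs apart in both retrieval directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


def train_step(vision_encoder, text_encoder, optimizer, charts, captions):
    """One hypothetical fine-tuning step; both encoders return (B, D) embeddings."""
    image_emb = vision_encoder(charts)
    text_emb = text_encoder(captions)
    loss = contrastive_loss(image_emb, text_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The symmetric form penalizes both image-to-text and text-to-image mismatches, which is the standard way such contrastive encoders are tuned; what the paper actually pairs with each chart (captions, table renderings, or other positives) is defined by the authors' released code, not by this sketch.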
Related papers
- Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes [0.0]
Vision-Aware Retrieval-Augmented Prompting (VRAP) is a generative approach that enhances Large Vision-Language Models.
VRAP achieves state-of-the-art performance in fine-grained reasoning and multimodal understanding.
arXiv Detail & Related papers (2024-12-16T02:52:19Z)
- Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge [24.538839144639653]
Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components.
These models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM).
arXiv Detail & Related papers (2024-11-25T18:33:14Z)
- A-VL: Adaptive Attention for Large Vision-Language Models [10.027871150748956]
Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential.
Current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models.
We observe that LVLMs generate responses from both remote image tokens and local text tokens, and different modalities have different attention patterns.
We develop A-VL, a plug-and-play adaptive attention tailored for LVLM inference.
arXiv Detail & Related papers (2024-09-23T09:22:59Z)
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach that reduces vision compute by letting redundant vision tokens skip layers rather than by decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
- LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models [9.936172224069036]
We introduce a Scene Graph Expression (SGE) module into large vision-language models (VLMs).
The SGE module extracts and structurally expresses the complex semantic information within images.
Experiments demonstrate that integrating our SGE module significantly enhances the VLM's performance in vision-language tasks.
arXiv Detail & Related papers (2024-08-29T02:43:20Z)
- Instruction Tuning-free Visual Token Complement for Multimodal LLMs [51.138806401996696]
Multimodal large language models (MLLMs) have promised an elegant bridge between vision and language.
We propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features.
Our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens.
arXiv Detail & Related papers (2024-08-09T12:13:01Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
- Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense [98.70218717851665]
Due to limited evaluation data resources, it is unclear whether the models really understand the visual scene and the underlying commonsense knowledge.
We present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge.
We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation.
arXiv Detail & Related papers (2022-11-10T21:44:33Z)
- Off-policy Imitation Learning from Visual Inputs [83.22342811160114]
We propose OPIfVI, which combines off-policy learning, data augmentation, and encoder techniques.
We show that OPIfVI is able to achieve expert-level performance and outperform existing baselines.
arXiv Detail & Related papers (2021-11-08T09:06:12Z)
- Visualization Techniques to Enhance Automated Event Extraction [0.0]
This case study seeks to identify potential triggers of state-led mass killings from news articles using NLP.
We demonstrate how visualizations can aid in each stage, from exploratory analysis of raw data, to machine learning training analysis, and finally post-inference validation.
arXiv Detail & Related papers (2021-06-11T19:24:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.