Graph-Based Multimodal Contrastive Learning for Chart Question Answering
- URL: http://arxiv.org/abs/2501.04303v2
- Date: Mon, 07 Apr 2025 15:45:13 GMT
- Title: Graph-Based Multimodal Contrastive Learning for Chart Question Answering
- Authors: Yue Dai, Soyeon Caren Han, Wei Liu
- Abstract summary: This work introduces a novel joint multimodal scene graph framework that explicitly models the relationships among chart components and their underlying structures. The framework integrates both visual and textual graphs to capture structural and semantic characteristics. A graph contrastive learning strategy aligns node representations across modalities, enabling their seamless incorporation into a transformer decoder as soft prompts.
- Score: 11.828192162922436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chart question answering (ChartQA) is challenged by the heterogeneous composition of chart elements and the subtle data patterns they encode. This work introduces a novel joint multimodal scene graph framework that explicitly models the relationships among chart components and their underlying structures. The framework integrates both visual and textual graphs to capture structural and semantic characteristics, while a graph contrastive learning strategy aligns node representations across modalities, enabling their seamless incorporation into a transformer decoder as soft prompts. Moreover, a set of tailored Chain of Thought (CoT) prompts is proposed to enhance multimodal large language models (MLLMs) in zero-shot scenarios by mitigating hallucinations. Extensive evaluations on benchmarks including ChartQA, OpenCQA, and ChartX demonstrate significant performance improvements and validate the efficacy of the proposed approach.
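The abstract's core mechanism can be pictured concretely: matched visual and textual graph nodes are pulled together with a contrastive objective, and the aligned embeddings are projected into soft prompt tokens for the decoder. Below is a minimal sketch of that idea; the InfoNCE form, the mean-pooling of the two modalities, and all dimensions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def node_contrastive_loss(visual_nodes, textual_nodes, temperature=0.07):
    """InfoNCE over aligned node pairs: row i of each modality is a positive
    pair; every other node in the batch serves as a negative."""
    v = F.normalize(visual_nodes, dim=-1)   # (N, d)
    t = F.normalize(textual_nodes, dim=-1)  # (N, d)
    logits = v @ t.T / temperature          # (N, N) cosine similarities
    targets = torch.arange(v.size(0))
    # Symmetric loss: visual->textual and textual->visual directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Aligned node embeddings become soft prompts for a transformer decoder.
N, d_node, d_model, n_prompts = 16, 128, 512, 8
visual_nodes = torch.randn(N, d_node)
textual_nodes = torch.randn(N, d_node)
loss = node_contrastive_loss(visual_nodes, textual_nodes)

# One simple (assumed) choice: pool the two modalities, project to the
# decoder width, and keep a fixed number of prompt vectors.
pool = torch.nn.Linear(d_node, d_model)
prompts = pool((visual_nodes + textual_nodes) / 2)[:n_prompts]  # (n_prompts, d_model)
```

The prompt vectors would then be prepended to the decoder's input embeddings, which is what "soft prompts" typically means in this setting.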
Related papers
- Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding [14.75820681491341]
Existing benchmarks reveal reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning.
We propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics representations.
Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance.
arXiv Detail & Related papers (2025-04-14T00:07:39Z)
- RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning [63.599057862999]
RefChartQA is a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding.
Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%.
arXiv Detail & Related papers (2025-03-29T15:50:08Z)
- End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models [0.0]
This paper introduces End-to-End Visual Chain-of-Thought (V-CoT) for chart summarization.
Our method directly trains an LVLM to process chart images and generate textual summaries in an end-to-end fashion.
We incorporate a visual Chain-of-Thought mechanism through instruction fine-tuning, implicitly guiding the LVLM to perform visual reasoning steps.
arXiv Detail & Related papers (2025-02-24T19:13:45Z)
- ChartAdapter: Large Vision-Language Model for Chart Summarization [13.499376163294816]
ChartAdapter is a lightweight transformer module designed to bridge the gap between charts and textual summaries.
By integrating ChartAdapter with an LLM, we enable end-to-end training and efficient chart summarization.
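One plausible shape for such a lightweight bridging module is a small set of learned queries that cross-attend to chart features and emit embeddings in the LLM's input space. The sketch below shows that general Q-Former-style pattern only; it is not ChartAdapter's actual architecture, and all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class ChartAdapterSketch(nn.Module):
    """Hypothetical lightweight bridge: learned queries cross-attend to chart
    features and emit a short sequence of LLM-space embeddings."""
    def __init__(self, chart_dim=768, llm_dim=4096, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, chart_dim) * 0.02)
        self.attn = nn.MultiheadAttention(chart_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(chart_dim, llm_dim)

    def forward(self, chart_feats):            # chart_feats: (B, L, chart_dim)
        B = chart_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.attn(q, chart_feats, chart_feats)
        return self.proj(out)                  # (B, n_queries, llm_dim), prepended to LLM inputs

adapter = ChartAdapterSketch()
tokens = adapter(torch.randn(2, 196, 768))     # e.g. ViT patch features
```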
arXiv Detail & Related papers (2024-12-30T05:07:34Z)
- InstructG2I: Synthesizing Images from Multimodal Attributed Graphs [50.852150521561676]
We propose a graph context-conditioned diffusion model called InstructG2I.
InstructG2I first exploits the graph structure and multimodal information to conduct informative neighbor sampling.
A Graph-QFormer encoder adaptively encodes the graph nodes into an auxiliary set of graph prompts to guide the denoising process.
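The "informative neighbor sampling" step can be illustrated with a simple similarity-based selection. The paper's actual procedure also exploits graph structure, so treat this as a hedged simplification with hypothetical names; only the selected neighbors would then be encoded by the Graph-QFormer.

```python
import torch
import torch.nn.functional as F

def sample_informative_neighbors(center, neighbor_feats, k=4):
    """Keep the k neighbors whose multimodal embeddings are most similar to
    the center node (similarity half of the sampling only; a sketch)."""
    sims = F.cosine_similarity(center.unsqueeze(0), neighbor_feats, dim=-1)
    topk = sims.topk(min(k, neighbor_feats.size(0))).indices
    return neighbor_feats[topk]            # (k, d), fed to a Graph-QFormer

center = torch.randn(256)
neighbors = torch.randn(20, 256)
picked = sample_informative_neighbors(center, neighbors)
```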
arXiv Detail & Related papers (2024-10-09T17:56:15Z)
- MSG-Chart: Multimodal Scene Graph for ChartQA [11.828192162922436]
Automatic Chart Question Answering (ChartQA) is challenging due to the complex distribution of chart elements and the fact that patterns of the underlying data are not explicitly displayed in charts.
We design a joint multimodal scene graph for charts to explicitly represent the relationships between chart elements and their patterns.
Our proposed multimodal scene graph includes a visual graph and a textual graph to jointly capture the structural and semantic knowledge from the chart.
arXiv Detail & Related papers (2024-08-09T04:11:23Z)
- TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning [83.58521787193293]
We present TinyChart, an efficient MLLM for chart understanding with only 3B parameters.
TinyChart overcomes two key challenges in efficient chart understanding: (1) it reduces the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, and (2) it shortens the lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module.
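The token-merging side of this can be pictured with a ToMe-style reduction that averages the most similar adjacent token pairs. The sketch below is a simplified stand-in, not TinyChart's actual Vision Token Merging module; the pairing scheme and the merge count r are assumptions.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(x, r):
    """Shorten a (B, N, d) token sequence by r tokens: average the r most
    similar adjacent pairs (simplified ToMe-style sketch)."""
    B, N, d = x.shape
    a, b = x[:, ::2], x[:, 1::2]                              # adjacent pairs
    sims = F.cosine_similarity(a[:, : b.size(1)], b, dim=-1)  # (B, N//2)
    merge_idx = sims.topk(r, dim=-1).indices                  # pairs to merge
    out = []
    for i in range(B):                    # explicit loop for clarity
        keep = torch.ones(N, dtype=torch.bool)
        merged = []
        for j in merge_idx[i].tolist():
            keep[2 * j], keep[2 * j + 1] = False, False
            merged.append((x[i, 2 * j] + x[i, 2 * j + 1]) / 2)
        out.append(torch.cat([x[i][keep], torch.stack(merged)], dim=0))
    return torch.stack(out)               # (B, N - r, d)

tokens = torch.randn(2, 196, 64)
shorter = merge_similar_tokens(tokens, r=32)   # 196 -> 164 tokens
```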
arXiv Detail & Related papers (2024-04-25T14:23:24Z)
- ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning [54.82612435284695]
We benchmark the ability of off-the-shelf Multi-modal Large Language Models (MLLMs) in the chart domain. We construct ChartX, a multi-modal evaluation set covering 18 chart types, 7 chart tasks, 22 disciplinary topics, and high-quality chart data. We develop ChartVLM to offer a new perspective on handling multi-modal tasks that strongly depend on interpretable patterns.
arXiv Detail & Related papers (2024-02-19T14:48:23Z)
- StructChart: On the Schema, Metric, and Augmentation for Visual Chart Understanding [54.45681512355684]
Current chart-related tasks focus either on chart perception, which extracts information from visual charts, or on chart reasoning over the extracted data.
We introduce StructChart, a novel framework that leverages Structured Triplet Representations (STR) to achieve a unified and label-efficient approach.
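A triplet-based chart representation might look like the following; the field names and the example linearization are purely illustrative assumptions about what an STR-style record could contain.

```python
from dataclasses import dataclass

@dataclass
class ChartTriplet:
    """One STR-style fact extracted from a chart; field names are illustrative."""
    subject: str    # e.g. a category or series name
    relation: str   # e.g. "value", "title", "axis"
    value: str

# A bar chart "Sales by Quarter" might linearize to triplets like:
triplets = [
    ChartTriplet("Q1", "value", "120"),
    ChartTriplet("Q2", "value", "135"),
    ChartTriplet("chart", "title", "Sales by Quarter"),
]
```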
arXiv Detail & Related papers (2023-09-20T12:51:13Z)
- r-GAT: Relational Graph Attention Network for Multi-Relational Graphs [8.529080554172692]
The Graph Attention Network (GAT) models only simple undirected, single-relational graph data.
We propose r-GAT, a relational graph attention network to learn multi-channel entity representations.
Experiments on link prediction and entity classification tasks show that our r-GAT can model multi-relational graphs effectively.
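A minimal sketch of per-relation ("multi-channel") attention: each relation type gets its own projection and attention vector, and messages are normalized over each node's incoming edges. This follows the general relational-GAT pattern rather than r-GAT's exact formulation; shapes and initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationalAttentionSketch(nn.Module):
    """Per-relation attention: one projection W_r and attention vector a_r
    per relation type; messages are softmax-normalized per destination node."""
    def __init__(self, in_dim, out_dim, n_relations):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_relations, in_dim, out_dim) * 0.1)
        self.a = nn.Parameter(torch.randn(n_relations, 2 * out_dim) * 0.1)

    def forward(self, x, edges, rel):   # x: (N, d); edges: (E, 2); rel: (E,)
        N = x.size(0)
        src, dst = edges[:, 0], edges[:, 1]
        h_src = torch.einsum('ed,edo->eo', x[src], self.W[rel])  # per-edge projection
        h_dst = torch.einsum('ed,edo->eo', x[dst], self.W[rel])
        e = F.leaky_relu((torch.cat([h_src, h_dst], -1) * self.a[rel]).sum(-1))
        alpha = torch.exp(e - e.max())                      # unnormalized scores
        denom = torch.zeros(N).index_add_(0, dst, alpha)    # per-node normalizer
        weighted = h_src * (alpha / denom[dst]).unsqueeze(-1)
        return torch.zeros(N, h_src.size(-1)).index_add_(0, dst, weighted)

layer = RelationalAttentionSketch(16, 32, n_relations=3)
x = torch.randn(5, 16)
edges = torch.tensor([[0, 1], [2, 1], [3, 4]])
rel = torch.tensor([0, 2, 1])
out = layer(x, edges, rel)   # (5, 32)
```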
arXiv Detail & Related papers (2021-09-13T12:43:00Z)
- Group Contrastive Self-Supervised Learning on Graphs [101.45974132613293]
We study self-supervised learning on graphs using contrastive methods.
We argue that contrasting graphs in multiple subspaces enables graph encoders to capture more abundant characteristics.
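The "multiple subspaces" idea can be sketched as several projection heads, each defining its own contrastive view of the same pair of graph embeddings, with the per-subspace InfoNCE losses averaged. Head count, dimensions, and temperature below are assumptions.

```python
import torch
import torch.nn.functional as F

def multi_subspace_contrast(z1, z2, projections, temperature=0.5):
    """Contrast two augmented views of the same graphs in several learned
    subspaces; projection heads are assumed, not the paper's exact design."""
    total = 0.0
    for proj in projections:                       # one head per subspace
        p1 = F.normalize(proj(z1), dim=-1)
        p2 = F.normalize(proj(z2), dim=-1)
        logits = p1 @ p2.T / temperature
        targets = torch.arange(z1.size(0))
        total = total + F.cross_entropy(logits, targets)
    return total / len(projections)

B, d = 8, 64
heads = [torch.nn.Linear(d, 32) for _ in range(3)]   # three subspaces
loss = multi_subspace_contrast(torch.randn(B, d), torch.randn(B, d), heads)
```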
arXiv Detail & Related papers (2021-07-20T22:09:21Z)
- Multi-view Graph Learning by Joint Modeling of Consistency and Inconsistency [65.76554214664101]
Graph learning has emerged as a promising technique for multi-view clustering with its ability to learn a unified and robust graph from multiple views.
We propose a new multi-view graph learning framework, which for the first time simultaneously models multi-view consistency and multi-view inconsistency in a unified objective function.
Experiments on twelve multi-view datasets have demonstrated the robustness and efficiency of the proposed approach.
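One common way to formalize joint consistency/inconsistency modeling is to decompose each view's graph into a shared part plus a sparse view-specific residual inside a single objective. The sketch below illustrates that shape under stated assumptions; it is not the paper's exact objective function.

```python
import torch

def multiview_objective(S_views, A, E_views, beta=0.1):
    """Each view's similarity graph S_v is decomposed into a shared consistent
    graph A plus a view-specific inconsistent part E_v, with a sparsity
    penalty on E_v (illustrative form only)."""
    loss = 0.0
    for S_v, E_v in zip(S_views, E_views):
        loss = loss + ((S_v - (A + E_v)) ** 2).sum()   # per-view reconstruction
        loss = loss + beta * E_v.abs().sum()           # keep inconsistency sparse
    return loss

n = 6
S_views = [torch.rand(n, n) for _ in range(3)]
A = torch.rand(n, n, requires_grad=True)
E_views = [torch.zeros(n, n, requires_grad=True) for _ in range(3)]
loss = multiview_objective(S_views, A, E_views)
loss.backward()   # A and the E_v would be optimized jointly
```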
arXiv Detail & Related papers (2020-08-24T06:11:29Z)
- Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization [77.21951145754065]
We propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph.
Our CSMGAN is able to effectively capture high-order interactions between the two modalities, thus enabling more precise localization.
arXiv Detail & Related papers (2020-08-04T08:25:24Z)
- Tensor Graph Convolutional Networks for Multi-relational and Robust Learning [74.05478502080658]
This paper introduces a tensor-graph convolutional network (TGCN) for scalable semi-supervised learning (SSL) from data associated with a collection of graphs represented by a tensor.
The proposed architecture achieves markedly improved performance relative to standard GCNs, copes with state-of-the-art adversarial attacks, and leads to remarkable SSL performance over protein-to-protein interaction networks.
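A one-layer reading of the tensor-graph convolution: node features are aggregated over every relational slice A_r of the adjacency tensor with per-slice weights, then combined. This is a simplified sketch with assumed shapes, not the paper's full architecture.

```python
import torch
import torch.nn as nn

class TensorGCNSketch(nn.Module):
    """One tensor-graph convolution step: sum A_r @ X @ W_r over relation
    slices r, then average and apply a nonlinearity (simplified sketch)."""
    def __init__(self, in_dim, out_dim, n_slices):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_slices, in_dim, out_dim) * 0.1)

    def forward(self, A, X):     # A: (R, N, N) adjacency tensor; X: (N, in_dim)
        return torch.relu(torch.einsum('rnm,md,rdo->no', A, X, self.W) / A.size(0))

R, N = 3, 10
A = torch.rand(R, N, N)
X = torch.randn(N, 16)
out = TensorGCNSketch(16, 32, R)(A, X)   # (10, 32)
```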
arXiv Detail & Related papers (2020-03-15T02:33:21Z)