Cross-Modal State-Space Graph Reasoning for Structured Summarization
- URL: http://arxiv.org/abs/2503.20988v1
- Date: Wed, 26 Mar 2025 21:06:56 GMT
- Title: Cross-Modal State-Space Graph Reasoning for Structured Summarization
- Authors: Hannah Kim, Sofia Martinez, Jason Lee
- Abstract summary: Cross-modal summarization is critical for numerous applications, ranging from video analytics to medical reports. We propose a Cross-Modal State-Space Graph Reasoning (CSS-GR) framework that incorporates a state-space model with graph-based message passing. We demonstrate that our approach significantly improves summarization quality and interpretability while maintaining computational efficiency, as validated on standard multimodal summarization benchmarks.
- Score: 1.7766350477173578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to extract compact, meaningful summaries from large-scale and multimodal data is critical for numerous applications, ranging from video analytics to medical reports. Prior methods in cross-modal summarization have often suffered from high computational overhead and limited interpretability. In this paper, we propose a Cross-Modal State-Space Graph Reasoning (CSS-GR) framework that incorporates a state-space model with graph-based message passing, inspired by prior work on efficient state-space models. Unlike existing approaches relying on purely sequential models, our method constructs a graph that captures inter- and intra-modal relationships, allowing more holistic reasoning over both textual and visual streams. We demonstrate that our approach significantly improves summarization quality and interpretability while maintaining computational efficiency, as validated on standard multimodal summarization benchmarks. We also provide a thorough ablation study to highlight the contributions of each component.
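The abstract names three ingredients: a graph over textual and visual elements, message passing on that graph, and a state-space model. This page does not describe the paper's actual architecture, so the following is only a minimal illustrative sketch of how such pieces could fit together. All function names are hypothetical, and the cosine-similarity edge rule, mean aggregation, and plain linear recurrence are assumptions chosen for simplicity, not the authors' method.

```python
import numpy as np

def build_cross_modal_graph(text_feats, vis_feats, threshold=0.0):
    """Stack text and visual nodes; connect pairs whose cosine similarity
    exceeds `threshold` (an assumed edge rule, not taken from the paper)."""
    feats = np.vstack([text_feats, vis_feats])
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-8, None)
    adj = (unit @ unit.T > threshold).astype(float)
    np.fill_diagonal(adj, 1.0)  # self-loops keep every node's degree >= 1
    return feats, adj

def message_pass(feats, adj):
    """One round of mean-aggregation message passing over the graph."""
    deg = adj.sum(axis=1, keepdims=True)
    return (adj @ feats) / deg

def state_space_scan(inputs, A, B, C):
    """Discrete linear state-space model:
    h_t = A h_{t-1} + B x_t,  y_t = C h_t."""
    h = np.zeros(A.shape[0])
    outputs = []
    for x in inputs:
        h = A @ h + B @ x
        outputs.append(C @ h)
    return np.stack(outputs)

# Toy data: 4 text nodes and 3 visual nodes with 8-dimensional features.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 8))
vis_feats = rng.normal(size=(3, 8))

feats, adj = build_cross_modal_graph(text_feats, vis_feats)
mixed = message_pass(feats, adj)          # graph-refined node features

A = 0.9 * np.eye(16)                      # stable state transition
B = 0.1 * rng.normal(size=(16, 8))        # input projection
C = 0.1 * rng.normal(size=(8, 16))        # output projection
summary_states = state_space_scan(mixed, A, B, C)
print(summary_states.shape)  # (7, 8)
```

In this sketch the graph step mixes information across modalities before the recurrence runs, so each state update sees features already informed by neighbouring text and visual nodes. The scan costs time linear in the number of nodes, which is the usual efficiency argument for state-space models.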
Related papers
- End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models [0.0]
This paper introduces End-to-End Visual Chain-of-Thought (V-CoT) for chart summarization.
Our method directly trains an LVLM to process chart images and generate textual summaries in an end-to-end fashion.
We incorporate a visual Chain-of-Thought mechanism through instruction fine-tuning, implicitly guiding the LVLM to perform visual reasoning steps.
arXiv Detail & Related papers (2025-02-24T19:13:45Z) - Adaptive Message Passing: A General Framework to Mitigate Oversmoothing, Oversquashing, and Underreaching [23.487431014596556]
Long-range interactions are essential for the correct description of complex systems in many scientific fields.
Most deep graph networks cannot really model long-range dependencies due to intrinsic limitations of (synchronous) message passing.
This work proposes a general framework that learns to mitigate these limitations.
arXiv Detail & Related papers (2023-12-27T12:49:27Z) - IDRNet: Intervention-Driven Relation Network for Semantic Segmentation [34.09179171102469]
Co-occurrent visual patterns suggest that pixel relation modeling facilitates dense prediction tasks.
Despite the impressive results, existing paradigms often suffer from inadequate or ineffective contextual information aggregation.
We propose a novel Intervention-Driven Relation Network.
arXiv Detail & Related papers (2023-10-16T18:37:33Z) - Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling [96.75821232222201]
Existing research on multimodal relation extraction (MRE) faces two co-existing challenges, internal-information over-utilization and external-information under-exploitation.
We propose a novel framework that simultaneously implements the idea of internal-information screening and external-information exploiting.
arXiv Detail & Related papers (2023-05-19T14:56:57Z) - Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization [63.320005222549646]
Multimodal abstractive summarization (MAS) aims to produce a concise summary given the multimodal data (text and vision).
We propose to improve the summary quality through summary-oriented visual features.
Experiments on 44 languages, covering mid-high, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach.
arXiv Detail & Related papers (2022-12-15T09:05:26Z) - Scientific Paper Extractive Summarization Enhanced by Citation Graphs [50.19266650000948]
We focus on leveraging citation graphs to improve scientific paper extractive summarization under different settings.
Preliminary results demonstrate that the citation graph is helpful even in a simple unsupervised framework.
Motivated by this, we propose a Graph-based Supervised Summarization model (GSS) to achieve more accurate results on the task when large-scale labeled data are available.
arXiv Detail & Related papers (2022-12-08T11:53:12Z) - BASS: Boosting Abstractive Summarization with Unified Semantic Graph [49.48925904426591]
BASS is a framework for Boosting Abstractive Summarization based on a unified Semantic graph.
A graph-based encoder-decoder model is proposed to improve both the document representation and summary generation process.
Empirical results show that the proposed architecture brings substantial improvements for both long-document and multi-document summarization tasks.
arXiv Detail & Related papers (2021-05-25T16:20:48Z) - Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642]
We develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents.
Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial to summarizing long documents.
Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries.
arXiv Detail & Related papers (2020-05-20T13:39:47Z) - Spatial Pyramid Based Graph Reasoning for Semantic Segmentation [67.47159595239798]
We apply graph convolution to the semantic segmentation task and propose an improved Laplacian.
The graph reasoning is directly performed in the original feature space organized as a spatial pyramid.
We achieve comparable performance with advantages in computational and memory overhead.
arXiv Detail & Related papers (2020-03-23T12:28:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.