A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents
- URL: http://arxiv.org/abs/2507.22938v1
- Date: Fri, 25 Jul 2025 07:36:13 GMT
- Title: A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents
- Authors: Sumit Soman, H. G. Ranjani, Sujoy Roychowdhury, Venkata Dharma Surya Narayana Sastry, Akshat Jain, Pranav Gangrade, Ayaaz Khan
- Abstract summary: Question-Answering from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. We leverage graph representations of flowcharts obtained from Visual Large Language Models (VLMs) and incorporate them into a text-based RAG system to show that this approach can enable image retrieval for QA in the telecom domain.
- Score: 0.619840955350879
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Question-Answering (QA) from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. Text-based Retrieval Augmented Generation (RAG) systems may fail to answer such questions. We leverage graph representations of flowcharts obtained from Visual Large Language Models (VLMs) and incorporate them into a text-based RAG system, showing that this approach can enable image retrieval for QA in the telecom domain. We present the end-to-end approach: processing technical documents, classifying image types, building graph representations, and incorporating them into the text embedding pipeline for efficient retrieval. We benchmark the approach on a QA dataset created from proprietary telecom product information documents. Results show that the graph representations obtained using a fine-tuned VLM have lower edit distance with respect to the ground truth, which illustrates the robustness of these representations for flowchart images. Further, QA using these representations gives good retrieval performance with text-based embedding models, including a telecom-domain adapted one. Our approach also removes the need for a VLM at inference time, an important cost benefit for deployed QA systems.
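The abstract describes the pipeline only at a high level; the following is a minimal sketch of the core idea, assuming a toy flowchart, illustrative node labels, and the off-the-shelf sentence-transformers library with a generic embedding model as a stand-in (the paper's telecom-adapted embedder is not named here). A VLM-extracted flowchart graph is serialized to plain text so a text-only embedding model can index it alongside ordinary document chunks:

```python
# Minimal sketch: serialize a VLM-extracted flowchart graph to text so a
# text-only retriever can index it next to ordinary chunks. All labels,
# chunks, and the embedding model are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

def graph_to_text(nodes, edges):
    """Render a flowchart as 'source --condition--> target' lines."""
    lines = []
    for src, dst, label in edges:
        arrow = f" --{label}--> " if label else " --> "
        lines.append(nodes[src] + arrow + nodes[dst])
    return "\n".join(lines)

# Toy flowchart as a VLM might extract it from a telecom document figure.
nodes = {0: "Receive attach request", 1: "Subscriber authorized?",
         2: "Establish bearer", 3: "Reject with cause code"}
edges = [(0, 1, None), (1, 2, "yes"), (1, 3, "no")]

chunks = ["The attach procedure is described in Section 4.",
          graph_to_text(nodes, edges)]

model = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in embedding model
emb = model.encode(chunks, normalize_embeddings=True)
q = model.encode(["What happens if the subscriber is not authorized?"],
                 normalize_embeddings=True)
scores = (emb @ q.T).ravel()                      # cosine similarities
print(chunks[int(np.argmax(scores))])             # expected: the flowchart chunk
```

Because the graph lives as text at inference time, no VLM call is needed to answer the query, which matches the cost argument above; the paper's edit-distance benchmark could likewise be computed between such serializations and ground-truth graphs.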
Related papers
- Describe Anything Model for Visual Question Answering on Text-rich Images [7.618388911738171]
We introduce DAM-QA, a framework to harness the region-aware capabilities from DAM for the text-rich Visual Question Answering problem. Our approach consistently outperforms the baseline DAM, with a notable 7+ point gain on DocVQA. Results highlight the potential of DAM-like models for text-rich and broader VQA tasks when paired with efficient usage and integration strategies.
arXiv Detail & Related papers (2025-07-16T17:28:19Z)
- RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning [63.599057862999]
RefChartQA is a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%.
arXiv Detail & Related papers (2025-03-29T15:50:08Z)
- Optimizing open-domain question answering with graph-based retrieval augmented generation [5.2850605665217865]
We benchmark various graph-based retrieval-augmented generation (RAG) systems across a broad spectrum of query types. Traditional RAG methods often fall short in handling nuanced, multi-document tasks. We introduce TREX, a novel, cost-effective alternative that combines graph-based synthesis and vector-based retrieval techniques.
arXiv Detail & Related papers (2025-03-04T18:47:17Z)
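TREX's actual mechanism is not detailed in the snippet above; as a generic, hedged illustration of mixing graph structure with vector retrieval, the sketch below expands the vector top-k with graph neighbours before answer synthesis (the adjacency links and vectors are fabricated stand-ins):

```python
# Generic graph-plus-vector retrieval sketch (not TREX itself): fetch the
# vector top-k, then add graph neighbours of those hits as extra context.
import numpy as np

def hybrid_retrieve(query_vec, chunk_vecs, graph, k=2):
    """graph: dict mapping chunk index -> indices of linked chunks."""
    sims = chunk_vecs @ query_vec            # cosine sims (rows pre-normalized)
    top = np.argsort(-sims)[:k].tolist()
    hits = set(top)
    for i in top:                            # graph-expansion step
        hits.update(graph.get(i, []))
    return sorted(hits, key=lambda i: -sims[i])

rng = np.random.default_rng(0)
vecs = rng.normal(size=(5, 8))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
links = {0: [3], 1: [4]}                     # e.g., section or citation links
print(hybrid_retrieve(vecs[0], vecs, links))
```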
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Overcoming Vision Language Model Challenges in Diagram Understanding: A Proof-of-Concept with XML-Driven Large Language Models Solutions [0.0]
Diagrams play a crucial role in visually conveying complex relationships and processes within business documentation. Despite recent advances in Vision-Language Models (VLMs) for various image understanding tasks, accurately identifying and extracting the structures in diagrams continues to pose significant challenges. This study proposes a text-driven approach that bypasses reliance on VLMs' visual recognition capabilities.
arXiv Detail & Related papers (2025-02-05T23:40:26Z)
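The XML-driven idea above can be pictured with a short sketch; the element and attribute names are assumptions for illustration, not the cited paper's schema. A diagram's structure is written out as XML text so a plain LLM can reason over it without visual recognition:

```python
# Sketch: encode a flowchart's structure as XML for a text-only LLM.
# Element and attribute names are illustrative, not the paper's schema.
import xml.etree.ElementTree as ET

flow = ET.Element("flowchart")
for nid, label in [("n0", "Receive request"), ("n1", "Request valid?"),
                   ("n2", "Process request"), ("n3", "Return error")]:
    ET.SubElement(flow, "node", id=nid).text = label
for src, dst, cond in [("n0", "n1", ""), ("n1", "n2", "yes"), ("n1", "n3", "no")]:
    ET.SubElement(flow, "edge", {"from": src, "to": dst, "condition": cond})

xml_text = ET.tostring(flow, encoding="unicode")
print(xml_text)  # feed this string into an LLM prompt instead of the image
```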
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering [56.96857992123026]
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions.
This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which significantly improves knowledge retrieval in RA-VQA.
arXiv Detail & Related papers (2023-09-29T10:54:10Z)
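FLMR's exact scoring is not reproduced here; late-interaction retrieval is, however, commonly scored with a ColBERT-style MaxSim, sketched below with random vectors standing in for real text and image-region token embeddings:

```python
# ColBERT-style MaxSim sketch: each query token takes its best-matching
# document token; per-token maxima are summed into the document score.
# Random vectors stand in for real text/image-region token embeddings.
import numpy as np

def maxsim_score(q_tokens, d_tokens):
    """q_tokens: (Tq, D), d_tokens: (Td, D); rows are L2-normalized."""
    sim = q_tokens @ d_tokens.T              # (Tq, Td) token similarities
    return float(sim.max(axis=1).sum())      # best doc token per query token

rng = np.random.default_rng(1)
norm = lambda m: m / np.linalg.norm(m, axis=1, keepdims=True)
query = norm(rng.normal(size=(4, 16)))       # 4 query tokens
docs = [norm(rng.normal(size=(12, 16))) for _ in range(3)]
scores = [maxsim_score(query, d) for d in docs]
print(int(np.argmax(scores)))                # index of best-matching document
```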
- Knowledge Graph Prompting for Multi-Document Question Answering [46.29217406937293]
We propose a Knowledge Graph Prompting (KGP) method to formulate the right context in prompting for multi-document question answering (MD-QA).
For graph construction, we create a knowledge graph (KG) over multiple documents with nodes symbolizing passages or document structures (e.g., pages/tables).
arXiv Detail & Related papers (2023-08-22T18:41:31Z)
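As a rough sketch of the construction described above (the shared-keyword linking rule and the passages are stand-in assumptions, not the paper's own rules): passages become graph nodes, edges connect passages with overlapping vocabulary, and a traversal from a seed node gathers prompt context:

```python
# Passage-graph sketch in the spirit of KGP: nodes are passages, edges link
# passages sharing long keywords (a stand-in for the paper's linking rules).
import itertools
import networkx as nx

passages = {
    "p1": "The attach procedure authenticates the subscriber.",
    "p2": "If the network authenticates the subscriber, a bearer is set up.",
    "p3": "Billing records are generated after session teardown.",
}

def keywords(text):
    return {w.lower().strip(".,") for w in text.split() if len(w) > 6}

G = nx.Graph()
G.add_nodes_from(passages)
for a, b in itertools.combinations(passages, 2):
    if keywords(passages[a]) & keywords(passages[b]):
        G.add_edge(a, b)

# Traverse from a seed passage to assemble context for MD-QA prompting.
print(list(nx.bfs_tree(G, "p1")))  # ['p1', 'p2']; p3 stays disconnected
```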
- Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR).
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's ability to express search intent.
arXiv Detail & Related papers (2023-06-12T17:56:01Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- Analysis on Image Set Visual Question Answering [0.3359875577705538]
We tackle the challenge of Visual Question Answering in a multi-image setting.
Traditional VQA tasks have focused on a single-image setting where the target answer is generated from a single image.
In this report, we work with four approaches in a bid to improve performance on the task.
arXiv Detail & Related papers (2021-03-31T20:47:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.