DRISHTIKON: Visual Grounding at Multiple Granularities in Documents
- URL: http://arxiv.org/abs/2506.21316v2
- Date: Wed, 16 Jul 2025 01:55:35 GMT
- Title: DRISHTIKON: Visual Grounding at Multiple Granularities in Documents
- Authors: Badri Vishal Kasuba, Parag Chaudhuri, Ganesh Ramakrishnan,
- Abstract summary: DRISHTIKON is a multi-granular and multi-block visual grounding framework. Our approach integrates multilingual OCR, large language models, and a novel region matching algorithm to localize answer spans. Our findings pave the way for more robust and interpretable document understanding systems.
- Score: 21.376466879737855
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual grounding in text-rich document images is a critical yet underexplored challenge for Document Intelligence and Visual Question Answering (VQA) systems. We present DRISHTIKON, a multi-granular and multi-block visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates multilingual OCR, large language models, and a novel region matching algorithm to localize answer spans at the block, line, word, and point levels. We introduce the Multi-Granular Visual Grounding (MGVG) benchmark, a curated test set of diverse circular notifications from various sectors, each manually annotated with fine-grained, human-verified labels across multiple granularities. Extensive experiments show that our method achieves state-of-the-art grounding accuracy, with line-level granularity providing the best balance between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations reveal that leading vision-language models struggle with precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios with multi-granular grounding support. Code and dataset are made available for future research.
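To make the grounding pipeline concrete, the following is a minimal, hypothetical sketch of how an answer string could be matched against OCR output and reported at block, line, word, and point granularities. It is not the paper's region matching algorithm; the OCRWord structure, the fuzzy sliding-window matching, and the ground_answer helper are assumptions introduced purely for illustration.

```python
# Illustrative sketch only: a toy multi-granular grounding step over OCR output.
# The data model and the fuzzy-matching heuristic are assumptions, not the
# paper's actual region matching algorithm.
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple

@dataclass
class OCRWord:
    text: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    line_id: int
    block_id: int

def _union(boxes: List[Tuple[int, int, int, int]]) -> Tuple[int, int, int, int]:
    """Smallest box covering all input boxes."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return (min(xs0), min(ys0), max(xs1), max(ys1))

def ground_answer(words: List[OCRWord], answer: str) -> dict:
    """Find the OCR word window most similar to the answer string and report
    its location at word, line, block, and point granularity."""
    target = " ".join(answer.lower().split())
    n_target = len(target.split())
    best_score, best_span = 0.0, (0, 1)
    # Slide windows roughly the length of the answer over the OCR word sequence.
    for start in range(len(words)):
        for end in range(start + 1, min(start + n_target + 3, len(words)) + 1):
            window = " ".join(w.text.lower() for w in words[start:end])
            score = SequenceMatcher(None, window, target).ratio()
            if score > best_score:
                best_score, best_span = score, (start, end)
    span = words[best_span[0]:best_span[1]]
    line_ids = {w.line_id for w in span}
    block_ids = {w.block_id for w in span}
    line_box = _union([w.box for w in words if w.line_id in line_ids])
    block_box = _union([w.box for w in words if w.block_id in block_ids])
    return {
        "score": best_score,
        "word_boxes": [w.box for w in span],          # word-level grounding
        "line_box": line_box,                         # line-level grounding
        "block_box": block_box,                       # block-level grounding
        "point": ((line_box[0] + line_box[2]) // 2,
                  (line_box[1] + line_box[3]) // 2),  # point-level grounding
    }

if __name__ == "__main__":
    # Tiny synthetic OCR result for a single line inside one block.
    words = [
        OCRWord("Deadline:", (10, 10, 90, 30), line_id=0, block_id=0),
        OCRWord("31", (95, 10, 115, 30), line_id=0, block_id=0),
        OCRWord("March", (120, 10, 175, 30), line_id=0, block_id=0),
        OCRWord("2025", (180, 10, 225, 30), line_id=0, block_id=0),
    ]
    print(ground_answer(words, "31 March 2025"))
```

In the system described above, the answer text would come from an LLM over multilingual OCR output; this sketch substitutes a simple fuzzy string window for that matching step purely to show how one span can yield groundings at all four granularities.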
Related papers
- MGCR-Net:Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z) - SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation [5.458935851230595]
We present SCAN, a novel approach enhancing both textual and visual Retrieval-Augmented Generation (RAG) systems. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering continuous components. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.0% and visual RAG performance by up to 6.4%.
arXiv Detail & Related papers (2025-05-20T14:03:24Z) - A Multi-Granularity Retrieval Framework for Visually-Rich Documents [4.804551482123172]
We propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering. Our framework demonstrates robust performance without the need for task-specific fine-tuning.
arXiv Detail & Related papers (2025-05-01T02:40:30Z) - Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z) - Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models [19.054780489639793]
This paper introduces Progressive multi-granular Vision-Language alignments (PromViL). Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning.
arXiv Detail & Related papers (2024-12-11T06:21:33Z) - DOGR: Towards Versatile Visual Document Grounding and Referring [47.66205811791444]
Grounding and referring capabilities have gained increasing attention for achieving detailed understanding and flexible user interaction. We propose the DOcument Grounding and Referring data engine (DOGR-Engine), which generates two types of high-quality fine-grained document data. Using the DOGR-Engine, we construct DOGR-Bench, a benchmark covering seven grounding and referring tasks across three document types.
arXiv Detail & Related papers (2024-11-26T05:38:34Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z) - HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z) - MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding [53.03978356918377]
Spatial hierarchical relationships between content at different levels of granularity are crucial for document image understanding tasks.
Existing methods learn features at either the word level or the region level, but fail to consider both simultaneously.
We propose MGDoc, a new multi-modal multi-granular pre-training framework that encodes page-level, region-level, and word-level information at the same time.
arXiv Detail & Related papers (2022-11-27T22:47:37Z)