DRISHTIKON: Visual Grounding at Multiple Granularities in Documents
- URL: http://arxiv.org/abs/2506.21316v2
- Date: Wed, 16 Jul 2025 01:55:35 GMT
- Title: DRISHTIKON: Visual Grounding at Multiple Granularities in Documents
- Authors: Badri Vishal Kasuba, Parag Chaudhuri, Ganesh Ramakrishnan,
- Abstract summary: DRISHTIKON is a multi-granular and multi-block visual grounding framework. Our approach integrates multilingual OCR, large language models, and a novel region matching algorithm to localize answer spans. Our findings pave the way for more robust and interpretable document understanding systems.
- Score: 21.376466879737855
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual grounding in text-rich document images is a critical yet underexplored challenge for Document Intelligence and Visual Question Answering (VQA) systems. We present DRISHTIKON, a multi-granular and multi-block visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates multilingual OCR, large language models, and a novel region matching algorithm to localize answer spans at the block, line, word, and point levels. We introduce the Multi-Granular Visual Grounding (MGVG) benchmark, a curated test set of diverse circular notifications from various sectors, each manually annotated with fine-grained, human-verified labels across multiple granularities. Extensive experiments show that our method achieves state-of-the-art grounding accuracy, with line-level granularity providing the best balance between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations reveal that leading vision-language models struggle with precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios with multi-granular grounding support. Code and dataset are made available for future research.
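To make the grounding pipeline concrete, the following is a minimal, hypothetical sketch of how an answer string could be matched against OCR output and reported at block, line, word, and point granularities. It is not the paper's region matching algorithm; the OCRWord structure, the fuzzy sliding-window matching, and the ground_answer helper are assumptions introduced purely for illustration.

```python
# Illustrative sketch only: a toy multi-granular grounding step over OCR output.
# The data model and the fuzzy-matching heuristic are assumptions, not the
# paper's actual region matching algorithm.
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple

@dataclass
class OCRWord:
    text: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    line_id: int
    block_id: int

def _union(boxes: List[Tuple[int, int, int, int]]) -> Tuple[int, int, int, int]:
    """Smallest box covering all input boxes."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return (min(xs0), min(ys0), max(xs1), max(ys1))

def ground_answer(words: List[OCRWord], answer: str) -> dict:
    """Find the OCR word window most similar to the answer string and report
    its location at word, line, block, and point granularity."""
    target = " ".join(answer.lower().split())
    n_target = len(target.split())
    best_score, best_span = 0.0, (0, 1)
    # Slide windows roughly the length of the answer over the OCR word sequence.
    for start in range(len(words)):
        for end in range(start + 1, min(start + n_target + 3, len(words)) + 1):
            window = " ".join(w.text.lower() for w in words[start:end])
            score = SequenceMatcher(None, window, target).ratio()
            if score > best_score:
                best_score, best_span = score, (start, end)
    span = words[best_span[0]:best_span[1]]
    line_ids = {w.line_id for w in span}
    block_ids = {w.block_id for w in span}
    line_box = _union([w.box for w in words if w.line_id in line_ids])
    block_box = _union([w.box for w in words if w.block_id in block_ids])
    return {
        "score": best_score,
        "word_boxes": [w.box for w in span],          # word-level grounding
        "line_box": line_box,                         # line-level grounding
        "block_box": block_box,                       # block-level grounding
        "point": ((line_box[0] + line_box[2]) // 2,
                  (line_box[1] + line_box[3]) // 2),  # point-level grounding
    }

if __name__ == "__main__":
    # Tiny synthetic OCR result for a single line inside one block.
    words = [
        OCRWord("Deadline:", (10, 10, 90, 30), line_id=0, block_id=0),
        OCRWord("31", (95, 10, 115, 30), line_id=0, block_id=0),
        OCRWord("March", (120, 10, 175, 30), line_id=0, block_id=0),
        OCRWord("2025", (180, 10, 225, 30), line_id=0, block_id=0),
    ]
    print(ground_answer(words, "31 March 2025"))
```

In the system described above, the answer text would come from an LLM over multilingual OCR output; this sketch substitutes a simple fuzzy string window for that matching step purely to show how one span can yield groundings at all four granularities.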
Related papers
- MGCR-Net:Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z) - SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation [5.458935851230595]
We present SCAN, a novel approach enhancing both textual and visual Retrieval-Augmented Generation (RAG) systems. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering continuous components. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.0% and visual RAG performance by up to 6.4%.
arXiv Detail & Related papers (2025-05-20T14:03:24Z) - A Multi-Granularity Retrieval Framework for Visually-Rich Documents [4.804551482123172]
We propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering. Our framework demonstrates robust performance without the need for task-specific fine-tuning.
arXiv Detail & Related papers (2025-05-01T02:40:30Z) - Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z) - Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models [19.054780489639793]
This paper introduces Progressive multi-granular Vision-Language alignments (PromViL). Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning.
arXiv Detail & Related papers (2024-12-11T06:21:33Z) - DOGR: Towards Versatile Visual Document Grounding and Referring [47.66205811791444]
Grounding and referring capabilities have gained increasing attention for achieving detailed understanding and flexible user interaction. We propose the DOcument Grounding and Referring data engine (DOGR-Engine), which generates two types of high-quality fine-grained document data. Using the DOGR-Engine, we construct DOGR-Bench, a benchmark covering seven grounding and referring tasks across three document types.
arXiv Detail & Related papers (2024-11-26T05:38:34Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z) - HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z) - MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding [53.03978356918377]
Spatial hierarchical relationships between content at different levels of granularity are crucial for document image understanding tasks.
Existing methods learn features at either the word level or the region level, but fail to consider both simultaneously.
We propose MGDoc, a new multi-modal multi-granular pre-training framework that encodes page-level, region-level, and word-level information at the same time.
arXiv Detail & Related papers (2022-11-27T22:47:37Z)