Hierarchical Contextual Grounding LVLM: Enhancing Fine-Grained Visual-Language Understanding with Robust Grounding
- URL: http://arxiv.org/abs/2508.16974v1
- Date: Sat, 23 Aug 2025 09:57:52 GMT
- Title: Hierarchical Contextual Grounding LVLM: Enhancing Fine-Grained Visual-Language Understanding with Robust Grounding
- Authors: Leilei Guo, Antonio Carlos Rivera, Peiyu Tang, Haoxuan Ren, Zheyu Song
- Abstract summary: Hierarchical Contextual Grounding LVLM (HCG-LVLM) is a novel architecture that mimics human coarse-to-fine cognitive processing. Our model achieves superior accuracy and significantly reduces hallucination, validating the effectiveness of its hierarchical design.
- Score: 0.3262230127283452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have achieved remarkable progress in natural language processing and multimodal understanding. Despite their impressive generalization capabilities, current LVLMs often lack robustness and are prone to hallucination and reasoning errors in complex real-world scenarios, particularly when precise image region localization and fine-grained visual reasoning are required. To address these limitations, we propose the Hierarchical Contextual Grounding LVLM (HCG-LVLM), a novel architecture that mimics human coarse-to-fine cognitive processing. HCG-LVLM employs a two-layered approach: a Global Contextual Perception layer for initial broad understanding and a Fine-grained Local Grounding layer. The latter incorporates a Local Detail Enhancement Module to extract high-resolution features and a Semantic Consistency Validator to ensure accurate, hallucination-free visual-language alignment. Through an adaptive fusion mechanism, information from both layers is integrated for robust and precise outputs. Extensive experiments on challenging datasets, including GQA and A-OKVQA for fine-grained VQA and RefCOCO/+/g for Referring Expression Comprehension, demonstrate that HCG-LVLM consistently outperforms state-of-the-art models such as Flamingo, BLIP-2, and MiniGPT-4. Our model achieves superior accuracy and significantly reduces hallucination, validating the effectiveness of its hierarchical design in enhancing fine-grained visual-language understanding and precise grounding capabilities.
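The abstract names the architecture's components but not their implementation. Below is a minimal, hedged PyTorch sketch of the coarse-to-fine design it describes; the module interfaces, dimensions, and the gating formulation are all assumptions, not the paper's released code.

```python
# Minimal sketch of a two-layer coarse-to-fine design like HCG-LVLM.
# All module names, dimensions, and the gated fusion are assumptions.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Gated fusion of global and local features (assumed formulation)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([global_feat, local_feat], dim=-1))
        return g * local_feat + (1.0 - g) * global_feat

class HCGSketch(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        # Global Contextual Perception layer: coarse, image-level encoding.
        self.global_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # Fine-grained Local Grounding layer over high-resolution region
        # tokens (a stand-in for the Local Detail Enhancement Module).
        self.local_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # Semantic Consistency Validator stand-in: soft query-region agreement.
        self.validator = nn.Linear(dim, dim)
        self.fusion = AdaptiveFusion(dim)

    def forward(self, image_tokens, region_tokens, text_query):
        g = self.global_layer(image_tokens).mean(dim=1)        # (B, D) coarse summary
        l = self.local_layer(region_tokens)                    # (B, R, D) region features
        q = self.validator(text_query).unsqueeze(1)            # (B, 1, D)
        # Keep only regions consistent with the query (soft weighting).
        scores = (l * q).sum(dim=-1, keepdim=True).softmax(dim=1)
        l = (scores * l).sum(dim=1)                            # (B, D) validated local summary
        return self.fusion(g, l)

x = torch.randn(2, 196, 768)   # coarse image tokens
r = torch.randn(2, 32, 768)    # high-res region tokens
q = torch.randn(2, 768)        # pooled text query
print(HCGSketch()(x, r, q).shape)  # torch.Size([2, 768])
```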
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
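As a rough illustration of "injecting topological structures into the VLM backbone": the hedged sketch below encodes viewpoint-graph nodes with one round of mean-neighbor message passing and appends them to the token stream. The message-passing rule and the token-append strategy are assumptions, not TagaVLM's actual design.

```python
# Hedged sketch: topological graph tokens injected into a VLM token stream.
import torch
import torch.nn as nn

class TopoNodeEncoder(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, D) viewpoint features; adj: (N, N) 0/1 connectivity.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighbor_mean = adj @ node_feats / deg  # mean over graph neighbors
        return torch.relu(self.update(torch.cat([node_feats, neighbor_mean], dim=-1)))

# Append graph-node tokens after the prompt tokens so the backbone can
# reason over the whole explored topology when choosing the next action.
nodes = torch.randn(6, 768)                 # 6 visited/candidate viewpoints
adj = (torch.rand(6, 6) > 0.5).float()
node_tokens = TopoNodeEncoder()(nodes, adj)
sequence = torch.cat([torch.randn(40, 768), node_tokens], dim=0)  # prompt + graph tokens
print(sequence.shape)  # torch.Size([46, 768])
```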
- From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection. We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities.
arXiv Detail & Related papers (2026-01-15T18:59:10Z)
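A hedged sketch of a many-to-many cross-layer bridge in this spirit: each language layer attends to a learned mixture of several vision-encoder layers rather than only the final one. A static learned mixture stands in for the paper's dynamic weighting, and all shapes are assumed.

```python
# Sketch: each language layer cross-attends to its own mixture of vision layers.
import torch
import torch.nn as nn

class CrossLayerInjection(nn.Module):
    def __init__(self, num_vision_layers: int, num_lang_layers: int, dim: int = 768):
        super().__init__()
        # One learnable mixing vector over vision layers per language layer.
        self.mix = nn.Parameter(torch.zeros(num_lang_layers, num_vision_layers))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, lang_hidden: torch.Tensor, vision_stack: torch.Tensor, layer_idx: int):
        # vision_stack: (num_vision_layers, B, T_v, D); lang_hidden: (B, T_l, D)
        w = self.mix[layer_idx].softmax(dim=0)                        # per-layer mixture
        mixed = (w[:, None, None, None] * vision_stack).sum(dim=0)   # (B, T_v, D)
        injected, _ = self.attn(lang_hidden, mixed, mixed)            # cross-attend to vision
        return lang_hidden + injected                                 # residual injection

bridge = CrossLayerInjection(num_vision_layers=4, num_lang_layers=12)
v = torch.randn(4, 2, 64, 768)   # hidden states from 4 vision layers
h = torch.randn(2, 16, 768)      # hidden state of one language layer
print(bridge(h, v, layer_idx=3).shape)  # torch.Size([2, 16, 768])
```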
- BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion [7.382475458362566]
We present BREATH-VL, a hybrid framework that integrates semantic cues from vision-language models with geometric information from registration methods for accurate 6-DoF pose estimation. Building on this, BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline.
arXiv Detail & Related papers (2026-01-07T09:00:52Z)
- The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs [44.71703930770065]
We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals such as face matching and text-in-vision comprehension. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations.
arXiv Detail & Related papers (2025-12-17T20:22:23Z)
- VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs [13.486495756813078]
Vision-Language Models (VLMs) excel at high-level scene understanding but falter on fine-grained perception tasks requiring precise localization. We introduce VLM-FO1, a novel framework that overcomes this limitation by reframing object-centric perception as a robust feature retrieval task. Our method operates as a plug-and-play module that integrates with any pre-trained VLM.
arXiv Detail & Related papers (2025-09-30T08:10:56Z)
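As a toy illustration of "perception as feature retrieval": rather than regressing coordinates, select the candidate region whose feature best matches the query embedding. The proposal and encoding stages are assumed to come from an arbitrary pre-trained VLM; this is not VLM-FO1's actual module.

```python
# Sketch: localization recast as nearest-feature retrieval over region proposals.
import torch
import torch.nn.functional as F

def retrieve_region(query_emb: torch.Tensor,
                    region_feats: torch.Tensor,
                    region_boxes: torch.Tensor) -> torch.Tensor:
    """query_emb: (D,), region_feats: (R, D), region_boxes: (R, 4) in xyxy."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), region_feats, dim=-1)
    return region_boxes[sims.argmax()]  # box of the best-matching region

boxes = torch.tensor([[0., 0., 50., 50.], [40., 40., 120., 120.]])
feats = torch.randn(2, 768)
query = feats[1] + 0.01 * torch.randn(768)   # query near region 1's feature
print(retrieve_region(query, feats, boxes))  # tensor([ 40.,  40., 120., 120.])
```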
- Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models [60.39744129890118]
Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information with human languages. In this work, we identify a salient correlation between the multilingual understanding ability of LVLMs and language-specific neuron activations in shallow layers. We introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage-Specific layers fine-Tuning.
arXiv Detail & Related papers (2025-08-25T18:15:25Z)
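The recipe amounts to updating only the shallow, language-specific layers. A minimal sketch of that selective fine-tuning pattern, with placeholder layer indices standing in for the layers PLAST would actually identify:

```python
# Sketch: freeze everything, then unfreeze only the selected shallow layers.
import torch.nn as nn

def tune_only_layers(model: nn.Module, layer_prefixes: list[str]) -> None:
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in layer_prefixes)

# Toy stand-in for a language backbone with 12 blocks.
backbone = nn.Sequential(*[nn.Linear(768, 768) for _ in range(12)])
# Suppose layers 0-2 were flagged as language-specific (assumed indices).
tune_only_layers(backbone, ["0.", "1.", "2."])
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters())
print(f"trainable: {trainable}/{total}")  # only the three shallow layers update
```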
- Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation [42.78181795494584]
Hi-SSLVLM is a generative model designed to significantly advance text-to-image synthesis. It addresses limitations through a unique two-stage self-supervised learning strategy. Experiments demonstrate Hi-SSLVLM's superior performance across all fine-grained metrics.
arXiv Detail & Related papers (2025-07-05T20:16:32Z)
- GLIMPSE: Holistic Cross-Modal Explainability for Large Vision-Language Models [0.0]
We introduce GLIMPSE, a model-agnostic framework that jointly attributes LVLM outputs to the most relevant visual evidence and textual signals. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation to produce holistic response-level heat maps. We demonstrate an analytic approach to uncover fine-grained insights into LVLM cross-modal attribution, trace reasoning dynamics, analyze systematic misalignment, diagnose hallucination and bias, and ensure transparency.
arXiv Detail & Related papers (2025-06-23T18:00:04Z)
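Of the three ingredients, gradient-weighted attention is the easiest to sketch: weight each attention map by the gradient of the output score with respect to it and keep the positive evidence. The toy below omits GLIMPSE's layer propagation and token aggregation, and all shapes are assumed.

```python
# Sketch: gradient-weighted attention attribution on a single toy attention map.
import torch

torch.manual_seed(0)
tokens = torch.randn(1, 10, 64, requires_grad=True)  # 10 tokens (text + image patches)
q = k = v = tokens
attn = (q @ k.transpose(-2, -1) / 64 ** 0.5).softmax(dim=-1)
attn.retain_grad()                                   # keep the gradient for attribution
out = attn @ v
score = out[0, 0].sum()                              # e.g. logit of the generated token
score.backward()
relevance = (attn.grad * attn).clamp(min=0)          # gradient-weighted attention
heat = relevance[0].mean(dim=0)                      # per-token relevance, shape (10,)
print(heat / heat.sum())                             # normalized relevance over tokens
```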
- Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, with discrepancies when given textual and visual inputs that correspond to the same concept. We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
- Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models [30.20915403608803]
Griffon is a large vision-language model trained on a language-prompted localization dataset. Trained end-to-end through a well-designed pipeline, it achieves state-of-the-art performance on the fine-grained RefCOCO series and Flickr30K Entities.
arXiv Detail & Related papers (2023-11-24T15:35:07Z)
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
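A hedged sketch of dual-level injection as the summary describes it: fine-grained region features enter as extra visual tokens through an adapter, while high-level semantic knowledge (e.g., predicted image tags) enters as text. The adapter design and tag format below are assumptions, not LION's actual interfaces.

```python
# Sketch: inject region-level features as tokens and semantic tags as text.
import torch
import torch.nn as nn

class DualLevelInjector(nn.Module):
    def __init__(self, region_dim: int = 1024, llm_dim: int = 768):
        super().__init__()
        self.region_adapter = nn.Linear(region_dim, llm_dim)  # spatial-aware level

    def forward(self, image_tokens, region_feats, tags: list[str]):
        region_tokens = self.region_adapter(region_feats)     # (B, R, llm_dim)
        visual = torch.cat([image_tokens, region_tokens], dim=1)
        # Semantic level is injected as plain text prepended to the prompt.
        prompt = "Image tags: " + ", ".join(tags) + ". "
        return visual, prompt

inj = DualLevelInjector()
vis, prompt = inj(torch.randn(1, 196, 768), torch.randn(1, 8, 1024), ["dog", "frisbee"])
print(vis.shape, "|", prompt)  # torch.Size([1, 204, 768]) | Image tags: dog, frisbee.
```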
This list is automatically generated from the titles and abstracts of the papers on this site.