Global Regulation and Excitation via Attention Tuning for Stereo Matching
- URL: http://arxiv.org/abs/2509.15891v1
- Date: Fri, 19 Sep 2025 11:42:07 GMT
- Title: Global Regulation and Excitation via Attention Tuning for Stereo Matching
- Authors: Jiahao Li, Xinhong Chen, Zhengmin Jiang, Qian Zhou, Yung-Hui Li, Jianping Wang,
- Abstract summary: We propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework which encompasses three attention modules.<n>Specifically, Spatial Attention (SA) captures the global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost-volume excited by global context and geometric details.<n>This framework demonstrates superior performance in challenging ill-posed regions.
- Score: 21.40608901253552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stereo matching achieves significant progress with iterative algorithms like RAFT-Stereo and IGEV-Stereo. However, these methods struggle in ill-posed regions with occlusions, textureless, or repetitive patterns, due to a lack of global context and geometric information for effective iterative refinement. To enable the existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework which encompasses three attention modules. Specifically, Spatial Attention (SA) captures the global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost-volume excited by global context and geometric details. To verify the universality and effectiveness of this framework, we integrate it into several representative iterative stereo-matching methods and validate it through extensive experiments, collectively denoted as GREAT-Stereo. This framework demonstrates superior performance in challenging ill-posed regions. Applied to IGEV-Stereo, among all published methods, our GREAT-IGEV ranks first on the Scene Flow test set, KITTI 2015, and ETH3D leaderboards, and achieves second on the Middlebury benchmark. Code is available at https://github.com/JarvisLee0423/GREAT-Stereo.
Related papers
- OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation.<n>To bridge this gap, this paper targets to develop a unified framework for omnibus vision-language forgery detection and grounding.<n>We propose textbf OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding.
arXiv Detail & Related papers (2026-02-11T09:41:36Z) - CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding [5.663297699303346]
We introduce CogStereo, a novel framework that addresses challenging regions without relying on dataset-specific priors.<n>CogStereo embeds implicit spatial cognition into the refinement process by using monocular depth features as priors.<n>CogStereo employs a dual-conditional refinement mechanism that combines pixel-wise uncertainty with cognition-guided features for consistent global correction of mismatches.
arXiv Detail & Related papers (2025-10-25T02:09:04Z) - GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition [72.29071664964633]
We propose GLip, a Global-Local Integrated Progressive framework for robust visual speech recognition.<n>GLip learns to align both global and local visual features with corresponding acoustic speech units.<n>It consistently outperforms existing methods on the LRS2 and LRS3 benchmarks.
arXiv Detail & Related papers (2025-09-19T14:36:01Z) - CLUE: Leveraging Low-Rank Adaptation to Capture Latent Uncovered Evidence for Image Forgery Localization [35.73353140683283]
Increasing accessibility of image editing tools and generative AI has led to a proliferation of visually convincing forgeries.<n>In this paper, we repurpose the mechanism of a state-of-the-art (SOTA) text-to-image synthesis model by exploiting its internal generative process.<n>We propose CLUE, a framework that employs Low- Rank Adaptation (LoRA) to parameter-efficiently reconfigure Stable Diffusion 3 (SD3) as a forensic feature extractor.
arXiv Detail & Related papers (2025-08-10T16:22:30Z) - World-Consistent Data Generation for Vision-and-Language Navigation [33.13590164890286]
Vision-and-Language Navigation (VLN) is a challenging task that requires an agent to navigate through photorealistic environments following natural-language instructions.<n>One main obstacle existing in VLN is data scarcity, leading to poor generalization performance over unseen environments.<n>We propose the world-consistent data generation (WCGEN), an efficacious data-augmentation framework satisfying both diversity and world-consistency.
arXiv Detail & Related papers (2024-12-09T11:40:54Z) - Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching [17.344430840048094]
Recent learning-based methods prioritize optimal performance on a single stereo pair, resulting in temporal inconsistencies.
We develop a bidirectional alignment mechanism for adjacent frames as a fundamental operation.
Unlike the existing methods, we model this task as local matching and global aggregation.
arXiv Detail & Related papers (2024-03-16T01:38:28Z) - GUESR: A Global Unsupervised Data-Enhancement with Bucket-Cluster
Sampling for Sequential Recommendation [58.6450834556133]
We propose graph contrastive learning to enhance item representations with complex associations from the global view.
We extend the CapsNet module with the elaborately introduced target-attention mechanism to derive users' dynamic preferences.
Our proposed GUESR could not only achieve significant improvements but also could be regarded as a general enhancement strategy.
arXiv Detail & Related papers (2023-03-01T05:46:36Z) - RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video
Retrieval [66.2075707179047]
We propose a novel mixture-of-expert transformer RoME that disentangles the text and the video into three levels.
We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels.
Our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
arXiv Detail & Related papers (2022-06-26T11:12:49Z) - AdaStereo: An Efficient Domain-Adaptive Stereo Matching Approach [50.855679274530615]
We present a novel domain-adaptive approach called AdaStereo to align multi-level representations for deep stereo matching networks.
Our models achieve state-of-the-art cross-domain performance on multiple benchmarks, including KITTI, Middlebury, ETH3D and DrivingStereo.
Our method is robust to various domain adaptation settings, and can be easily integrated into quick adaptation application scenarios and real-world deployments.
arXiv Detail & Related papers (2021-12-09T15:10:47Z) - PVStereo: Pyramid Voting Module for End-to-End Self-Supervised Stereo
Matching [14.603116313499648]
We propose a robust and effective self-supervised stereo matching approach, consisting of a pyramid voting module (PVM) and a novel DCNN architecture, referred to as OptStereo.
Specifically, our OptStereo first builds multi-scale cost volumes, and then adopts a recurrent unit to iteratively update disparity estimations at high resolution.
We publish the HKUST-Drive dataset, a large-scale synthetic stereo dataset, collected under different illumination and weather conditions for research purposes.
arXiv Detail & Related papers (2021-03-12T05:27:14Z) - Global Context Aware RCNN for Object Detection [1.1939762265857436]
We propose a novel end-to-end trainable framework, called Global Context Aware (GCA) RCNN.
The core component of GCA framework is a context aware mechanism, in which both global feature pyramid and attention strategies are used for feature extraction and feature refinement.
In the end, we also present a lightweight version of our method, which only slightly increases model complexity and computational burden.
arXiv Detail & Related papers (2020-12-04T14:56:46Z) - AdaStereo: A Simple and Efficient Approach for Adaptive Stereo Matching [50.06646151004375]
A novel domain-adaptive pipeline called AdaStereo aims to align multi-level representations for deep stereo matching networks.
Our AdaStereo models achieve state-of-the-art cross-domain performance on multiple stereo benchmarks, including KITTI, Middlebury, ETH3D, and DrivingStereo.
arXiv Detail & Related papers (2020-04-09T16:15:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.