Related papers: Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching

URL: http://arxiv.org/abs/2403.10755v1
Date: Sat, 16 Mar 2024 01:38:28 GMT
Title: Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching
Authors: Junpeng Jing, Ye Mao, Krystian Mikolajczyk
Abstract summary: Recent learning-based methods prioritize optimal performance on a single stereo pair, resulting in temporal inconsistencies. We develop a bidirectional alignment mechanism for adjacent frames as a fundamental operation. Unlike the existing methods, we model this task as local matching and global aggregation.
Score: 17.344430840048094
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Dynamic stereo matching is the task of estimating consistent disparities from stereo videos with dynamic objects. Recent learning-based methods prioritize optimal performance on a single stereo pair, resulting in temporal inconsistencies. Existing video methods apply per-frame matching and window-based cost aggregation across the time dimension, leading to low-frequency oscillations at the scale of the window size. Towards this challenge, we develop a bidirectional alignment mechanism for adjacent frames as a fundamental operation. We further propose a novel framework, BiDAStereo, that achieves consistent dynamic stereo matching. Unlike the existing methods, we model this task as local matching and global aggregation. Locally, we consider correlation in a triple-frame manner to pool information from adjacent frames and improve the temporal consistency. Globally, to exploit the entire sequence's consistency and extract dynamic scene cues for aggregation, we develop a motion-propagation recurrent unit. Extensive experiments demonstrate the performance of our method, showcasing improvements in prediction quality and achieving state-of-the-art results on various commonly used benchmarks.

Related papers

Adaptive Keyframe Selection for Scalable 3D Scene Reconstruction in Dynamic Environments [10.967576917866408]
We propose an adaptive selection method for improved 3D scene reconstruction in dynamic environments. The proposed method integrates two complementary modules: an error-based selection module and a momentum-based update module. We evaluate our proposed adaptive selection module on two recent state-of-the-art 3D reconstruction networks, Spann3r and CUT3R.
arXiv Detail & Related papers (2025-10-27T23:25:57Z)
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment [76.32508013503653]
We propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We tackle the mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. We improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens.
arXiv Detail & Related papers (2025-05-02T12:59:58Z)
Stereo Any Video: Temporally Consistent Stereo Matching [15.876953256378224]
This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It can estimate spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. Key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence.
arXiv Detail & Related papers (2025-03-07T16:20:36Z)
Match Stereo Videos via Bidirectional Alignment [15.876953256378224]
Recent learning-based methods often focus on optimizing performance for independent stereo pairs, leading to temporal inconsistencies in videos. We introduce a novel video processing framework, BiDAStereo, and a plugin stabilizer network, BiDAStabilizer, compatible with general image-based methods. We present a realistic synthetic dataset and benchmark focused on natural scenes, along with a real-world dataset captured by a stereo camera in diverse urban scenes for qualitative evaluation.
arXiv Detail & Related papers (2024-09-30T13:37:29Z)
Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion. Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching [77.133400999703]
Correlation based stereo matching has achieved outstanding performance. Current methods with a fixed model do not work uniformly well across various datasets. This paper proposes a new perspective to dynamically calculate correlation for robust stereo matching.
arXiv Detail & Related papers (2023-07-26T09:47:37Z)
DynamicStereo: Consistent Dynamic Depth from Stereo Videos [91.1804971397608]
We propose DynamicStereo to estimate disparity for stereo videos. The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions. We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments.
arXiv Detail & Related papers (2023-05-03T17:40:49Z)
A Convolutional-Attentional Neural Framework for Structure-Aware Performance-Score Synchronization [12.951369232106178]
Performance-score synchronization is an integral task in signal processing. Traditional synchronization methods compute alignment using knowledge-driven approaches. We present a novel data-driven method for structure-aware performance-score synchronization.
arXiv Detail & Related papers (2022-04-19T11:41:21Z)
MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video [75.23812405203778]
Recent solutions estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation. We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. In addition, the network output is extended from the central frame to all frames of the input video, improving the coherence between the input and output sequences.
arXiv Detail & Related papers (2022-03-02T04:20:59Z)
3D Skeleton-based Few-shot Action Recognition with JEANIE is not so Naïve [28.720272938306692]
We propose a Few-shot Learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt.
arXiv Detail & Related papers (2021-12-23T16:09:23Z)
AdaStereo: An Efficient Domain-Adaptive Stereo Matching Approach [50.855679274530615]
We present a novel domain-adaptive approach called AdaStereo to align multi-level representations for deep stereo matching networks. Our models achieve state-of-the-art cross-domain performance on multiple benchmarks, including KITTI, Middlebury, ETH3D and DrivingStereo. Our method is robust to various domain adaptation settings, and can be easily integrated into quick adaptation application scenarios and real-world deployments.
arXiv Detail & Related papers (2021-12-09T15:10:47Z)
AdaStereo: A Simple and Efficient Approach for Adaptive Stereo Matching [50.06646151004375]
A novel domain-adaptive pipeline called AdaStereo aims to align multi-level representations for deep stereo matching networks. Our AdaStereo models achieve state-of-the-art cross-domain performance on multiple stereo benchmarks, including KITTI, Middlebury, ETH3D, and DrivingStereo.
arXiv Detail & Related papers (2020-04-09T16:15:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.