SfM-TTR: Using Structure from Motion for Test-Time Refinement of
Single-View Depth Networks
- URL: http://arxiv.org/abs/2211.13551v2
- Date: Fri, 31 Mar 2023 11:37:12 GMT
- Title: SfM-TTR: Using Structure from Motion for Test-Time Refinement of
Single-View Depth Networks
- Authors: Sergio Izquierdo, Javier Civera
- Abstract summary: We propose a novel test-time refinement (TTR) method, denoted as SfM-TTR, to boost the performance of single-view depth networks at test time.
Specifically, and differently from the state of the art, we use sparse SfM point clouds as a test-time self-supervisory signal.
Our results show how adding SfM-TTR to several state-of-the-art self-supervised and supervised networks significantly improves their performance.
- Score: 13.249453757295086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating a dense depth map from a single view is geometrically ill-posed,
and state-of-the-art methods rely on learning depth's relation with visual
appearance using deep neural networks. On the other hand, Structure from Motion
(SfM) leverages multi-view constraints to produce very accurate but sparse
maps, as matching across images is typically limited by locally discriminative
texture. In this work, we combine the strengths of both approaches by proposing
a novel test-time refinement (TTR) method, denoted as SfM-TTR, that boosts the
performance of single-view depth networks at test time using SfM multi-view
cues. Specifically, and differently from the state of the art, we use sparse
SfM point clouds as a test-time self-supervisory signal, fine-tuning the network
encoder to learn a better representation of the test scene. Our results show
how the addition of SfM-TTR to several state-of-the-art self-supervised and
supervised networks significantly improves their performance, outperforming
previous TTR baselines mainly based on photometric multi-view consistency. The
code is available at https://github.com/serizba/SfM-TTR.
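The abstract only outlines the refinement procedure, so the following is a minimal sketch of what an SfM-based test-time refinement loop could look like. It assumes the sparse SfM point cloud has already been rendered into per-image sparse depth maps (zeros where no point projects), that the network exposes an `encoder` attribute, and that a median-based scale alignment with a masked L1 loss is used; these names and choices are illustrative assumptions, not the authors' exact implementation.
```python
import torch
import torch.nn.functional as F


def sfm_ttr_refine(model, images, sparse_depths, steps=50, lr=1e-5):
    """Fine-tune only the encoder against sparse SfM depths of the test scene."""
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model.encoder.parameters():          # assumed attribute of the network
        p.requires_grad_(True)
    opt = torch.optim.Adam(model.encoder.parameters(), lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0
        for img, sparse in zip(images, sparse_depths):
            pred = model(img.unsqueeze(0)).squeeze(0)   # dense single-view prediction
            mask = sparse > 0                           # pixels hit by SfM points
            # Align the (up-to-scale) prediction to the SfM reconstruction's scale.
            scale = torch.median(sparse[mask]) / torch.median(pred[mask])
            loss = loss + F.l1_loss(scale * pred[mask], sparse[mask])
        (loss / len(images)).backward()
        opt.step()
    return model
```
Freezing the decoder and updating only the encoder follows the abstract's description of fine-tuning the encoder to learn a better representation of the test scene.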
Related papers
- MEDeA: Multi-view Efficient Depth Adjustment [45.90423821963144]
MEDeA is an efficient multi-view test-time depth adjustment method that is an order of magnitude faster than existing test-time approaches.
Our method sets a new state-of-the-art on TUM RGB-D, 7Scenes, and ScanNet benchmarks and successfully handles smartphone-captured data from ARKitScenes dataset.
arXiv Detail & Related papers (2024-06-17T19:39:13Z)
- Pushing the Efficiency Limit Using Structured Sparse Convolutions [82.31130122200578]
We propose Structured Sparse Convolution (SSC), which leverages the inherent structure in images to reduce the parameters in the convolutional filter.
We show that SSC is a generalization of commonly used layers (depthwise, groupwise and pointwise convolution) in efficient architectures.
Architectures based on SSC achieve state-of-the-art performance compared to baselines on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet classification benchmarks.
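As a rough illustration of the claim that SSC generalizes depthwise, groupwise and pointwise convolutions, the sketch below recovers those layers as structured sparsity masks applied to a full convolution weight; the masking scheme is an assumption for illustration and is not SSC's actual parameterization.
```python
import torch
import torch.nn.functional as F


def structured_mask(out_ch, in_ch, k, kind="depthwise", groups=4):
    """Binary mask over a full conv weight of shape (out_ch, in_ch, k, k)."""
    mask = torch.zeros(out_ch, in_ch, k, k)
    if kind == "pointwise":
        mask[:, :, k // 2, k // 2] = 1            # only the centre tap: a 1x1 conv
    elif kind == "depthwise":
        for c in range(min(out_ch, in_ch)):
            mask[c, c] = 1                        # each filter sees a single channel
    elif kind == "groupwise":
        per_out, per_in = out_ch // groups, in_ch // groups
        for g in range(groups):
            mask[g * per_out:(g + 1) * per_out,
                 g * per_in:(g + 1) * per_in] = 1  # block-diagonal channel pattern
    return mask


# Masking a dense weight reproduces the corresponding efficient layer.
weight = torch.randn(8, 8, 3, 3) * structured_mask(8, 8, 3, kind="depthwise")
x = torch.randn(1, 8, 32, 32)
y = F.conv2d(x, weight, padding=1)                # behaves like a depthwise conv
```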
arXiv Detail & Related papers (2022-10-23T18:37:22Z)
- Self-distilled Feature Aggregation for Self-supervised Monocular Depth Estimation [11.929584800629673]
We propose the Self-Distilled Feature Aggregation (SDFA) module for simultaneously aggregating a pair of low-scale and high-scale features.
We propose an SDFA-based network for self-supervised monocular depth estimation, and design a self-distilled training strategy to train the proposed network.
Experimental results on the KITTI dataset demonstrate that the proposed method outperforms the comparative state-of-the-art methods in most cases.
arXiv Detail & Related papers (2022-09-15T07:00:52Z)
- TANDEM: Tracking and Dense Mapping in Real-time using Deep Multi-view Stereo [55.30992853477754]
We present TANDEM, a real-time monocular tracking and dense mapping framework.
For pose estimation, TANDEM performs photometric bundle adjustment based on a sliding window of alignments.
TANDEM shows state-of-the-art real-time 3D reconstruction performance.
arXiv Detail & Related papers (2021-11-14T19:01:02Z)
- VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction [71.83308989022635]
In this paper, we advocate that replicating the traditional two-stage framework with deep neural networks improves both the interpretability and the accuracy of the results.
Our network operates in two steps: 1) the local computation of depth maps with a deep MVS technique, and 2) the fusion of the depth maps and image features to build a single TSDF volume.
In order to improve the matching performance between images acquired from very different viewpoints, we introduce a rotation-invariant 3D convolution kernel called PosedConv.
arXiv Detail & Related papers (2021-08-19T11:33:58Z)
- Monocular Depth Parameterizing Networks [15.791732557395552]
We propose a network structure that provides a parameterization of a set of depth maps with feasible shapes.
This allows us to search the shapes for a photo consistent solution with respect to other images.
Our experimental evaluation shows that our method generates more accurate depth maps and generalizes better than competing state-of-the-art approaches.
arXiv Detail & Related papers (2020-12-21T13:02:41Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply an attention mechanism to the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z)
- MSDPN: Monocular Depth Prediction with Partial Laser Observation using Multi-stage Neural Networks [1.1602089225841632]
We propose a deep-learning-based multi-stage network architecture called the Multi-Stage Depth Prediction Network (MSDPN).
MSDPN predicts a dense depth map from a 2D LiDAR and a monocular camera.
As verified experimentally, our network yields promising performance against state-of-the-art methods.
arXiv Detail & Related papers (2020-08-04T08:27:40Z)
- Single Image Depth Estimation Trained via Depth from Defocus Cues [105.67073923825842]
Estimating depth from a single RGB image is a fundamental task in computer vision.
In this work, we rely on depth-from-defocus cues instead of different views.
We present results that are on par with supervised methods on KITTI and Make3D datasets and outperform unsupervised learning approaches.
arXiv Detail & Related papers (2020-01-14T20:22:54Z)