Self-supervised monocular depth estimation from oblique UAV videos
- URL: http://arxiv.org/abs/2012.10704v1
- Date: Sat, 19 Dec 2020 14:53:28 GMT
- Title: Self-supervised monocular depth estimation from oblique UAV videos
- Authors: Logambal Madhuanand, Francesco Nex, Michael Ying Yang
- Abstract summary: This paper aims to estimate depth from a single UAV aerial image using deep learning.
We propose a novel architecture with two 2D CNN encoders and a 3D CNN decoder for extracting information from consecutive temporal frames.
- Score: 8.876469413317341
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: UAVs have become an essential photogrammetric measurement platform as
they are affordable, easily accessible and versatile. Aerial images captured from
UAVs have applications in small and large scale texture mapping, 3D modelling,
object detection, DTM and DSM generation, etc. Photogrammetric techniques
are routinely used for 3D reconstruction from UAV images where multiple images
of the same scene are acquired. Developments in computer vision and deep
learning techniques have made Single Image Depth Estimation (SIDE) a field of
intense research. Using SIDE techniques on UAV images can overcome the need for
multiple images for 3D reconstruction. This paper aims to estimate depth from a
single UAV aerial image using deep learning. We follow a self-supervised
learning approach, Self-Supervised Monocular Depth Estimation (SMDE), which
does not need ground truth depth or any extra information other than images for
learning to estimate depth. Monocular video frames are used to train the deep
learning model, which learns depth and pose information jointly through two
different networks, one each for depth and pose. The predicted depth and pose
are used to reconstruct one image from the viewpoint of another image utilising
the temporal information from videos. We propose a novel architecture with two
2D CNN encoders and a 3D CNN decoder for extracting information from
consecutive temporal frames. A contrastive loss term is introduced for
improving the quality of image generation. Our experiments are carried out on
the public UAVid video dataset. The experimental results demonstrate that our
model outperforms the state-of-the-art methods in estimating depth.
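To make the self-supervision concrete, the reconstruction step described above is the standard differentiable inverse warp used in self-supervised depth learning. Below is a minimal PyTorch sketch, assuming known camera intrinsics K; the function names and the plain L1 error are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def inverse_warp(src_img, tgt_depth, T_tgt_to_src, K):
    """Resample a neighbouring frame to reconstruct the target view.

    src_img:      (B, 3, H, W) neighbouring video frame
    tgt_depth:    (B, 1, H, W) depth predicted for the target frame
    T_tgt_to_src: (B, 4, 4)    relative pose from the pose network
    K:            (B, 3, 3)    camera intrinsics (assumed known here)
    """
    B, _, H, W = src_img.shape
    dev, dt = src_img.device, src_img.dtype

    # Homogeneous pixel grid of the target frame
    ys, xs = torch.meshgrid(torch.arange(H, device=dev, dtype=dt),
                            torch.arange(W, device=dev, dtype=dt),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(1, 3, -1)

    # Back-project each pixel: X = depth * K^-1 * pixel
    cam = torch.linalg.inv(K) @ pix.expand(B, -1, -1)
    cam = cam * tgt_depth.reshape(B, 1, -1)

    # Rigidly move the points into the source camera and project them
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=dev, dtype=dt)], 1)
    proj = K @ (T_tgt_to_src @ cam_h)[:, :3]
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)

    # Normalise to [-1, 1] and bilinearly sample the source image
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border",
                         align_corners=True)

def photometric_loss(tgt_img, src_img, tgt_depth, T, K):
    # L1 error between the real target frame and its reconstruction
    return (tgt_img - inverse_warp(src_img, tgt_depth, T, K)).abs().mean()
```

Because the warp is differentiable, the photometric error back-propagates into both the depth and the pose network, so the two are trained jointly from video alone; the contrastive term the abstract mentions is sketched further below.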
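The pairing of two 2D CNN encoders with a 3D CNN decoder can be pictured as follows. This is a minimal sketch of the idea only; layer widths, depths and the output head are placeholders rather than the published architecture.

```python
import torch
import torch.nn as nn

class TwoFrameDepthNet(nn.Module):
    """Illustrative two-2D-encoder / 3D-decoder pairing.

    Each of two consecutive frames goes through its own 2D encoder; the
    feature maps are stacked along a new temporal axis so that the 3D
    convolution in the decoder can mix information across frames.
    """
    def __init__(self, feat=64):
        super().__init__()
        def encoder():
            return nn.Sequential(
                nn.Conv2d(3, feat, 7, stride=2, padding=3), nn.ReLU(inplace=True),
                nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )
        self.enc_t = encoder()       # encoder for the target frame
        self.enc_s = encoder()       # encoder for the neighbouring frame
        self.dec = nn.Sequential(    # 3D conv mixes the two temporal slices
            nn.Conv3d(feat, feat, kernel_size=(2, 3, 3), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Sequential(   # upsample back to input resolution
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(feat, 1, 3, padding=1), nn.Softplus(),  # positive depth
        )

    def forward(self, frame_t, frame_s):
        f_t, f_s = self.enc_t(frame_t), self.enc_s(frame_s)
        vol = torch.stack([f_t, f_s], dim=2)    # (B, C, T=2, H', W')
        fused = self.dec(vol).squeeze(2)        # collapse the temporal axis
        return self.head(fused)                 # (B, 1, H, W) depth map
```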
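The abstract does not spell out the form of its contrastive term, so the following InfoNCE-style loss between embeddings of the reconstructed and real target frames is purely an illustration of the general idea.

```python
import torch
import torch.nn.functional as F

def contrastive_image_loss(feat_warped, feat_target, tau=0.07):
    """Generic InfoNCE-style term over pooled feature vectors.

    feat_*: (B, D) embeddings of the reconstructed and real target
    frames (e.g. from a shared encoder). Matching pairs within the
    batch are positives; all other pairings serve as negatives.
    """
    a = F.normalize(feat_warped, dim=1)
    b = F.normalize(feat_target, dim=1)
    logits = a @ b.t() / tau                       # (B, B) similarities
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)
```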
Related papers
- Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation.
Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model.
Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z) - DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z) - Motion Degeneracy in Self-supervised Learning of Elevation Angle
Estimation for 2D Forward-Looking Sonar [4.683630397028384]
This study aims to realize stable self-supervised learning of elevation angle estimation without pretraining on synthetic images.
We first analyze the motion field of 2D forward-looking sonar, which is related to the main supervision signal.
arXiv Detail & Related papers (2023-07-30T08:06:11Z) - Lightweight Monocular Depth Estimation [4.19709743271943]
We create a lightweight model that predicts a depth value for each pixel from a single RGB image, built on the U-Net structure used in image segmentation networks.
The proposed method achieves relatively high accuracy and low root-mean-square error.
arXiv Detail & Related papers (2022-12-21T21:05:16Z) - Depth Is All You Need for Monocular 3D Detection [29.403235118234747]
We propose to align the depth representation with the target domain in an unsupervised fashion.
Our methods leverage commonly available LiDAR or RGB videos during training time to fine-tune the depth representation, which leads to improved 3D detectors.
arXiv Detail & Related papers (2022-10-05T18:12:30Z) - Towards Accurate Reconstruction of 3D Scene Shape from A Single
Monocular Image [91.71077190961688]
We propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image.
We then exploit 3D point cloud data to predict the depth shift and the camera's focal length, which allow us to recover 3D scene shape; a generic sketch of such scale-and-shift recovery appears after this list.
We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot evaluation.
arXiv Detail & Related papers (2022-08-28T16:20:14Z) - Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised.
Our method remarkably improves the detection performance of the state-of-the-art monocular method by 2.80% on the moderate test setting, without using extra data.
arXiv Detail & Related papers (2021-07-29T12:30:39Z) - Real-time dense 3D Reconstruction from monocular video data captured by
low-cost UAVs [0.3867363075280543]
Real-time 3D reconstruction enables fast dense mapping of the environment which benefits numerous applications, such as navigation or live evaluation of an emergency.
In contrast to most real-time capable approaches, our approach does not need an explicit depth sensor.
By exploiting the self-motion of the unmanned aerial vehicle (UAV) flying with oblique view around buildings, we estimate both camera trajectory and depth for selected images with enough novel content.
arXiv Detail & Related papers (2021-04-21T13:12:17Z) - Self-Attention Dense Depth Estimation Network for Unrectified Video
Sequences [6.821598757786515]
LiDAR and radar sensors are the hardware solution for real-time depth estimation.
Deep learning based self-supervised depth estimation methods have shown promising results.
We propose a self-attention based depth and ego-motion network for unrectified images.
arXiv Detail & Related papers (2020-05-28T21:53:53Z) - From Image Collections to Point Clouds with Self-supervised Shape and
Pose Networks [53.71440550507745]
Reconstructing 3D models from 2D images is one of the fundamental problems in computer vision.
We propose a deep learning technique for 3D object reconstruction from a single image.
We learn both 3D point cloud reconstruction and pose estimation networks in a self-supervised manner.
arXiv Detail & Related papers (2020-05-05T04:25:16Z) - Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled
Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)