MonoDVPS: A Self-Supervised Monocular Depth Estimation Approach to
Depth-aware Video Panoptic Segmentation
- URL: http://arxiv.org/abs/2210.07577v1
- Date: Fri, 14 Oct 2022 07:00:42 GMT
- Title: MonoDVPS: A Self-Supervised Monocular Depth Estimation Approach to
Depth-aware Video Panoptic Segmentation
- Authors: Andra Petrovai and Sergiu Nedevschi
- Abstract summary: We propose a novel solution with a multi-task network that performs monocular depth estimation and video panoptic segmentation.
We introduce panoptic-guided depth losses and a novel panoptic masking scheme for moving objects to avoid corrupting the training signal.
- Score: 3.2489082010225494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Depth-aware video panoptic segmentation tackles the inverse projection
problem of restoring panoptic 3D point clouds from video sequences, where the
3D points are augmented with semantic classes and temporally consistent
instance identifiers. We propose a novel solution with a multi-task network
that performs monocular depth estimation and video panoptic segmentation.
Since acquiring ground-truth labels for both depth and image segmentation is
costly, we leverage unlabeled video sequences through self-supervised
monocular depth estimation and semi-supervised learning from pseudo-labels
for video panoptic segmentation. To further improve the depth
prediction, we introduce panoptic-guided depth losses and a novel panoptic
masking scheme for moving objects to avoid corrupting the training signal.
Extensive experiments on the Cityscapes-DVPS and SemKITTI-DVPS datasets
demonstrate that our model with the proposed improvements achieves competitive
results and fast inference speed.
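Concretely, the self-supervised depth branch in this family of methods trains on a view-synthesis (photometric) loss, and the panoptic masking scheme gates that loss so moving instances do not corrupt the training signal. Below is a minimal sketch of that idea, assuming a Monodepth2-style SSIM + L1 weighting; the function names, tensor shapes, and the alpha value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def simplified_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel dissimilarity (1 - SSIM) / 2 using 3x3 average-pool windows."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim_n = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    ssim_d = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - ssim_n / ssim_d) / 2, 0, 1).mean(1, keepdim=True)

def masked_photometric_loss(target, warped, moving_mask, alpha=0.85):
    """Photometric loss between the target frame and a source frame warped
    into the target view, gated so that pixels inside panoptic moving-object
    masks do not contribute to the self-supervised training signal.

    target, warped: (B, 3, H, W) images in [0, 1].
    moving_mask:    (B, 1, H, W), 1 where a panoptic instance is judged to
                    be moving (excluded), 0 on static regions.
    The SSIM/L1 blend is a common choice in self-supervised depth
    estimation; it is an assumption here, not MonoDVPS's published loss.
    """
    l1 = (target - warped).abs().mean(1, keepdim=True)   # (B, 1, H, W)
    photo = alpha * simplified_ssim(target, warped) + (1 - alpha) * l1
    static = 1.0 - moving_mask                           # keep static pixels only
    return (photo * static).sum() / static.sum().clamp(min=1.0)
```

Normalizing by the number of unmasked pixels (rather than by H*W) keeps the loss scale stable regardless of how much of the frame the moving objects cover.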
Related papers
- Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation.
Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model.
Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z)
- A Simple Baseline for Supervised Surround-view Depth Estimation [25.81521612343612]
We propose S3Depth, a Simple Baseline for Supervised Surround-view Depth Estimation.
We employ a global-to-local feature extraction module which combines CNN with transformer layers for enriched representations.
Our method achieves superior performance over existing state-of-the-art methods on both DDAD and nuScenes datasets.
arXiv Detail & Related papers (2023-03-14T10:06:19Z)
- SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
It relies on the multi-view consistency assumption to train the networks; however, this assumption is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model to generate a single-image depth prior.
Our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z)
- PanopticDepth: A Unified Framework for Depth-aware Panoptic Segmentation [41.85216306978024]
We propose a unified framework for depth-aware panoptic segmentation (DPS).
We generate instance-specific kernels to predict depth and segmentation masks for each instance (a minimal kernel sketch follows this list).
We add instance-level depth cues that help supervise depth learning via a new depth loss.
arXiv Detail & Related papers (2022-06-01T13:00:49Z)
- PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation [90.26723865198348]
We present PolyphonicFormer, a vision transformer that unifies all the sub-tasks of DVPS.
Our method explores the relationship between depth estimation and panoptic segmentation via query-based learning.
Our method ranks 1st on the ICCV-2021 BMTT Challenge video + depth track.
arXiv Detail & Related papers (2021-12-05T14:31:47Z)
- Consistent Depth of Moving Objects in Video [52.72092264848864]
We present a method to estimate the depth of a dynamic scene, containing arbitrary moving objects, from an ordinary video captured with a moving camera.
We formulate this objective in a new test-time training framework where a depth-prediction CNN is trained in tandem with an auxiliary scene-flow prediction over the entire input video.
We demonstrate accurate and temporally coherent results on a variety of challenging videos containing diverse moving objects (pets, people, cars) as well as camera motion.
arXiv Detail & Related papers (2021-08-02T20:53:18Z)
- Unsupervised Monocular Depth Reconstruction of Non-Rigid Scenes [87.91841050957714]
We present an unsupervised monocular framework for dense depth estimation of dynamic scenes.
We derive a training objective that aims to opportunistically preserve pairwise distances between reconstructed 3D points.
Our method provides promising results, demonstrating its capability of reconstructing 3D from challenging videos of non-rigid scenes.
arXiv Detail & Related papers (2020-12-31T16:02:03Z)
- ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation [31.078913193966585]
We present ViP-DeepLab, a unified model attempting to tackle the long-standing and challenging inverse projection problem in vision.
ViP-DeepLab approaches it by jointly performing monocular depth estimation and video panoptic segmentation.
On the individual sub-tasks, ViP-DeepLab achieves state-of-the-art results, outperforming previous methods by 5.1% VPQ on Cityscapes-VPS, ranking 1st on the KITTI monocular depth estimation benchmark, and 1st on KITTI MOTS pedestrian.
arXiv Detail & Related papers (2020-12-09T19:00:35Z)
- Self-Attention Dense Depth Estimation Network for Unrectified Video Sequences [6.821598757786515]
LiDAR and radar sensors are the standard hardware solutions for real-time depth estimation.
Deep learning based self-supervised depth estimation methods have shown promising results.
We propose a self-attention based depth and ego-motion network for unrectified images.
arXiv Detail & Related papers (2020-05-28T21:53:53Z)
- Improving Semantic Segmentation through Spatio-Temporal Consistency Learned from Videos [39.25927216187176]
We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve single-image semantic segmentation.
The predicted depth, egomotion, and camera intrinsics are used to provide an additional supervision signal to the segmentation model.
arXiv Detail & Related papers (2020-04-11T07:09:29Z)
- Video Depth Estimation by Fusing Flow-to-Depth Proposals [65.24533384679657]
We present an approach with a differentiable flow-to-depth layer for video depth estimation.
The model consists of a flow-to-depth layer, a camera pose refinement module, and a depth fusion network.
Our approach outperforms state-of-the-art depth estimation methods and has reasonable cross-dataset generalization capability.
arXiv Detail & Related papers (2019-12-30T10:45:57Z)
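For the PanopticDepth entry above, here is a minimal sketch of the instance-specific dynamic-kernel idea: one predicted kernel pair per instance, applied as 1x1 convolutions to a shared feature map to produce that instance's mask and depth. All names, shapes, and the sigmoid depth normalization are illustrative assumptions, not the paper's implementation.

```python
import torch

def apply_instance_kernels(features, kernels):
    """Generate per-instance mask logits and depth maps with dynamic kernels.

    features: (B, C, H, W) shared feature map from the decoder.
    kernels:  (B, N, 2 * C), one predicted kernel pair per instance:
              the first C channels produce the mask, the last C the depth.
    Returns mask logits (B, N, H, W) and per-instance depth (B, N, H, W).
    This mirrors the dynamic-kernel idea only at a high level.
    """
    b, c, h, w = features.shape
    mask_k, depth_k = kernels[..., :c], kernels[..., c:]         # (B, N, C) each
    flat = features.flatten(2)                                    # (B, C, H*W)
    mask_logits = torch.bmm(mask_k, flat).view(b, -1, h, w)       # 1x1 conv per instance
    depth = torch.bmm(depth_k, flat).view(b, -1, h, w).sigmoid()  # normalized depth
    return mask_logits, depth
```

Because the kernels are predicted per instance, the same pixel can receive a different depth hypothesis from each instance, and the final depth map is assembled from whichever instance wins the mask competition at that pixel.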