ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic
Segmentation
- URL: http://arxiv.org/abs/2012.05258v1
- Date: Wed, 9 Dec 2020 19:00:35 GMT
- Title: ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic
Segmentation
- Authors: Siyuan Qiao, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
- Abstract summary: We present ViP-DeepLab, a unified model attempting to tackle the long-standing and challenging inverse projection problem in vision.
ViP-DeepLab approaches it by jointly performing monocular depth estimation and video panoptic segmentation.
On the individual sub-tasks, ViP-DeepLab achieves state-of-the-art results, outperforming previous methods by 5.1% VPQ on Cityscapes-VPS, ranking 1st on the KITTI monocular depth estimation benchmark, and 1st on KITTI MOTS pedestrian.
- Score: 31.078913193966585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present ViP-DeepLab, a unified model attempting to tackle
the long-standing and challenging inverse projection problem in vision, which
we model as restoring the point clouds from perspective image sequences while
providing each point with instance-level semantic interpretations. Solving this
problem requires the vision models to predict the spatial location, semantic
class, and temporally consistent instance label for each 3D point. ViP-DeepLab
approaches it by jointly performing monocular depth estimation and video
panoptic segmentation. We name this joint task as Depth-aware Video Panoptic
Segmentation, and propose a new evaluation metric along with two derived
datasets for it, which will be made available to the public. On the individual
sub-tasks, ViP-DeepLab also achieves state-of-the-art results, outperforming
previous methods by 5.1% VPQ on Cityscapes-VPS, ranking 1st on the KITTI
monocular depth estimation benchmark, and 1st on KITTI MOTS pedestrian. The
datasets and the evaluation codes are made publicly available.
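The inverse projection problem described in the abstract amounts to lifting every pixel, given its predicted depth and its predicted panoptic label (semantic class plus temporally consistent instance id), back into a labeled 3D point cloud. Below is a minimal illustrative sketch of that lifting step, assuming a standard pinhole camera with known intrinsics; the function, array layout, and the class*1000+instance label encoding are assumptions for illustration and are not taken from the ViP-DeepLab codebase.

```python
import numpy as np

def backproject_panoptic(depth, panoptic, fx, fy, cx, cy):
    """Lift per-pixel depth into a 3D point cloud and attach each pixel's
    panoptic label (semantic class + instance id).

    depth:    (H, W) float array of metric depth predictions
    panoptic: (H, W) int array encoding semantic class and instance id
    fx, fy, cx, cy: pinhole camera intrinsics
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Inverse pinhole projection: pixel (u, v) with depth z -> point (x, y, z).
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    labels = panoptic.reshape(-1)

    valid = points[:, 2] > 0  # drop pixels without a valid depth
    return points[valid], labels[valid]

# Toy example: a 4x4 frame with constant 10 m depth and a single instance,
# using an assumed label encoding of semantic_class * 1000 + instance_id.
depth = np.full((4, 4), 10.0)
panoptic = np.full((4, 4), 13 * 1000 + 2)
pts, lbls = backproject_panoptic(depth, panoptic, fx=720.0, fy=720.0, cx=2.0, cy=2.0)
print(pts.shape, lbls[:3])
```

Running this per frame with temporally consistent instance ids yields the instance-level semantic point clouds the task asks for; the joint model's contribution is producing the depth and panoptic inputs accurately and consistently over time.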
Related papers
- SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for
Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
It relies on the multi-view consistency assumption for training networks; however, this assumption is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model for generating single-image depth prior.
Our model can predict sharp and accurate depth maps, even when training from monocular videos of highly dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z) - MonoDVPS: A Self-Supervised Monocular Depth Estimation Approach to
Depth-aware Video Panoptic Segmentation [3.2489082010225494]
We propose a novel solution with a multi-task network that performs monocular depth estimation and video panoptic segmentation.
We introduce panoptic-guided depth losses and a novel panoptic masking scheme for moving objects to avoid corrupting the training signal.
arXiv Detail & Related papers (2022-10-14T07:00:42Z) - PanopticDepth: A Unified Framework for Depth-aware Panoptic Segmentation [41.85216306978024]
We propose a unified framework for depth-aware panoptic segmentation (DPS).
We generate instance-specific kernels to predict depth and segmentation masks for each instance.
We add instance-level depth cues to assist in supervising depth learning via a new depth loss.
arXiv Detail & Related papers (2022-06-01T13:00:49Z) - PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic
Segmentation [90.26723865198348]
We present PolyphonicFormer, a vision transformer to unify all the sub-tasks under the DVPS task.
Our method explores the relationship between depth estimation and panoptic segmentation via query-based learning.
Our method ranks 1st on the ICCV-2021 BMTT Challenge video + depth track.
arXiv Detail & Related papers (2021-12-05T14:31:47Z) - Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together.
We show that our per-frame computation model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z) - Sparse Auxiliary Networks for Unified Monocular Depth Prediction and
Completion [56.85837052421469]
Estimating scene geometry from data obtained with cost-effective sensors is key for robots and self-driving cars.
In this paper, we study the problem of predicting dense depth from a single RGB image with optional sparse measurements from low-cost active depth sensors.
We introduce Sparse Auxiliary Networks (SANs), a new module enabling monodepth networks to perform both depth prediction and completion.
arXiv Detail & Related papers (2021-03-30T21:22:26Z) - Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection
Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z) - Monocular 3D Object Detection with Sequential Feature Association and
Depth Hint Augmentation [12.55603878441083]
FADNet is presented to address the task of monocular 3D object detection.
A dedicated depth hint module is designed to generate row-wise features called depth hints.
The contributions of this work are validated through experiments and an ablation study on the KITTI benchmark.
arXiv Detail & Related papers (2020-11-30T07:19:14Z) - Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of panoptic segmentation, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.