PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic
Segmentation
- URL: http://arxiv.org/abs/2112.02582v1
- Date: Sun, 5 Dec 2021 14:31:47 GMT
- Title: PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic
Segmentation
- Authors: Haobo Yuan, Xiangtai Li, Yibo Yang, Guangliang Cheng, Jing Zhang,
Yunhai Tong, Lefei Zhang, Dacheng Tao
- Abstract summary: We present PolyphonicFormer, a vision transformer to unify all the sub-tasks under the DVPS task.
Our method explores the relationship between depth estimation and panoptic segmentation via query-based learning.
Our method ranks 1st on the ICCV-2021 BMTT Challenge video + depth track.
- Score: 90.26723865198348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently proposed Depth-aware Video Panoptic Segmentation (DVPS) aims to
predict panoptic segmentation results and depth maps in a video, which is a
challenging scene understanding problem. In this paper, we present
PolyphonicFormer, a vision transformer to unify all the sub-tasks under the
DVPS task. Our method explores the relationship between depth estimation and
panoptic segmentation via query-based learning. In particular, we design three
different queries including thing query, stuff query, and depth query. Then we
propose to learn the correlations among these queries via gated fusion. Experiments
demonstrate the benefits of our design for both depth estimation and panoptic
segmentation. Since each thing query also encodes instance-wise information, it is
natural to perform tracking by cropping instance mask features and applying
appearance learning. Our method ranks 1st on the
ICCV-2021 BMTT Challenge video + depth track. Ablation studies are reported to
show how we improve the performance. Code will be available at
https://github.com/HarborYuan/PolyphonicFormer.
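To make the query design concrete, below is a minimal sketch of how gated fusion between panoptic (thing/stuff) queries and depth queries could look in PyTorch. The module name GatedQueryFusion, the sigmoid-gate construction, and the query counts are illustrative assumptions, not the paper's actual implementation (see the linked repository for that).

```python
# Minimal sketch (not the authors' code) of gated fusion between panoptic and
# depth queries; module and parameter names are illustrative assumptions.
import torch
import torch.nn as nn


class GatedQueryFusion(nn.Module):
    """Fuse thing/stuff (panoptic) queries with depth queries via a learned gate."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, panoptic_q: torch.Tensor, depth_q: torch.Tensor) -> torch.Tensor:
        # panoptic_q, depth_q: (num_queries, dim) for one frame.
        g = self.gate(torch.cat([panoptic_q, depth_q], dim=-1))  # element-wise gate in [0, 1]
        # Depth queries are refined by gated information from the panoptic queries.
        return depth_q + g * self.proj(panoptic_q)


if __name__ == "__main__":
    dim, n_thing, n_stuff = 256, 100, 50
    thing_q = torch.randn(n_thing, dim)            # one query per candidate instance ("thing")
    stuff_q = torch.randn(n_stuff, dim)            # one query per amorphous region class ("stuff")
    depth_q = torch.randn(n_thing + n_stuff, dim)  # paired depth queries
    fused = GatedQueryFusion(dim)(torch.cat([thing_q, stuff_q], dim=0), depth_q)
    print(fused.shape)  # torch.Size([150, 256])
```

In this reading, the gate lets the depth branch decide, per channel, how much panoptic evidence to absorb, which is one plausible way the cross-task correlation described in the abstract could be realized.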
Related papers
- Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning [63.63516124646916]
We propose a deeply unified framework for depth-aware panoptic segmentation.
We propose a bi-directional guidance learning approach to facilitate cross-task feature learning.
Our method sets the new state of the art for depth-aware panoptic segmentation on both Cityscapes-DVPS and SemKITTI-DVPS datasets.
arXiv Detail & Related papers (2023-07-27T11:28:33Z)
- NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation [58.21817572577012]
Video depth estimation aims to infer temporally consistent depth.
We introduce NVDS+, which stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner.
We also build a large-scale Video Depth in the Wild dataset, which contains 14,203 videos with over two million frames.
arXiv Detail & Related papers (2023-07-17T17:57:01Z)
- PanDepth: Joint Panoptic Segmentation and Depth Completion [19.642115764441016]
We propose a multi-task model for panoptic segmentation and depth completion using RGB images and sparse depth maps.
Our model successfully predicts fully dense depth maps and performs semantic segmentation, instance segmentation, and panoptic segmentation for every input frame.
arXiv Detail & Related papers (2022-12-29T05:37:38Z)
- MonoDVPS: A Self-Supervised Monocular Depth Estimation Approach to Depth-aware Video Panoptic Segmentation [3.2489082010225494]
We propose a novel solution with a multi-task network that performs monocular depth estimation and video panoptic segmentation.
We introduce panoptic-guided depth losses and a novel panoptic masking scheme for moving objects to avoid corrupting the training signal.
arXiv Detail & Related papers (2022-10-14T07:00:42Z)
- JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes [75.20435924081585]
JPerceiver can simultaneously estimate scale-aware depth, visual odometry (VO), and bird's-eye-view (BEV) layout from a monocular video sequence.
It exploits the cross-view geometric transformation (CGT) to propagate the absolute scale from the road layout to depth and VO.
Experiments on Argoverse, nuScenes, and KITTI show the superiority of JPerceiver over existing methods on all three tasks.
arXiv Detail & Related papers (2022-07-16T10:33:59Z)
- PanopticDepth: A Unified Framework for Depth-aware Panoptic Segmentation [41.85216306978024]
We propose a unified framework for depth-aware panoptic segmentation (DPS).
We generate instance-specific kernels to predict depth and segmentation masks for each instance.
We add additional instance-level depth cues to assist with supervising the depth learning via a new depth loss.
arXiv Detail & Related papers (2022-06-01T13:00:49Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation [31.078913193966585]
We present ViP-DeepLab, a unified model attempting to tackle the long-standing and challenging inverse projection problem in vision.
ViP-DeepLab approaches it by jointly performing monocular depth estimation and video panoptic segmentation.
On the individual sub-tasks, ViP-DeepLab achieves state-of-the-art results, outperforming previous methods by 5.1% VPQ on Cityscapes-VPS, ranking 1st on the KITTI monocular depth estimation benchmark, and 1st on KITTI MOTS pedestrian.
arXiv Detail & Related papers (2020-12-09T19:00:35Z)
- Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of image panoptic segmentation, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.