Mask4Former: Mask Transformer for 4D Panoptic Segmentation
- URL: http://arxiv.org/abs/2309.16133v2
- Date: Wed, 10 Apr 2024 21:19:33 GMT
- Title: Mask4Former: Mask Transformer for 4D Panoptic Segmentation
- Authors: Kadir Yilmaz, Jonas Schult, Alexey Nekrasov, Bastian Leibe,
- Abstract summary: Mask4Former is the first transformer-based approach unifying semantic instance segmentation and tracking.
Our model directly predicts semantic instances their temporal associations without relying on hand-crafted non-learned association strategies.
Mask4Former achieves a new state-of-the-art on the SemanticTITI test set with a score of 68.4 LSTQ.
- Score: 13.99703660936949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurately perceiving and tracking instances over time is essential for the decision-making processes of autonomous agents interacting safely in dynamic environments. With this intention, we propose Mask4Former for the challenging task of 4D panoptic segmentation of LiDAR point clouds. Mask4Former is the first transformer-based approach unifying semantic instance segmentation and tracking of sparse and irregular sequences of 3D point clouds into a single joint model. Our model directly predicts semantic instances and their temporal associations without relying on hand-crafted non-learned association strategies such as probabilistic clustering or voting-based center prediction. Instead, Mask4Former introduces spatio-temporal instance queries that encode the semantic and geometric properties of each semantic tracklet in the sequence. In an in-depth study, we find that promoting spatially compact instance predictions is critical as spatio-temporal instance queries tend to merge multiple semantically similar instances, even if they are spatially distant. To this end, we regress 6-DOF bounding box parameters from spatio-temporal instance queries, which are used as an auxiliary task to foster spatially compact predictions. Mask4Former achieves a new state-of-the-art on the SemanticKITTI test set with a score of 68.4 LSTQ.
Related papers
- Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend the SAM to Few-shot Semantic segmentation (FSS)
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z) - Coordinate Transformer: Achieving Single-stage Multi-person Mesh
Recovery from Videos [91.44553585470688]
Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond.
We propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner.
Experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively.
arXiv Detail & Related papers (2023-08-20T18:23:07Z) - Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence.
We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps.
We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z) - 4D Panoptic Segmentation as Invariant and Equivariant Field Prediction [48.57732508537554]
We develop rotation-equivariant neural networks for 4D panoptic segmentation.
We show that our models achieve higher accuracy with lower computational costs compared to their non-equivariant counterparts.
Our method sets the new state-of-the-art performance and achieves 1st place on the Semantic KITTITI 4D Panoptic leaderboard.
arXiv Detail & Related papers (2023-03-28T00:20:37Z) - Sparse Instance Activation for Real-Time Instance Segmentation [72.23597664935684]
We propose a conceptually novel, efficient, and fully convolutional framework for real-time instance segmentation.
SparseInst has extremely fast inference speed and achieves 40 FPS and 37.9 AP on the COCO benchmark.
arXiv Detail & Related papers (2022-03-24T03:15:39Z) - 4D Panoptic LiDAR Segmentation [27.677435778317054]
We propose 4D panoptic LiDAR segmentation to assign a semantic class and a temporally-consistent instance ID to a sequence of 3D points.
Inspired by recent advances in benchmarking of multi-object tracking, we propose to adopt a new evaluation metric that separates the semantic and point-to-instance association of the task.
arXiv Detail & Related papers (2021-02-24T18:56:16Z) - PointINS: Point-based Instance Segmentation [117.38579097923052]
Mask representation in instance segmentation with Point-of-Interest (PoI) features is challenging because learning a high-dimensional mask feature for each instance requires a heavy computing burden.
We propose an instance-aware convolution, which decomposes this mask representation learning task into two tractable modules.
Along with instance-aware convolution, we propose PointINS, a simple and practical instance segmentation approach.
arXiv Detail & Related papers (2020-03-13T08:24:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.