2D-3D Interlaced Transformer for Point Cloud Segmentation with
Scene-Level Supervision
- URL: http://arxiv.org/abs/2310.12817v2
- Date: Mon, 22 Jan 2024 09:44:18 GMT
- Title: 2D-3D Interlaced Transformer for Point Cloud Segmentation with
Scene-Level Supervision
- Authors: Cheng-Kun Yang, Min-Hung Chen, Yung-Yu Chuang, Yen-Yu Lin
- Abstract summary: We propose a transformer model with two encoders and one decoder for weakly supervised point cloud segmentation.
The decoder implements 2D-3D cross-attention and carries out implicit 2D and 3D feature fusion.
Experiments show that it performs favorably against existing weakly supervised point cloud segmentation methods.
- Score: 36.282611420496416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a Multimodal Interlaced Transformer (MIT) that jointly considers
2D and 3D data for weakly supervised point cloud segmentation. Research studies
have shown that 2D and 3D features are complementary for point cloud
segmentation. However, existing methods require extra 2D annotations to achieve
2D-3D information fusion. Considering the high annotation cost of point clouds,
effective 2D and 3D feature fusion based on weakly supervised learning is in
great demand. To this end, we propose a transformer model with two encoders and
one decoder for weakly supervised point cloud segmentation using only
scene-level class tags. Specifically, the two encoders compute the
self-attended features for 3D point clouds and 2D multi-view images,
respectively. The decoder implements interlaced 2D-3D cross-attention and
carries out implicit 2D and 3D feature fusion. We alternately switch the roles
of queries and key-value pairs in the decoder layers. It turns out that the 2D
and 3D features are iteratively enriched by each other. Experiments show that
it performs favorably against existing weakly supervised point cloud
segmentation methods by a large margin on the S3DIS and ScanNet benchmarks. The
project page will be available at https://jimmy15923.github.io/mit_web/.
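
To make the interlaced decoding and the scene-level supervision concrete, below is a minimal PyTorch-style sketch, assuming a shared feature dimension and standard multi-head attention: one pair of decoder layers alternates the roles of queries and key-value pairs between the 3D point tokens and the 2D multi-view image tokens, and a max-pooled multi-label loss stands in for the weak supervision from scene-level class tags. Module names, dimensions, the pooling choice, and the loss are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterlacedDecoderPair(nn.Module):
    """One interlaced pair of cross-attention layers (illustrative sketch).

    Layer A: 3D point tokens act as queries, 2D image tokens as keys/values.
    Layer B: the roles are switched, so 2D tokens are refined by the updated 3D tokens.
    Stacking such pairs lets the two modalities iteratively enrich each other.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn_3d_from_2d = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_2d_from_3d = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_3d = nn.LayerNorm(dim)
        self.norm_2d = nn.LayerNorm(dim)
        self.ffn_3d = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_2d = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, feat_3d: torch.Tensor, feat_2d: torch.Tensor):
        # feat_3d: (B, N_points, C) self-attended 3D encoder features
        # feat_2d: (B, N_pixels, C) self-attended multi-view 2D encoder features

        # Layer A: 3D queries attend to 2D keys/values.
        upd_3d, _ = self.attn_3d_from_2d(query=feat_3d, key=feat_2d, value=feat_2d)
        feat_3d = self.norm_3d(feat_3d + upd_3d)
        feat_3d = feat_3d + self.ffn_3d(feat_3d)

        # Layer B: roles switched, 2D queries attend to the refined 3D keys/values.
        upd_2d, _ = self.attn_2d_from_3d(query=feat_2d, key=feat_3d, value=feat_3d)
        feat_2d = self.norm_2d(feat_2d + upd_2d)
        feat_2d = feat_2d + self.ffn_2d(feat_2d)
        return feat_3d, feat_2d


def scene_level_loss(point_logits: torch.Tensor, scene_tags: torch.Tensor) -> torch.Tensor:
    """Weak supervision from scene-level class tags (assumed MIL-style formulation).

    point_logits: (B, N_points, num_classes) per-point class logits
    scene_tags:   (B, num_classes) multi-hot vector of classes present in the scene
    """
    # Pool per-point logits to one scene-level prediction per class (max over points),
    # then apply a multi-label classification loss against the scene tags.
    scene_logits = point_logits.max(dim=1).values
    return F.binary_cross_entropy_with_logits(scene_logits, scene_tags.float())
```

In this reading, stacking several such pairs interleaves the two directions of cross-attention, and the scene-level tags provide the only supervision signal; the actual MIT decoder may differ in layer ordering, pooling, and auxiliary losses.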
Related papers
- ODIN: A Single Model for 2D and 3D Segmentation [34.612953668151036]
ODIN is a model that segments and labels both 2D RGB images and 3D point clouds.
It achieves state-of-the-art performance on ScanNet200, Matterport3D and AI2THOR 3D segmentation benchmarks.
arXiv Detail & Related papers (2024-01-04T18:59:25Z) - CMDFusion: Bidirectional Fusion Network with Cross-modality Knowledge
Distillation for LIDAR Semantic Segmentation [44.44327357717908]
2D RGB images and 3D LIDAR point clouds provide complementary knowledge for the perception system of autonomous vehicles.
Several 2D and 3D fusion methods have been explored for the LIDAR semantic segmentation task, but they suffer from different problems.
We propose a Bidirectional Fusion Network with Cross-Modality Knowledge Distillation (CMDFusion) in this work.
arXiv Detail & Related papers (2023-07-09T04:24:12Z) - Prototype Adaption and Projection for Few- and Zero-shot 3D Point Cloud
Semantic Segmentation [30.18333233940194]
We address the challenging task of few-shot and zero-shot 3D point cloud semantic segmentation.
Our proposed method surpasses state-of-the-art algorithms by a considerable 7.90% and 14.82% under the 2-way 1-shot setting on S3DIS and ScanNet benchmarks, respectively.
arXiv Detail & Related papers (2023-05-23T17:58:05Z) - EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder [60.52613206271329]
This paper introduces Efficient Point Cloud Learning (EPCL) for training high-quality point cloud models with a frozen CLIP transformer.
Our EPCL connects the 2D and 3D modalities by semantically aligning the image features and point cloud features without paired 2D-3D data.
arXiv Detail & Related papers (2022-12-08T06:27:11Z) - Sparse2Dense: Learning to Densify 3D Features for 3D Object Detection [85.08249413137558]
LiDAR-produced point clouds are the major source for most state-of-the-art 3D object detectors.
Small, distant, and incomplete objects with sparse or few points are often hard to detect.
We present Sparse2Dense, a new framework to efficiently boost 3D detection performance by learning to densify point clouds in latent space.
arXiv Detail & Related papers (2022-11-23T16:01:06Z) - Bridged Transformer for Vision and Point Cloud 3D Object Detection [92.86856146086316]
Bridged Transformer (BrT) is an end-to-end architecture for 3D object detection.
BrT learns to identify 3D and 2D object bounding boxes from both points and image patches.
We experimentally show that BrT surpasses state-of-the-art methods on SUN RGB-D and ScanNetV2 datasets.
arXiv Detail & Related papers (2022-10-04T05:44:22Z) - PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal
Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without laborious and complicated network modifications (a minimal alignment sketch appears after this list).
arXiv Detail & Related papers (2022-07-07T07:23:20Z) - Unsupervised Learning of Fine Structure Generation for 3D Point Clouds
by 2D Projection Matching [66.98712589559028]
We propose an unsupervised approach for 3D point cloud generation with fine structures.
Our method can recover fine 3D structures from 2D silhouette images at different resolutions.
arXiv Detail & Related papers (2021-08-08T22:15:31Z) - Multi-Modality Task Cascade for 3D Object Detection [22.131228757850373]
Many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data.
We propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions.
We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance.
arXiv Detail & Related papers (2021-07-08T17:55:01Z) - ParaNet: Deep Regular Representation for 3D Point Clouds [62.81379889095186]
ParaNet is a novel end-to-end deep learning framework for representing 3D point clouds.
It converts an irregular 3D point cloud into a regular 2D color image, named point geometry image (PGI).
In contrast to conventional regular representation modalities based on multi-view projection and voxelization, the proposed representation is differentiable and reversible.
arXiv Detail & Related papers (2020-12-05T13:19:55Z)