Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic
Segmentation
- URL: http://arxiv.org/abs/2204.07548v1
- Date: Fri, 15 Apr 2022 17:10:48 GMT
- Title: Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic
Segmentation
- Authors: Damien Robert, Bruno Vallet, Loic Landrieu
- Abstract summary: Recent works on 3D semantic segmentation propose to exploit the synergy between images and point clouds by processing each modality with a dedicated network.
We propose an end-to-end trainable multi-view aggregation model leveraging the viewing conditions of 3D points to merge features from images taken at arbitrary positions.
Our method can combine standard 2D and 3D networks and outperforms both 3D models operating on colorized point clouds and hybrid 2D/3D networks.
- Score: 3.5939555573102853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works on 3D semantic segmentation propose to exploit the synergy
between images and point clouds by processing each modality with a dedicated
network and projecting learned 2D features onto 3D points. Merging large-scale
point clouds and images raises several challenges, such as constructing a
mapping between points and pixels, and aggregating features between multiple
views. Current methods require mesh reconstruction or specialized sensors to
recover occlusions, and use heuristics to select and aggregate available
images. In contrast, we propose an end-to-end trainable multi-view aggregation
model leveraging the viewing conditions of 3D points to merge features from
images taken at arbitrary positions. Our method can combine standard 2D and 3D
networks and outperforms both 3D models operating on colorized point clouds and
hybrid 2D/3D networks without requiring colorization, meshing, or true depth
maps. We set a new state-of-the-art for large-scale indoor/outdoor semantic
segmentation on S3DIS (74.7 mIoU 6-Fold) and on KITTI-360 (58.3 mIoU). Our full
pipeline is accessible at https://github.com/drprojects/DeepViewAgg, and only
requires raw 3D scans and a set of images and poses.
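As a rough illustration of the point-pixel mapping and learned view weighting described in the abstract, the Python sketch below projects 3D points into posed images and merges the sampled 2D features with attention weights predicted from simple viewing conditions. The names (project_points, ViewAggregator) and the choice of conditioning features are illustrative assumptions, not the authors' actual implementation; see the linked repository for the real pipeline.

    import torch
    import torch.nn as nn

    def project_points(points, K, R, t):
        # points: (N, 3) world coordinates; K: (3, 3) intrinsics; (R, t): world-to-camera pose
        cam = points @ R.T + t                          # camera-frame coordinates
        depth = cam[:, 2].clamp(min=1e-6)               # depth along the optical axis
        uv = (cam @ K.T)[:, :2] / depth.unsqueeze(1)    # pixel coordinates of each point
        return uv, depth

    class ViewAggregator(nn.Module):
        # Scores each (point, view) pair from its viewing conditions and merges the
        # sampled 2D features with a softmax-weighted sum over the views.
        def __init__(self, cond_dim=3, hidden=32):
            super().__init__()
            self.score = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1))

        def forward(self, view_feats, view_conds, valid):
            # view_feats: (V, N, C) 2D features sampled at each point's projection in V images
            # view_conds: (V, N, cond_dim) viewing conditions, e.g. depth, angle, pixel offset
            # valid:      (V, N) bool, False where a point falls outside an image or is occluded
            logits = self.score(view_conds).squeeze(-1)               # one score per (view, point)
            logits = logits.masked_fill(~valid, float('-inf'))
            weights = torch.nan_to_num(torch.softmax(logits, dim=0))  # 0 for points seen by no view
            return (weights.unsqueeze(-1) * view_feats).sum(dim=0)    # (N, C) per-point image feature

The aggregated per-point image features would then be fused with the output of a standard 3D network before the segmentation head.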
Related papers
- SAM-guided Graph Cut for 3D Instance Segmentation [60.75119991853605]
This paper addresses the challenge of 3D instance segmentation by simultaneously leveraging 3D geometric and multi-view image information.
We introduce a novel 3D-to-2D query framework to effectively exploit 2D segmentation models for 3D instance segmentation.
Our method achieves robust segmentation performance and can generalize across different types of scenes.
arXiv Detail & Related papers (2023-12-13T18:59:58Z)
- DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields [68.94868475824575]
This paper introduces a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations.
We leverage the strong semantic prior within a 3D generative model to train a semantic decoder.
Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data.
arXiv Detail & Related papers (2023-11-18T21:58:28Z)
- Lightweight integration of 3D features to improve 2D image segmentation [1.3799488979862027]
We show that image segmentation can benefit from 3D geometric information without requiring 3D ground truth.
Our method can be applied to many 2D segmentation networks, significantly improving their performance.
arXiv Detail & Related papers (2022-12-16T08:22:55Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modifications.
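A minimal sketch of the pair-wise descriptor alignment described above, assuming the student point encoder already produces one descriptor per rendered view; the function name and the cosine-distance objective are illustrative assumptions rather than PointMCD's exact loss.

    import torch.nn.functional as F

    def pairwise_alignment_loss(student_views, teacher_views):
        # student_views, teacher_views: (B, V, D) one descriptor per rendered view, produced by
        # the point encoder (student) and the frozen, pretrained image encoder (teacher)
        s = F.normalize(student_views, dim=-1)
        t = F.normalize(teacher_views, dim=-1)
        return (1.0 - (s * t).sum(dim=-1)).mean()   # mean cosine distance over all (shape, view) pairs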
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
- Learning 3D Semantics from Pose-Noisy 2D Images with Hierarchical Full Attention Network [17.58032517457836]
We propose a novel framework to learn 3D point cloud semantics from 2D multi-view image observations containing pose error.
A hierarchical full attention network (HiFANet) is designed to sequentially aggregate patch, bag-of-frames, and inter-point semantic cues.
Experimental results show that the proposed framework significantly outperforms existing 3D point-cloud-based methods.
arXiv Detail & Related papers (2022-04-17T20:24:26Z)
- Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding [80.04281842702294]
We introduce the concept of the multi-view point cloud (Voint cloud), representing each 3D point as a set of features extracted from several viewpoints.
This novel 3D Voint cloud representation combines the compactness of 3D point cloud representation with the natural view-awareness of multi-view representation.
We deploy a Voint neural network (VointNet) with a theoretically established functional form to learn representations in the Voint space.
arXiv Detail & Related papers (2021-11-30T13:08:19Z)
- Learning 3D Semantic Segmentation with only 2D Image Supervision [18.785840615548473]
We train a 3D model from pseudo-labels derived from 2D semantic image segmentations using multi-view fusion.
The proposed network architecture, 2D3DNet, achieves significantly better performance than baselines during experiments on a new urban dataset with lidar and images captured in 20 cities across 5 continents.
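A minimal sketch of multi-view pseudo-label fusion of the kind described above, assuming per-view 2D predictions have already been projected onto the 3D points; the majority-vote rule and the min_votes threshold are illustrative assumptions, not necessarily 2D3DNet's exact scheme.

    import torch
    import torch.nn.functional as F

    def fuse_pseudo_labels(view_labels, valid, num_classes, min_votes=2):
        # view_labels: (V, N) long, class predicted for each 3D point by each view's 2D segmentation
        # valid:       (V, N) bool, False when a point is not visible in a view
        votes = (F.one_hot(view_labels, num_classes) * valid.unsqueeze(-1)).sum(dim=0)  # (N, C)
        count, label = votes.max(dim=-1)
        label[count < min_votes] = -1        # too few agreeing views: mark as ignore
        return label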
arXiv Detail & Related papers (2021-10-21T17:56:28Z)
- Multi-Modality Task Cascade for 3D Object Detection [22.131228757850373]
Many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data.
We propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions.
We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance.
arXiv Detail & Related papers (2021-07-08T17:55:01Z)
- ParaNet: Deep Regular Representation for 3D Point Clouds [62.81379889095186]
ParaNet is a novel end-to-end deep learning framework for representing 3D point clouds.
It converts an irregular 3D point cloud into a regular 2D color image, called a point geometry image (PGI).
In contrast to conventional regular representation modalities based on multi-view projection and voxelization, the proposed representation is differentiable and reversible.
arXiv Detail & Related papers (2020-12-05T13:19:55Z)
- 3D Crowd Counting via Geometric Attention-guided Multi-View Fusion [50.520192402702015]
We propose to solve the multi-view crowd counting task through 3D feature fusion with 3D scene-level density maps.
Compared to 2D fusion, the 3D fusion extracts more information about the people along the z-dimension (height), which helps to address the scale variations across multiple views.
The 3D density maps preserve the property of 2D density maps that the sum is the count, while also providing 3D information about the crowd density.
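A toy numerical check of the "sum is the count" property mentioned above, assuming each person contributes a unit-mass 3D Gaussian to a scene-level density map; the grid size and positions are made up for illustration.

    import numpy as np

    grid = np.zeros((32, 64, 64))                        # (z, y, x) scene-level voxel grid
    people = [(5, 10, 20), (8, 40, 12), (6, 30, 50)]     # hypothetical 3D head positions
    zz, yy, xx = np.meshgrid(*[np.arange(s) for s in grid.shape], indexing="ij")
    for pz, py, px in people:
        g = np.exp(-((zz - pz) ** 2 + (yy - py) ** 2 + (xx - px) ** 2) / (2 * 1.5 ** 2))
        grid += g / g.sum()                              # each person adds unit mass
    print(round(grid.sum()))                             # -> 3, the crowd count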
arXiv Detail & Related papers (2020-03-18T11:35:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.