Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via
Cross-modal Distillation
- URL: http://arxiv.org/abs/2203.11160v2
- Date: Wed, 21 Feb 2024 16:25:12 GMT
- Title: Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via
Cross-modal Distillation
- Authors: Antonin Vobecky, David Hurych, Oriane Siméoni, Spyros Gidaris,
Andrei Bursuc, Patrick Pérez, Josef Sivic
- Abstract summary: This work investigates learning pixel-wise semantic image segmentation in urban scenes without any manual annotation, just from the raw non-curated data collected by cars.
We propose a novel method for cross-modal unsupervised learning of semantic image segmentation by leveraging synchronized LiDAR and image data.
- Score: 32.33170182669095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work investigates learning pixel-wise semantic image segmentation in
urban scenes without any manual annotation, just from the raw non-curated data
collected by cars which, equipped with cameras and LiDAR sensors, drive around
a city. Our contributions are threefold. First, we propose a novel method for
cross-modal unsupervised learning of semantic image segmentation by leveraging
synchronized LiDAR and image data. The key ingredient of our method is the use
of an object proposal module that analyzes the LiDAR point cloud to obtain
proposals for spatially consistent objects. Second, we show that these 3D
object proposals can be aligned with the input images and reliably clustered
into semantically meaningful pseudo-classes. Finally, we develop a cross-modal
distillation approach that leverages image data partially annotated with the
resulting pseudo-classes to train a transformer-based model for image semantic
segmentation. We show the generalization capabilities of our method by testing
on four different testing datasets (Cityscapes, Dark Zurich, Nighttime Driving
and ACDC) without any finetuning, and demonstrate significant improvements
compared to the current state of the art on this problem. See project webpage
https://vobecant.github.io/DriveAndSegment/ for the code and more.
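To make the three-stage idea above concrete, below is a minimal, self-contained Python sketch of the pipeline: grouping LiDAR points into object proposals, clustering proposal features into pseudo-classes, and painting the projected proposals into the image as sparse pseudo-labels. It is not the authors' implementation; the grid-based proposal heuristic, the hand-crafted geometric features, the camera intrinsics `K_intr`, and the axis convention are illustrative assumptions (the paper clusters learned features and then trains a transformer-based segmenter on the resulting pseudo-labels, which is omitted here).

```python
# A minimal, dependency-light sketch of the cross-modal pipeline (NOT the
# authors' code): toy LiDAR object proposals -> clustering into pseudo-classes
# -> projection into the image as sparse pseudo-labels. All names, the grid
# heuristic and the calibration below are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def lidar_object_proposals(points, grid=2.0, min_pts=20):
    """Toy proposal module: group non-ground LiDAR points by coarse XY grid cell."""
    non_ground = points[points[:, 2] > 0.3]               # crude ground removal by height
    cells = np.floor(non_ground[:, :2] / grid).astype(int)
    proposals = []
    for cell in np.unique(cells, axis=0):
        mask = np.all(cells == cell, axis=1)
        if mask.sum() >= min_pts:
            proposals.append(non_ground[mask])
    return proposals

def project_to_image(points_cam, K):
    """Project 3D points given in the camera frame to pixel coordinates."""
    uvw = (K @ points_cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]

# 1) Propose spatially consistent objects from the (here: random) point cloud.
rng = np.random.default_rng(0)
cloud = rng.uniform([-20.0, -20.0, 0.0], [20.0, 20.0, 3.0], size=(20000, 3))
proposals = lidar_object_proposals(cloud)

# 2) Describe each proposal with a simple geometric feature and cluster the
#    proposals into pseudo-classes (the paper clusters learned features instead).
feats = np.stack([np.concatenate([p.mean(0), p.std(0)]) for p in proposals])
pseudo_classes = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(feats)

# 3) Paint the projected proposal points with their pseudo-class to obtain a
#    sparse pseudo-label map; a transformer-based segmenter would then be
#    trained on many such partially labelled images (distillation, omitted).
K_intr = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pseudo_label = np.full((480, 640), -1, dtype=int)          # -1 = unlabelled pixel
for prop, cls in zip(proposals, pseudo_classes):
    cam_pts = prop[:, [1, 2, 0]]                            # toy LiDAR -> camera axis swap
    cam_pts = cam_pts[cam_pts[:, 2] > 1e-3]                 # keep points in front of the camera
    uv = project_to_image(cam_pts, K_intr).astype(int)
    keep = (uv[:, 0] >= 0) & (uv[:, 0] < 640) & (uv[:, 1] >= 0) & (uv[:, 1] < 480)
    pseudo_label[uv[keep, 1], uv[keep, 0]] = cls

print(len(proposals), "proposals;", int((pseudo_label >= 0).sum()), "pseudo-labelled pixels")
```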
Related papers
- Unleash the Potential of Image Branch for Cross-modal 3D Object
Detection [67.94357336206136]
We present a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects.
First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation.
Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch.
arXiv Detail & Related papers (2023-01-22T08:26:58Z)
- Estimation of Appearance and Occupancy Information in Birds Eye View from Surround Monocular Images [2.69840007334476]
Birds-eye View (BEV) expresses the location of different traffic participants in the ego vehicle frame from a top-down view.
We propose a novel representation that captures the appearance and occupancy information of various traffic participants from an array of monocular cameras covering a 360 deg field of view (FOV).
We use a learned image embedding of all camera images to generate a BEV of the scene at any instant that captures both appearance and occupancy of the scene.
arXiv Detail & Related papers (2022-11-08T20:57:56Z)
- Paint and Distill: Boosting 3D Object Detection with Semantic Passing Network [70.53093934205057]
The 3D object detection task from lidar or camera sensors is essential for autonomous driving.
We propose a novel semantic passing framework, named SPNet, to boost the performance of existing lidar-based 3D detection models.
arXiv Detail & Related papers (2022-07-12T12:35:34Z)
- Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data [80.14669385741202]
We propose a self-supervised pre-training method for 3D perception models tailored to autonomous driving data.
We leverage the availability of synchronized and calibrated image and Lidar sensors in autonomous driving setups.
Our method does not require any point cloud or image annotations.
arXiv Detail & Related papers (2022-03-30T12:40:30Z)
- Unsupervised Part Discovery from Contrastive Reconstruction [90.88501867321573]
The goal of self-supervised visual representation learning is to learn strong, transferable image representations.
We propose an unsupervised approach to object part discovery and segmentation.
Our method yields semantic parts consistent across fine-grained but visually distinct categories.
arXiv Detail & Related papers (2021-11-11T17:59:42Z)
- Learning 3D Semantic Segmentation with only 2D Image Supervision [18.785840615548473]
We train a 3D model from pseudo-labels derived from 2D semantic image segmentations using multiview fusion.
The proposed network architecture, 2D3DNet, achieves significantly better performance than baselines during experiments on a new urban dataset with lidar and images captured in 20 cities across 5 continents.
arXiv Detail & Related papers (2021-10-21T17:56:28Z)
- Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images [69.5662419067878]
Grounding referring expressions in RGBD images is an emerging field.
We present a novel task of 3D visual grounding in single-view RGBD image where the referred objects are often only partially scanned due to occlusion.
Our approach first fuses the language and the visual features at the bottom level to generate a heatmap that localizes the relevant regions in the RGBD image.
Then our approach conducts an adaptive feature learning based on the heatmap and performs the object-level matching with another visio-linguistic fusion to finally ground the referred object.
arXiv Detail & Related papers (2021-03-14T11:18:50Z)
- Robust 2D/3D Vehicle Parsing in CVIS [54.825777404511605]
We present a novel approach to robustly detect and perceive vehicles in different camera views as part of a cooperative vehicle-infrastructure system (CVIS).
Our formulation is designed for arbitrary camera views and makes no assumptions about intrinsic or extrinsic parameters.
In practice, our approach outperforms SOTA methods on 2D detection, instance segmentation, and 6-DoF pose estimation.
arXiv Detail & Related papers (2021-03-11T03:35:05Z)
- SemanticVoxels: Sequential Fusion for 3D Pedestrian Detection using LiDAR Point Cloud and Semantic Segmentation [4.350338899049983]
We propose a generalization of PointPainting to be able to apply fusion at different levels.
We show that SemanticVoxels achieves state-of-the-art performance in both 3D and bird's eye view pedestrian detection benchmarks.
arXiv Detail & Related papers (2020-09-25T14:52:32Z)
- Semantic sensor fusion: from camera to sparse lidar information [7.489722641968593]
This paper presents an approach to fuse different sensory information: Light Detection and Ranging (lidar) scans and camera images.
The transfer of semantic information from the labelled image to the lidar point cloud is performed in four steps (see the projection sketch after this entry).
arXiv Detail & Related papers (2020-03-04T03:09:33Z)
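Several of the fusion papers above (SemanticVoxels, the camera-to-sparse-lidar fusion, and, in distilled form, Drive&Segment itself) rely on the same basic operation: projecting LiDAR points into a semantically segmented image and transferring the per-pixel labels or scores to the points. Below is a minimal PointPainting-style sketch of that projection-and-transfer step; the function name, shapes, and calibration matrices are illustrative assumptions, not any specific paper's API.

```python
# Minimal sketch of camera-to-LiDAR label transfer ("painting"); all shapes,
# names and calibration matrices are illustrative assumptions, not any
# specific paper's API.
import numpy as np

def paint_points(points_lidar, sem_scores, T_cam_from_lidar, K):
    """Append per-pixel semantic scores to every LiDAR point that projects into the image.

    points_lidar     : (N, 3) XYZ points in the LiDAR frame
    sem_scores       : (H, W, C) per-pixel class scores from an image segmentation network
    T_cam_from_lidar : (4, 4) extrinsic transform LiDAR -> camera
    K                : (3, 3) camera intrinsics
    """
    H, W, C = sem_scores.shape
    # LiDAR -> camera frame (homogeneous coordinates).
    pts_h = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    # Perspective projection to pixels; keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 1e-3
    uvw = (K @ pts_cam[in_front].T).T
    uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    # Gather the class scores of the pixel each visible point falls on.
    painted = np.zeros((len(points_lidar), C))
    visible = np.flatnonzero(in_front)[inside]
    painted[visible] = sem_scores[uv[inside, 1], uv[inside, 0]]
    # Each point now carries XYZ plus C semantic channels for a downstream 3D detector.
    return np.concatenate([points_lidar, painted], axis=1)

# Toy usage with random data in place of a real image segmenter and calibration.
rng = np.random.default_rng(0)
pts = rng.uniform([-10.0, -10.0, -1.0], [10.0, 10.0, 2.0], size=(1000, 3))
scores = rng.random((480, 640, 5))
T = np.eye(4)
K_intr = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
print(paint_points(pts, scores, T, K_intr).shape)   # (1000, 3 + 5)
```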
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.