Instance Segmentation with Cross-Modal Consistency
- URL: http://arxiv.org/abs/2210.08113v1
- Date: Fri, 14 Oct 2022 21:17:19 GMT
- Title: Instance Segmentation with Cross-Modal Consistency
- Authors: Alex Zihao Zhu, Vincent Casser, Reza Mahjourian, Henrik Kretzschmar, Sören Pirk
- Abstract summary: We introduce a novel approach to instance segmentation that jointly leverages measurements from multiple sensor modalities.
Our technique applies contrastive learning to points in the scene both across sensor modalities and the temporal domain.
We demonstrate that this formulation encourages the models to learn embeddings that are invariant to viewpoint variations.
- Score: 13.524441194366544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Segmenting object instances is a key task in machine perception, with
safety-critical applications in robotics and autonomous driving. We introduce a
novel approach to instance segmentation that jointly leverages measurements
from multiple sensor modalities, such as cameras and LiDAR. Our method learns
to predict embeddings for each pixel or point that give rise to a dense
segmentation of the scene. Specifically, our technique applies contrastive
learning to points in the scene both across sensor modalities and the temporal
domain. We demonstrate that this formulation encourages the models to learn
embeddings that are invariant to viewpoint variations and consistent across
sensor modalities. We further demonstrate that the embeddings are stable over
time as objects move around the scene. This not only provides stable instance
masks, but can also provide valuable signals to downstream tasks, such as
object tracking. We evaluate our method on the Cityscapes and KITTI-360
datasets. We further conduct a number of ablation studies, demonstrating the
benefits of providing additional inputs to the contrastive loss.
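As a concrete illustration of the idea, the sketch below implements an InfoNCE-style contrastive loss over matched camera-pixel and LiDAR-point embeddings. It is a minimal sketch under our own assumptions (the function name, the temperature value, and the premise that correspondences come pre-computed by projecting LiDAR points into the image), not the paper's published implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(cam_emb, lidar_emb, temperature=0.1):
    """InfoNCE-style loss over matched camera/LiDAR embeddings.

    cam_emb:   (N, D) embeddings of camera pixels, where row i is matched to
    lidar_emb: (N, D) embeddings of the LiDAR points they project to.
    The names and the matching-by-projection setup are illustrative
    assumptions, not the paper's published implementation.
    """
    cam = F.normalize(cam_emb, dim=1)
    lidar = F.normalize(lidar_emb, dim=1)
    # Similarity of every camera embedding to every LiDAR embedding.
    logits = cam @ lidar.t() / temperature          # (N, N)
    # Row i's positive is column i (the matched point); all other
    # columns act as negatives drawn from the same scene.
    targets = torch.arange(cam.size(0), device=cam.device)
    # Symmetrize so both modalities are anchored in turn.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

The temporal term described in the abstract can reuse the same loss by treating a point's embedding at time t and its tracked correspondence at time t+1 as the positive pair.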
Related papers
- 3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding.
This paper introduces a novel approach to instance segmentation and tracking in first-person video.
By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z) - Simultaneous Clutter Detection and Semantic Segmentation of Moving Objects for Automotive Radar Data [12.96486891333286]
Radar sensors are an important part of the environment perception system of autonomous vehicles.
One of the first steps during the processing of radar point clouds is often the detection of clutter.
Another common objective is the semantic segmentation of moving road users.
We show that our setup is highly effective and outperforms every existing network for semantic segmentation on the RadarScenes dataset.
arXiv Detail & Related papers (2023-11-13T11:29:38Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple, end-to-end trainable bottom-up approach that achieves instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets and has minimal run-time compared to other contemporary state-of-the-art methods.
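To make "bottom-up, pixel-level" concrete: such methods typically predict a per-pixel embedding and then group pixels into instance masks at inference time. The sketch below shows one naive greedy grouping scheme; the threshold and the seed-and-cluster loop are our illustrative assumptions and stand in for, rather than reproduce, the paper's tag-based attention mechanism.

```python
import torch
import torch.nn.functional as F

def group_pixels_into_instances(emb, sim_threshold=0.9):
    """Greedy clustering of per-pixel embeddings into instance masks.

    emb: (H, W, D) per-pixel embeddings. The greedy seed-and-threshold
    scheme below is an illustrative stand-in for the paper's tag-based
    grouping, not a reimplementation of it.
    """
    H, W, D = emb.shape
    flat = F.normalize(emb.reshape(-1, D), dim=1)    # (H*W, D)
    instance_id = torch.full((H * W,), -1, dtype=torch.long)
    next_id = 0
    while (instance_id == -1).any():
        # Pick an arbitrary unassigned pixel as the seed.
        seed = (instance_id == -1).nonzero()[0, 0]
        sim = flat @ flat[seed]                      # cosine similarity
        members = (sim > sim_threshold) & (instance_id == -1)
        instance_id[members] = next_id
        next_id += 1
    return instance_id.reshape(H, W)
```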
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - SegmentMeIfYouCan: A Benchmark for Anomaly Segmentation [111.61261419566908]
Deep neural networks (DNNs) are usually trained on a closed set of semantic classes.
They are ill-equipped to handle previously-unseen objects.
Detecting and localizing such objects is crucial for safety-critical applications such as perception for automated driving.
arXiv Detail & Related papers (2021-04-30T07:58:19Z) - 4D Panoptic LiDAR Segmentation [27.677435778317054]
We propose 4D panoptic LiDAR segmentation to assign a semantic class and a temporally-consistent instance ID to a sequence of 3D points.
Inspired by recent advances in the benchmarking of multi-object tracking, we propose to adopt a new evaluation metric that separates the semantic and point-to-instance association aspects of the task.
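As a hedged reconstruction of the decomposition this summary alludes to (the paper's LSTQ metric, to the best of our knowledge), the score factors into a semantic term and an association term:

```latex
\mathrm{LSTQ} = \sqrt{S_{\mathrm{cls}} \times S_{\mathrm{assoc}}}
```

where S_cls is the semantic segmentation quality (mean IoU over classes) and S_assoc measures how well points are associated to temporally-consistent instance IDs.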
arXiv Detail & Related papers (2021-02-24T18:56:16Z) - Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
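As a rough sketch of how multi-view consistency can act as a training constraint, the code below projects a 3D point into several calibrated cameras and penalizes disagreement between the per-view predictions at those projections. The pinhole projection is standard; the pairwise squared-difference loss and all names are our illustrative assumptions, not the paper's exact formulation.

```python
import torch

def project(point_3d, K, R, t):
    """Pinhole projection of a 3D world point into pixel coordinates."""
    cam = R @ point_3d + t            # world -> camera frame
    uvw = K @ cam                     # camera frame -> image plane
    return uvw[:2] / uvw[2]

def multiview_consistency_loss(point_3d, views):
    """Penalize disagreement of per-view predictions at the projections.

    views: list of (K, R, t, prob_map) tuples; prob_map is an (H, W)
    tensor of predicted foreground probabilities. Assumes all
    projections land inside the image. This pairwise squared-difference
    loss is an illustrative assumption, not the paper's formulation.
    """
    probs = []
    for K, R, t, prob_map in views:
        u, v = project(point_3d, K, R, t)
        probs.append(prob_map[v.long(), u.long()])
    probs = torch.stack(probs)
    # Every pair of views should agree on the same 3D point.
    return ((probs.unsqueeze(0) - probs.unsqueeze(1)) ** 2).mean()
```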
arXiv Detail & Related papers (2020-12-09T15:47:21Z) - "What's This?" -- Learning to Segment Unknown Objects from Manipulation Sequences [27.915309216800125]
We present a novel framework for self-supervised grasped object segmentation with a robotic manipulator.
We propose a single, end-to-end trainable architecture which jointly incorporates motion cues and semantic knowledge.
Our method depends neither on visual registration of a kinematic robot or 3D object models, nor on precise hand-eye calibration or any additional sensor data.
arXiv Detail & Related papers (2020-11-06T10:55:28Z) - Learning Invariant Representations for Reinforcement Learning without Reconstruction [98.33235415273562]
We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying on either domain knowledge or pixel reconstruction.
Bisimulation metrics quantify behavioral similarity between states in continuous MDPs.
We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks.
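For context, one common on-policy form of the bisimulation metric, written here as a hedged reconstruction rather than a quotation from the paper, is the fixed point d of:

```latex
d(s_i, s_j) = \left| r^{\pi}_{s_i} - r^{\pi}_{s_j} \right|
            + \gamma \, W_1\!\left( P^{\pi}(\cdot \mid s_i),\; P^{\pi}(\cdot \mid s_j);\; d \right)
```

where W_1(·, ·; d) is the 1-Wasserstein distance measured under d itself; learning an encoder whose embedding distances track d encourages task-irrelevant visual detail to collapse away.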
arXiv Detail & Related papers (2020-06-18T17:59:35Z) - Deep Soft Procrustes for Markerless Volumetric Sensor Alignment [81.13055566952221]
In this work, we improve markerless data-driven correspondence estimation to achieve more robust multi-sensor spatial alignment.
We incorporate geometric constraints in an end-to-end manner into a typical segmentation-based model and bridge the intermediate dense classification task with the target pose estimation task.
Our model is experimentally shown to achieve similar results with marker-based methods and outperform the markerless ones, while also being robust to the pose variations of the calibration structure.
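For background on the "Procrustes" part of the title, the classical hard version of the problem has the closed-form Kabsch/SVD solution sketched below; this is textbook material shown for orientation only, while the paper's contribution is a differentiable, soft variant driven by learned dense correspondences.

```python
import numpy as np

def kabsch_alignment(P, Q):
    """Classical orthogonal Procrustes (Kabsch): find R, t minimizing
    ||R @ P_i + t - Q_i||^2 over known correspondences. Standard
    closed-form background, not the paper's soft variant.

    P, Q: (N, 3) corresponding point sets.
    """
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - p_mean, Q - q_mean
    # Cross-covariance and its SVD give the optimal rotation.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Guard against reflections so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t
```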
arXiv Detail & Related papers (2020-03-23T10:51:32Z)