Lift, Splat, Map: Lifting Foundation Masks for Label-Free Semantic Scene Completion
- URL: http://arxiv.org/abs/2407.03425v1
- Date: Wed, 3 Jul 2024 18:08:05 GMT
- Title: Lift, Splat, Map: Lifting Foundation Masks for Label-Free Semantic Scene Completion
- Authors: Arthur Zhang, Rainier Heijne, Joydeep Biswas
- Abstract summary: We propose LSMap, a method to predict a continuous, open-set semantic and elevation-aware representation in bird's eye view.
Our model only requires a single RGBD image, does not require human labels, and operates in real time.
We show that our pre-trained representation outperforms existing visual foundation models at unsupervised semantic scene completion.
- Score: 7.781799395896687
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Autonomous mobile robots deployed in urban environments must be context-aware, i.e., able to distinguish between different semantic entities, and robust to occlusions. Current approaches like semantic scene completion (SSC) require pre-enumerating the set of classes and costly human annotations, while representation learning methods relax these assumptions but are not robust to occlusions and learn representations tailored towards auxiliary tasks. To address these limitations, we propose LSMap, a method that lifts masks from visual foundation models to predict a continuous, open-set semantic and elevation-aware representation in bird's eye view (BEV) for the entire scene, including regions underneath dynamic entities and in occluded areas. Our model only requires a single RGBD image, does not require human labels, and operates in real time. We quantitatively demonstrate that our approach outperforms existing models trained from scratch on semantic and elevation scene completion tasks with finetuning. Furthermore, we show that our pre-trained representation outperforms existing visual foundation models at unsupervised semantic scene completion. We evaluate our approach using CODa, a large-scale, real-world urban robot dataset. Supplementary visualizations, code, data, and pre-trained models will be publicly available soon.
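The lift-and-splat step the abstract describes, back-projecting per-pixel features with the depth channel and pooling them into a bird's eye view grid, can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the grid extent, resolution, and scatter-mean pooling are assumptions, and the per-pixel features stand in for embeddings pooled from foundation-model masks.

```python
# Minimal sketch of lifting per-pixel features with depth and splatting them into a BEV grid.
# Grid extent/resolution and mean pooling are illustrative assumptions, not the paper's design.
import torch

def lift_splat(features, depth, K, bev_range=20.0, bev_res=0.2):
    """features: (C, H, W) per-pixel embeddings; depth: (H, W) in metres; K: (3, 3) intrinsics."""
    C, H, W = features.shape
    # Lift: back-project every pixel to a 3D point (x right, y down, z forward) in the camera frame.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.flatten()
    x = (u.flatten() - K[0, 2]) * z / K[0, 0]
    y = (v.flatten() - K[1, 2]) * z / K[1, 1]

    # Splat: bin points into a square BEV grid (lateral x, forward z), pooling features and height.
    n_cells = round(2 * bev_range / bev_res)
    gx = ((x + bev_range) / bev_res).long().clamp(0, n_cells - 1)
    gz = (z / bev_res).long().clamp(0, n_cells - 1)
    valid = z > 0
    cell = (gz * n_cells + gx)[valid]

    count = torch.zeros(n_cells * n_cells).scatter_add_(0, cell, torch.ones_like(cell, dtype=torch.float)).clamp(min=1)
    bev = torch.zeros(C, n_cells * n_cells).scatter_add_(1, cell.expand(C, -1), features.reshape(C, -1)[:, valid])
    elev = torch.zeros(n_cells * n_cells).scatter_add_(0, cell, -y[valid])  # negate: camera y points down
    return (bev / count).reshape(C, n_cells, n_cells), (elev / count).reshape(n_cells, n_cells)

# Toy usage with random tensors standing in for foundation-model features and an RGBD frame.
K = torch.tensor([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
bev, elev = lift_splat(torch.randn(16, 480, 640), torch.rand(480, 640) * 20, K)
print(bev.shape, elev.shape)  # torch.Size([16, 200, 200]) torch.Size([200, 200])
```

Mean pooling per cell is only a fixed aggregation used here for illustration; per the abstract, the BEV semantic and elevation representation is predicted by a learned model, including for occluded cells and regions underneath dynamic entities.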
Related papers
- Evolved Hierarchical Masking for Self-Supervised Learning [49.77271430882176]
Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training.
This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning.
arXiv Detail & Related papers (2025-04-12T09:40:14Z)
- Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors.
We propose a new efficient post-training stage for ViTs called locality alignment.
We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
- TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability.
To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT.
This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- Object-level Scene Deocclusion [92.39886029550286]
We present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, for object-level scene deocclusion.
To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning.
Experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the arts by a large margin.
arXiv Detail & Related papers (2024-06-11T20:34:10Z)
- Annotation Free Semantic Segmentation with Vision Foundation Models [11.026377387506216]
We generate free annotations for any semantic segmentation dataset using existing foundation models.
We use CLIP to detect objects and SAM to generate high quality object masks.
Our approach can bring language-based semantics to any pretrained vision encoder with minimal training.
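A rough sketch of that general recipe (not necessarily the paper's exact pipeline) is shown below: SAM proposes class-agnostic masks and CLIP scores each masked crop against a set of text prompts to assign a pseudo-label. The checkpoint path, prompt list, input image name, and crop-and-score strategy are assumptions made for illustration.

```python
# Illustrative combination of SAM mask proposals with CLIP text scoring; checkpoint paths,
# prompts, and the crop-and-score strategy are assumptions, not the paper's exact method.
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")         # assumed local checkpoint
mask_generator = SamAutomaticMaskGenerator(sam.to(device))
clip_model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a photo of a car", "a photo of a person", "a photo of a building"]  # example prompts
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(labels).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

image = np.array(Image.open("street.jpg").convert("RGB"))             # hypothetical input image
for m in mask_generator.generate(image):                              # dicts with 'segmentation', 'bbox', ...
    x, y, w, h = (int(v) for v in m["bbox"])                           # SAM bbox is XYWH
    crop = Image.fromarray(image[y:y + h, x:x + w])
    with torch.no_grad():
        img_feat = clip_model.encode_image(preprocess(crop).unsqueeze(0).to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)
    print(labels[probs.argmax().item()], m["segmentation"].shape)      # pseudo-label for this mask
```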
arXiv Detail & Related papers (2024-03-14T11:57:58Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Learning with a Mole: Transferable latent spatial representations for navigation without reconstruction [12.845774297648736]
In most end-to-end learning approaches the representation is latent and usually does not have a clearly defined interpretation.
In this work we propose to learn an actionable representation of the scene independently of the targeted downstream task.
The learned representation is optimized by a blind auxiliary agent trained to navigate with it over multiple short sub-episodes branching out from a waypoint.
arXiv Detail & Related papers (2023-06-06T16:51:43Z)
- Visual Affordance Prediction for Guiding Robot Exploration [56.17795036091848]
We develop an approach for learning visual affordances for guiding robot exploration.
We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE.
We show how the trained affordance model can be used for guiding exploration by acting as a goal-sampling distribution, during visual goal-conditioned policy learning in robotic manipulation.
arXiv Detail & Related papers (2023-05-28T17:53:09Z)
- Unsupervised Continual Semantic Adaptation through Neural Rendering [32.099350613956716]
We study continual multi-scene adaptation for the task of semantic segmentation.
We propose training a Semantic-NeRF network for each scene by fusing the predictions of a segmentation model.
We evaluate our approach on ScanNet, where we outperform both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method.
arXiv Detail & Related papers (2022-11-25T09:31:41Z)
- Exploring Target Representations for Masked Autoencoders [78.57196600585462]
We show that a careful choice of the target representation is unnecessary for learning good representations.
We propose a multi-stage masked distillation pipeline and use a randomly initialized model as the teacher.
The proposed method, which performs masked knowledge distillation with bootstrapped teachers (dBOT), outperforms previous self-supervised methods by nontrivial margins.
arXiv Detail & Related papers (2022-09-08T16:55:19Z)
- Towards Single Stage Weakly Supervised Semantic Segmentation [2.28438857884398]
We present a single-stage approach to weakly supervised semantic segmentation.
We use point annotations to generate reliable, on-the-fly pseudo-masks.
We significantly outperform other SOTA WSSS methods on recent real-world datasets.
arXiv Detail & Related papers (2021-06-18T18:34:50Z)
- Hidden Footprints: Learning Contextual Walkability from 3D Human Trails [70.01257397390361]
Current datasets only tell you where people are, not where they could be.
We first augment the set of valid, labeled walkable regions by propagating person observations between images, utilizing 3D information to create what we call hidden footprints.
We devise a training strategy designed for such sparse labels, combining a class-balanced classification loss with a contextual adversarial loss.
arXiv Detail & Related papers (2020-08-19T23:19:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.