Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models
- URL: http://arxiv.org/abs/2311.10883v1
- Date: Fri, 17 Nov 2023 21:58:26 GMT
- Title: Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models
- Authors: Yimeng Li, Navid Rajabi, Sulabh Shrestha, Md Alimoor Reza, and Jana Kosecka
- Abstract summary: We propose to leverage the recent advancements in state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer).
We aim to develop a cost-effective labeling approach to obtain pseudo-labels for semantic segmentation and object instance detection in indoor environments.
We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset.
- Score: 4.157013247909771
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The image annotation stage is a critical and often the most time-consuming
part required for training and evaluating object detection and semantic
segmentation models. Deployment of the existing models in novel environments
often requires detecting novel semantic classes not present in the training
data. Furthermore, indoor scenes contain significant viewpoint variations,
which need to be handled properly by trained perception models. We propose to
leverage the recent advancements in state-of-the-art models for bottom-up
segmentation (SAM), object detection (Detic), and semantic segmentation
(MaskFormer), all trained on large-scale datasets. We aim to develop a
cost-effective labeling approach to obtain pseudo-labels for semantic
segmentation and object instance detection in indoor environments, with the
ultimate goal of facilitating the training of lightweight models for various
downstream tasks. We also propose a multi-view labeling fusion stage, which
considers the setting where multiple views of the scenes are available and can
be used to identify and rectify single-view inconsistencies. We demonstrate the
effectiveness of the proposed approach on the Active Vision dataset and the
ADE20K dataset. We evaluate the quality of our labeling process by comparing it
with human annotations. Also, we demonstrate the effectiveness of the obtained
labels in downstream tasks such as object goal navigation and part discovery.
In the context of object goal navigation, we show enhanced performance using
this fusion approach compared to a zero-shot baseline that utilizes large
monolithic vision-language pre-trained models.
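The abstract ships no code, so the following is a minimal sketch of the kind of fusion it describes, assuming SAM supplies class-agnostic region masks, MaskFormer a per-pixel semantic map, and Detic instance masks with class scores. Every function name, input format, and threshold here is illustrative rather than the authors' actual pipeline.

```python
import numpy as np

def fuse_single_view(sam_masks, semantic_map, detections, overlap_thresh=0.5):
    """Assign a class to each class-agnostic SAM region.

    sam_masks    : list of HxW bool arrays (SAM region proposals)
    semantic_map : HxW int array of per-pixel class ids (e.g. MaskFormer output)
    detections   : list of (mask, class_id, score) tuples, where mask is an
                   HxW bool array (e.g. Detic instance predictions)
    Returns an HxW pseudo-label map; -1 marks unlabeled pixels.
    """
    pseudo = np.full(semantic_map.shape, -1, dtype=np.int64)
    for region in sam_masks:
        if not region.any():
            continue
        # Default label: majority semantic-segmentation class inside the region.
        label = int(np.bincount(semantic_map[region]).argmax())
        # Override with the highest-scoring detection that overlaps enough.
        best = 0.0
        for det_mask, cls_id, score in detections:
            inter = np.logical_and(region, det_mask).sum()
            union = np.logical_or(region, det_mask).sum()
            if union > 0 and inter / union >= overlap_thresh and score > best:
                label, best = int(cls_id), score
        pseudo[region] = label
    return pseudo

def fuse_multi_view(aligned_label_maps):
    """Per-pixel majority vote across views already warped into one reference
    frame (the geometric warping via depth and camera poses is assumed to
    happen upstream)."""
    stack = np.stack(aligned_label_maps)            # V x H x W
    fused = np.full(stack.shape[1:], -1, dtype=np.int64)
    for idx in np.ndindex(*stack.shape[1:]):
        votes = stack[(slice(None),) + idx]
        votes = votes[votes >= 0]                   # drop unlabeled views
        if votes.size:
            fused[idx] = np.bincount(votes).argmax()
    return fused
```

In this sketch a Detic detection overrides the MaskFormer majority vote only when its IoU with the SAM region is high, which is one simple way to reconcile disagreeing models; the multi-view step then takes a per-pixel majority over views that have already been warped into a common reference frame.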
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
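As one concrete illustration of PCA-based localization, the sketch below projects self-supervised patch features onto their first principal component and thresholds the scores; the input format and the sign-disambiguation heuristic are assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def pca_localize(patch_feats, grid_h, grid_w):
    """patch_feats: (grid_h * grid_w, d) patch embeddings from a
    self-supervised backbone (hypothetical input). Returns a coarse
    grid_h x grid_w boolean foreground mask."""
    centered = patch_feats - patch_feats.mean(axis=0, keepdims=True)
    # First principal component via SVD of the centered features.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]          # scores on the 1st principal component
    fg = proj > 0
    if fg.mean() > 0.5:              # heuristic: the object covers fewer patches
        fg = ~fg
    return fg.reshape(grid_h, grid_w)
```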
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
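A minimal sketch, assuming the cross-modal module is a small MLP that maps a text embedding of the referring expression into the prompt-embedding space of a frozen SAM decoder; the dimensions and layer layout are illustrative, not RefSAM's actual configuration.

```python
import torch
import torch.nn as nn

class CrossModalMLP(nn.Module):
    """Projects a text embedding of the referring expression into the
    prompt-embedding space of a frozen SAM decoder (illustrative sizes)."""
    def __init__(self, text_dim=768, prompt_dim=256, hidden_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, prompt_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # (batch, text_dim) -> (batch, prompt_dim); the output can serve
        # as a sparse prompt token for the mask decoder.
        return self.proj(text_emb)
```

Training only small adapters like this while keeping both encoders frozen is the usual shape of the parameter-efficient tuning the summary mentions.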
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Unsupervised Continual Semantic Adaptation through Neural Rendering [32.099350613956716]
We study continual multi-scene adaptation for the task of semantic segmentation.
We propose training a Semantic-NeRF network for each scene by fusing the predictions of a segmentation model.
We evaluate our approach on ScanNet, where we outperform both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method.
arXiv Detail & Related papers (2022-11-25T09:31:41Z)
- Self-supervised Pre-training for Semantic Segmentation in an Indoor Scene [8.357801312689622]
We propose RegConsist, a method for self-supervised pre-training of a semantic segmentation model.
We use a variant of contrastive learning to train a DCNN model for predicting semantic segmentation from RGB views in the target environment.
The proposed method outperforms models pre-trained on ImageNet and achieves competitive performance against models trained for exactly the same task but on a different dataset.
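To make the idea concrete, here is a generic InfoNCE-style consistency loss over embeddings of pixels that correspond across two registered views; this illustrates cross-view contrastive learning in general and is not claimed to be RegConsist's exact objective.

```python
import torch
import torch.nn.functional as F

def cross_view_contrastive_loss(feat_a, feat_b, temperature=0.1):
    """feat_a, feat_b: (N, d) embeddings of N corresponding pixels seen from
    two views of the same scene (correspondences recovered upstream from
    the agent's poses/depth). Matching rows are positives; all other pairs
    act as negatives."""
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    logits = a @ b.t() / temperature                 # (N, N) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```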
arXiv Detail & Related papers (2022-10-04T20:10:14Z)
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
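A hedged sketch of this general recipe: embed class names with a vision-language model, score region features against them, and keep confident matches as pseudo-labels. The input format, logit scale, and threshold below are illustrative assumptions.

```python
import numpy as np

def pseudo_label_regions(region_feats, class_text_feats,
                         logit_scale=100.0, score_thresh=0.5):
    """region_feats     : (R, d) L2-normalized region embeddings
    class_text_feats : (C, d) L2-normalized class-name embeddings
    Returns a list with (class_id, confidence) per region, or None when
    the best match falls below the threshold."""
    sims = region_feats @ class_text_feats.T          # cosine similarities
    logits = logit_scale * sims
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    out = []
    for p in probs:
        c = int(p.argmax())
        out.append((c, float(p[c])) if p[c] >= score_thresh else None)
    return out
```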
arXiv Detail & Related papers (2022-07-18T21:47:15Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
- UniT: Unified Knowledge Transfer for Any-shot Object Detection and Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large scale instance-level annotations for training.
We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision levels.
arXiv Detail & Related papers (2020-06-12T22:45:47Z)