OCTOPUS: Open-vocabulary Content Tracking and Object Placement Using
Semantic Understanding in Mixed Reality
- URL: http://arxiv.org/abs/2312.12815v1
- Date: Wed, 20 Dec 2023 07:34:20 GMT
- Title: OCTOPUS: Open-vocabulary Content Tracking and Object Placement Using
Semantic Understanding in Mixed Reality
- Authors: Luke Yoffe, Aditya Sharma, Tobias Höllerer
- Abstract summary: We introduce a new open-vocabulary method for object placement in augmented reality.
In a preliminary user study, we show that our method performs at least as well as human experts 57% of the time.
- Score: 3.469644923522024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One key challenge in augmented reality is the placement of virtual content in
natural locations. Existing automated techniques are only able to work with a
closed-vocabulary, fixed set of objects. In this paper, we introduce a new
open-vocabulary method for object placement. Our eight-stage pipeline leverages
recent advances in segmentation models, vision-language models, and LLMs to
place any virtual object in any AR camera frame or scene. In a preliminary user
study, we show that our method performs at least as well as human experts 57%
of the time.
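The abstract describes the pipeline only at a high level. As a rough sketch of one plausible core step (segment candidate regions, then let a vision-language model rank them for the target object), the snippet below assumes CLIP from Hugging Face transformers and takes the region proposals as given; it is an illustration, not the authors' released code.

```python
# Hedged sketch: rank candidate placement regions for an arbitrary object
# by scoring crops against a text prompt with CLIP. Region proposals
# (e.g., from a segmentation model such as SAM) are assumed given.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_placements(frame: Image.Image, boxes, object_name: str):
    """Return (box, score) pairs for placing `object_name`, best first.

    `boxes` is a list of (left, top, right, bottom) region proposals.
    """
    crops = [frame.crop(box) for box in boxes]
    prompt = f"a natural surface to place a {object_name} on"
    inputs = processor(text=[prompt], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image.squeeze(-1)  # one score per crop
    order = torch.argsort(scores, descending=True).tolist()
    return [(boxes[i], scores[i].item()) for i in order]
```

A full eight-stage pipeline of the kind the abstract describes would add LLM stages around this, for example asking an LLM which surfaces the object plausibly belongs on before scoring them.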
Related papers
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z)
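As a rough illustration of image-wise guidance from a vision foundation model (not this paper's actual mechanism), one can project 3D box proposals into the camera frame and hand the resulting crops to a 2D open-vocabulary classifier; the projection helpers below are generic computer-vision boilerplate, not code from the paper.

```python
# Illustrative only: turn a 3D box proposal into a 2D image crop that a
# 2D open-vocabulary model (e.g., CLIP) can label for the 3D detector.
import numpy as np
from PIL import Image

def project_points(pts_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project (N, 3) camera-frame points to pixels with intrinsics K."""
    uv = (K @ pts_3d.T).T          # homogeneous image coordinates, (N, 3)
    return uv[:, :2] / uv[:, 2:3]  # perspective divide

def box_to_crop(corners_3d: np.ndarray, K: np.ndarray, image: Image.Image):
    """Crop the image region spanned by a 3D box's eight projected corners."""
    px = project_points(corners_3d, K)
    left, top = px.min(axis=0)
    right, bottom = px.max(axis=0)
    return image.crop((float(left), float(top), float(right), float(bottom)))
```

Scoring each crop against an arbitrary class list then gives the 3D branch labels for categories it never saw during 3D training.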
- OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding [21.64446104872021]
We introduce OpenObj, an innovative approach to building open-vocabulary object-level Neural Radiance Fields with fine-grained understanding.
In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object level.
The results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot semantic and retrieval tasks.
arXiv Detail & Related papers (2024-06-12T08:59:33Z)
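The zero-shot retrieval setting reported above can be pictured with a small embedding lookup; the per-object feature store below is an assumption for illustration, not OpenObj's implementation.

```python
# Minimal zero-shot retrieval sketch: rank per-object feature vectors
# against a free-form text query in a shared vision-language space.
import torch

def retrieve(query_emb: torch.Tensor, object_feats: torch.Tensor,
             names: list[str]) -> list[str]:
    """Return object names ranked by cosine similarity to the query.

    query_emb: (D,) text embedding; object_feats: (N, D) object features,
    both assumed to live in the same embedding space (e.g., CLIP's).
    """
    q = query_emb / query_emb.norm()
    f = object_feats / object_feats.norm(dim=-1, keepdim=True)
    sims = f @ q  # (N,) cosine similarities
    return [names[i] for i in sims.argsort(descending=True).tolist()]
```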
- MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual Prompting [106.53784213239479]
We present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that employs vision language models to solve robotic manipulation tasks.
At the heart of our approach is a compact point-based representation of affordance and motion that bridges the VLM's predictions on RGB images and the robot's motions in the physical world.
We evaluate and analyze MOKA's performance on a variety of manipulation tasks specified by free-form language descriptions.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
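The compact point-based representation can be made concrete with a small data structure; the field names below are illustrative guesses at what such a representation holds, not MOKA's exact schema.

```python
# Hypothetical point-based affordance/motion record in the spirit of MOKA:
# everything the VLM must predict is a handful of 2D image points.
from dataclasses import dataclass, field

Point = tuple[float, float]  # pixel coordinates (x, y)

@dataclass
class AffordanceMotion:
    grasp_point: Point             # where the gripper should grasp
    function_point: Point          # object part that acts on the target
    target_point: Point            # where the interaction should happen
    waypoints: list[Point] = field(default_factory=list)  # motion path

    def as_marks(self) -> list[Point]:
        """Flatten to the candidate marks a VLM selects on the image."""
        return [self.grasp_point, self.function_point,
                self.target_point, *self.waypoints]
```

Because every field is a 2D image point, the VLM's predictions can be lifted into physical robot motions using only depth and camera calibration.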
- OCTO+: A Suite for Automatic Open-Vocabulary Object Placement in Mixed Reality [3.469644923522024]
We introduce and evaluate several methods for automatic object placement using recent advances in open-vocabulary vision-language models.
We find that OCTO+ places objects in a valid region over 70% of the time, outperforming other methods on a range of metrics.
arXiv Detail & Related papers (2024-01-17T04:52:40Z)
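A valid-region rate like the 70% above can be computed with a simple mask-membership check; this snippet is one plausible reading of such a metric, not the paper's evaluation code.

```python
# Sketch of a valid-placement metric: the predicted placement point must
# land inside a human-annotated boolean mask of acceptable regions.
import numpy as np

def valid_placement_rate(points, masks) -> float:
    """Fraction of predicted (x, y) points inside their HxW boolean mask."""
    hits = 0
    for (x, y), mask in zip(points, masks):
        h, w = mask.shape
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < w and 0 <= yi < h and mask[yi, xi]:
            hits += 1
    return hits / max(len(points), 1)
```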
- Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition [63.95111791861103]
Existing methods typically adapt pretrained image-text models to the video domain.
We argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition.
Our method not only achieves new state-of-the-art performance but also offers strong interpretability.
arXiv Detail & Related papers (2023-12-04T02:31:38Z)
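One common way to fold prior knowledge into text embeddings (a generic technique sketched here, not necessarily this paper's exact recipe) is to embed several knowledge-enriched prompts per action and average them:

```python
# Generic prompt-ensembling sketch: enrich a bare action name with
# descriptive context before embedding, then average the embeddings.
import torch

def action_embedding(action: str, encode_text) -> torch.Tensor:
    """Average embedding over knowledge-augmented prompts for one action.

    `encode_text(str) -> Tensor` is any text encoder mapping into the
    shared video-text space (e.g., a CLIP text tower).
    """
    prompts = [
        f"a video of a person {action}",
        f"{action}, an action involving characteristic body movement",
        f"someone is {action} in an everyday scene",
    ]
    embs = torch.stack([encode_text(p) for p in prompts])
    emb = embs.mean(dim=0)
    return emb / emb.norm()
```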
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task: open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex-scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating class-level semantic guidance with complementary structural cues from edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- One-shot Imitation Learning via Interaction Warping [32.5466340846254]
We propose a new method, Interaction Warping, for learning SE(3) robotic manipulation policies from a single demonstration.
We infer the 3D mesh of each object in the environment using shape warping, a technique for aligning point clouds across object instances.
We show successful one-shot imitation learning on three simulated and real-world object re-arrangement tasks.
arXiv Detail & Related papers (2023-06-21T17:26:11Z)
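Shape warping aligns point clouds across object instances. As a greatly simplified, rigid-only stand-in for that idea (the paper's warping is non-rigid), the Kabsch algorithm recovers the best rotation and translation between corresponding points:

```python
# Rigid point-cloud alignment via the Kabsch algorithm: a simplified,
# rigid-only stand-in for the non-rigid shape warping used in the paper.
import numpy as np

def kabsch(src: np.ndarray, dst: np.ndarray):
    """Best-fit rotation R and translation t mapping src onto dst.

    Both arrays are (N, 3), with row i of src corresponding to row i of dst.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```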
- Ditto in the House: Building Articulation Models of Indoor Scenes through Interactive Perception [31.009703947432026]
This work explores building articulation models of indoor scenes through a robot's purposeful interactions.
We introduce an interactive perception approach to this task.
We demonstrate the effectiveness of our approach in both simulation and real-world scenes.
arXiv Detail & Related papers (2023-02-02T18:22:00Z)
- Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions [74.63313641583602]
This paper studies the task of grasping any object from known categories using free-form language instructions.
We bring language understanding and robotic grasping together on this open challenge, which is essential to human-robot interaction.
We propose a language-guided 6-DoF category-level object localization model to achieve robotic grasping by comprehending human intention.
arXiv Detail & Related papers (2022-05-09T04:25:14Z)
- Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
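Grouping pixels by independent motion rather than appearance can be pictured with a toy optical-flow clustering step; this is a conceptual illustration of the motion cue only, not the paper's method.

```python
# Toy motion cue for object discovery: cluster optical-flow vectors so
# that independently moving regions separate from the background.
import numpy as np
from sklearn.cluster import KMeans

def motion_segments(flow: np.ndarray, k: int = 3) -> np.ndarray:
    """Cluster an HxWx2 optical-flow field into k motion groups.

    Returns an HxW label map; the largest cluster is typically background
    or camera motion, the rest are candidate moving objects.
    """
    h, w, _ = flow.shape
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(flow.reshape(-1, 2))
    return labels.reshape(h, w)
```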
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.