OCTOPUS: Open-vocabulary Content Tracking and Object Placement Using
Semantic Understanding in Mixed Reality
- URL: http://arxiv.org/abs/2312.12815v1
- Date: Wed, 20 Dec 2023 07:34:20 GMT
- Title: OCTOPUS: Open-vocabulary Content Tracking and Object Placement Using
Semantic Understanding in Mixed Reality
- Authors: Luke Yoffe, Aditya Sharma, Tobias Höllerer
- Abstract summary: We introduce a new open-vocabulary method for object placement in augmented reality.
In a preliminary user study, we show that our method performs at least as well as human experts 57% of the time.
- Score: 3.469644923522024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One key challenge in augmented reality is the placement of virtual content in
natural locations. Existing automated techniques are only able to work with a
closed-vocabulary, fixed set of objects. In this paper, we introduce a new
open-vocabulary method for object placement. Our eight-stage pipeline leverages
recent advances in segmentation models, vision-language models, and LLMs to
place any virtual object in any AR camera frame or scene. In a preliminary user
study, we show that our method performs at least as well as human experts 57%
of the time.
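The abstract describes the pipeline only at a high level. As a rough sketch of one plausible core step (segment candidate regions, then let a vision-language model rank them for the target object), the snippet below assumes CLIP from Hugging Face transformers and takes the region proposals as given; it is an illustration, not the authors' released code.

```python
# Hedged sketch: rank candidate placement regions for an arbitrary object
# by scoring crops against a text prompt with CLIP. Region proposals
# (e.g., from a segmentation model such as SAM) are assumed given.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_placements(frame: Image.Image, boxes, object_name: str):
    """Return (box, score) pairs for placing `object_name`, best first.

    `boxes` is a list of (left, top, right, bottom) region proposals.
    """
    crops = [frame.crop(box) for box in boxes]
    prompt = f"a natural surface to place a {object_name} on"
    inputs = processor(text=[prompt], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image.squeeze(-1)  # one score per crop
    order = torch.argsort(scores, descending=True).tolist()
    return [(boxes[i], scores[i].item()) for i in order]
```

A full eight-stage pipeline of the kind the abstract describes would add LLM stages around this, for example asking an LLM which surfaces the object plausibly belongs on before scoring them.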
Related papers
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z)
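As a rough illustration of image-wise guidance from a vision foundation model (not this paper's actual mechanism), one can project 3D box proposals into the camera frame and hand the resulting crops to a 2D open-vocabulary classifier; the projection helpers below are generic computer-vision boilerplate, not code from the paper.

```python
# Illustrative only: turn a 3D box proposal into a 2D image crop that a
# 2D open-vocabulary model (e.g., CLIP) can label for the 3D detector.
import numpy as np
from PIL import Image

def project_points(pts_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project (N, 3) camera-frame points to pixels with intrinsics K."""
    uv = (K @ pts_3d.T).T          # homogeneous image coordinates, (N, 3)
    return uv[:, :2] / uv[:, 2:3]  # perspective divide

def box_to_crop(corners_3d: np.ndarray, K: np.ndarray, image: Image.Image):
    """Crop the image region spanned by a 3D box's eight projected corners."""
    px = project_points(corners_3d, K)
    left, top = px.min(axis=0)
    right, bottom = px.max(axis=0)
    return image.crop((float(left), float(top), float(right), float(bottom)))
```

Scoring each crop against an arbitrary class list then gives the 3D branch labels for categories it never saw during 3D training.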
- OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding [21.64446104872021]
We introduce OpenObj, an innovative approach to building open-vocabulary object-level Neural Radiance Fields with fine-grained understanding.
In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object level.
The results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot semantic and retrieval tasks.
arXiv Detail & Related papers (2024-06-12T08:59:33Z)
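The zero-shot retrieval setting reported above can be pictured with a small embedding lookup; the per-object feature store below is an assumption for illustration, not OpenObj's implementation.

```python
# Minimal zero-shot retrieval sketch: rank per-object feature vectors
# against a free-form text query in a shared vision-language space.
import torch

def retrieve(query_emb: torch.Tensor, object_feats: torch.Tensor,
             names: list[str]) -> list[str]:
    """Return object names ranked by cosine similarity to the query.

    query_emb: (D,) text embedding; object_feats: (N, D) object features,
    both assumed to live in the same embedding space (e.g., CLIP's).
    """
    q = query_emb / query_emb.norm()
    f = object_feats / object_feats.norm(dim=-1, keepdim=True)
    sims = f @ q  # (N,) cosine similarities
    return [names[i] for i in sims.argsort(descending=True).tolist()]
```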
- MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual Prompting [106.53784213239479]
We present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that employs vision language models to solve robotic manipulation tasks.
At the heart of our approach is a compact point-based representation of affordance and motion that bridges the VLM's predictions on RGB images and the robot's motions in the physical world.
We evaluate and analyze MOKA's performance on a variety of manipulation tasks specified by free-form language descriptions.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
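The compact point-based representation can be made concrete with a small data structure; the field names below are illustrative guesses at what such a representation holds, not MOKA's exact schema.

```python
# Hypothetical point-based affordance/motion record in the spirit of MOKA:
# everything the VLM must predict is a handful of 2D image points.
from dataclasses import dataclass, field

Point = tuple[float, float]  # pixel coordinates (x, y)

@dataclass
class AffordanceMotion:
    grasp_point: Point             # where the gripper should grasp
    function_point: Point          # object part that acts on the target
    target_point: Point            # where the interaction should happen
    waypoints: list[Point] = field(default_factory=list)  # motion path

    def as_marks(self) -> list[Point]:
        """Flatten to the candidate marks a VLM selects on the image."""
        return [self.grasp_point, self.function_point,
                self.target_point, *self.waypoints]
```

Because every field is a 2D image point, the VLM's predictions can be lifted into physical robot motions using only depth and camera calibration.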
- OCTO+: A Suite for Automatic Open-Vocabulary Object Placement in Mixed Reality [3.469644923522024]
We introduce and evaluate several methods for automatic object placement using recent advances in open-vocabulary vision-language models.
We find that OCTO+ places objects in a valid region over 70% of the time, outperforming other methods on a range of metrics.
arXiv Detail & Related papers (2024-01-17T04:52:40Z)
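A valid-region rate like the 70% above can be computed with a simple mask-membership check; this snippet is one plausible reading of such a metric, not the paper's evaluation code.

```python
# Sketch of a valid-placement metric: the predicted placement point must
# land inside a human-annotated boolean mask of acceptable regions.
import numpy as np

def valid_placement_rate(points, masks) -> float:
    """Fraction of predicted (x, y) points inside their HxW boolean mask."""
    hits = 0
    for (x, y), mask in zip(points, masks):
        h, w = mask.shape
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < w and 0 <= yi < h and mask[yi, xi]:
            hits += 1
    return hits / max(len(points), 1)
```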
- Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition [63.95111791861103]
Existing methods typically adapt pretrained image-text models to the video domain.
We argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition.
Our method not only achieves new state-of-the-art performance but also offers strong interpretability.
arXiv Detail & Related papers (2023-12-04T02:31:38Z)
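One common way to fold prior knowledge into text embeddings (a generic technique sketched here, not necessarily this paper's exact recipe) is to embed several knowledge-enriched prompts per action and average them:

```python
# Generic prompt-ensembling sketch: enrich a bare action name with
# descriptive context before embedding, then average the embeddings.
import torch

def action_embedding(action: str, encode_text) -> torch.Tensor:
    """Average embedding over knowledge-augmented prompts for one action.

    `encode_text(str) -> Tensor` is any text encoder mapping into the
    shared video-text space (e.g., a CLIP text tower).
    """
    prompts = [
        f"a video of a person {action}",
        f"{action}, an action involving characteristic body movement",
        f"someone is {action} in an everyday scene",
    ]
    embs = torch.stack([encode_text(p) for p in prompts])
    emb = embs.mean(dim=0)
    return emb / emb.norm()
```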
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task: open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex-scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating class-level semantic guidance with complementary structural cues from edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- One-shot Imitation Learning via Interaction Warping [32.5466340846254]
We propose a new method, Interaction Warping, for learning SE(3) robotic manipulation policies from a single demonstration.
We infer the 3D mesh of each object in the environment using shape warping, a technique for aligning point clouds across object instances.
We show successful one-shot imitation learning on three simulated and real-world object re-arrangement tasks.
arXiv Detail & Related papers (2023-06-21T17:26:11Z)
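Shape warping aligns point clouds across object instances. As a greatly simplified, rigid-only stand-in for that idea (the paper's warping is non-rigid), the Kabsch algorithm recovers the best rotation and translation between corresponding points:

```python
# Rigid point-cloud alignment via the Kabsch algorithm: a simplified,
# rigid-only stand-in for the non-rigid shape warping used in the paper.
import numpy as np

def kabsch(src: np.ndarray, dst: np.ndarray):
    """Best-fit rotation R and translation t mapping src onto dst.

    Both arrays are (N, 3), with row i of src corresponding to row i of dst.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```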
- Ditto in the House: Building Articulation Models of Indoor Scenes through Interactive Perception [31.009703947432026]
This work explores building articulation models of indoor scenes through a robot's purposeful interactions.
We introduce an interactive perception approach to this task.
We demonstrate the effectiveness of our approach in both simulation and real-world scenes.
arXiv Detail & Related papers (2023-02-02T18:22:00Z)
- Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions [74.63313641583602]
This paper studies the task of grasping any object from known categories using free-form language instructions.
We bring language understanding and robotic grasping together on this open challenge, which is essential to human-robot interaction.
We propose a language-guided 6-DoF category-level object localization model to achieve robotic grasping by comprehending human intention.
arXiv Detail & Related papers (2022-05-09T04:25:14Z)
- Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
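Grouping pixels by independent motion rather than appearance can be pictured with a toy optical-flow clustering step; this is a conceptual illustration of the motion cue only, not the paper's method.

```python
# Toy motion cue for object discovery: cluster optical-flow vectors so
# that independently moving regions separate from the background.
import numpy as np
from sklearn.cluster import KMeans

def motion_segments(flow: np.ndarray, k: int = 3) -> np.ndarray:
    """Cluster an HxWx2 optical-flow field into k motion groups.

    Returns an HxW label map; the largest cluster is typically background
    or camera motion, the rest are candidate moving objects.
    """
    h, w, _ = flow.shape
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(flow.reshape(-1, 2))
    return labels.reshape(h, w)
```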
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.