Continuous Scene Representations for Embodied AI
- URL: http://arxiv.org/abs/2203.17251v1
- Date: Thu, 31 Mar 2022 17:55:33 GMT
- Title: Continuous Scene Representations for Embodied AI
- Authors: Samir Yitzhak Gadre, Kiana Ehsani, Shuran Song, Roozbeh Mottaghi
- Abstract summary: Continuous Scene Representations (CSR) is a scene representation constructed by an embodied agent navigating within a space.
Our key insight is to embed pair-wise relationships between objects in a latent space.
CSR can track objects as the agent moves in a scene, update the representation accordingly, and detect changes in room configurations.
- Score: 33.00565252990522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Continuous Scene Representations (CSR), a scene representation
constructed by an embodied agent navigating within a space, where objects and
their relationships are modeled by continuous valued embeddings. Our method
captures feature relationships between objects, composes them into a graph
structure on-the-fly, and situates an embodied agent within the representation.
Our key insight is to embed pair-wise relationships between objects in a latent
space. This allows for a richer representation compared to discrete relations
(e.g., [support], [next-to]) commonly used for building scene representations.
CSR can track objects as the agent moves in a scene, update the representation
accordingly, and detect changes in room configurations. Using CSR, we
outperform state-of-the-art approaches for the challenging downstream task of
visual room rearrangement, without any task-specific training. Moreover, we
show the learned embeddings capture salient spatial details of the scene and
show applicability to real-world data. A summary video and code are available at
https://prior.allenai.org/projects/csr.
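To make the key insight concrete, below is a minimal PyTorch sketch of embedding pairwise object relationships as continuous vectors and composing them into a graph on the fly. The module name, layer sizes, and feature dimensions are illustrative assumptions, not the authors' implementation (see the project page above for the real code).
```python
import torch
import torch.nn as nn

class RelationEncoder(nn.Module):
    """Maps a pair of object features to a continuous relationship embedding.

    A hypothetical stand-in for CSR's pairwise encoder; dimensions are
    illustrative, not taken from the paper.
    """
    def __init__(self, obj_dim: int = 512, rel_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * obj_dim, 256), nn.ReLU(), nn.Linear(256, rel_dim)
        )

    def forward(self, feat_i: torch.Tensor, feat_j: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([feat_i, feat_j], dim=-1))

# Compose a scene graph on the fly: nodes are object features, edges are
# continuous relationship embeddings rather than discrete labels
# such as [support] or [next-to].
encoder = RelationEncoder()
objects = torch.randn(4, 512)          # e.g. CNN features for 4 detected objects
edges = {
    (i, j): encoder(objects[i], objects[j])
    for i in range(len(objects)) for j in range(len(objects)) if i != j
}
print(edges[(0, 1)].shape)             # torch.Size([128])
```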
Related papers
- Open-Vocabulary Octree-Graph for 3D Scene Understanding [54.11828083068082]
Octree-Graph is a novel scene representation for open-vocabulary 3D scene understanding.
An adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape.
arXiv Detail & Related papers (2024-11-25T10:14:10Z)
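As a rough illustration of the adaptive-octree idea above, the sketch below stores per-cell occupancy and a semantic label and subdivides a cell only when it is partially occupied. The class, fields, and refinement test are hypothetical, not the paper's implementation.
```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class OctreeNode:
    """Axis-aligned cube storing occupancy and an open-vocabulary label.

    Illustrative only: the real Octree-Graph stores learned semantic
    features and uses shape-aware criteria for subdivision.
    """
    center: tuple[float, float, float]
    size: float
    occupancy: float = 0.0            # fraction of the cell occupied
    semantics: str | None = None      # e.g. an open-vocabulary label
    children: list[OctreeNode] = field(default_factory=list)

    def subdivide(self) -> None:
        """Split into 8 octants so complex shapes get finer cells."""
        half = self.size / 2
        cx, cy, cz = self.center
        for dx in (-1, 1):
            for dy in (-1, 1):
                for dz in (-1, 1):
                    self.children.append(OctreeNode(
                        (cx + dx * half / 2, cy + dy * half / 2, cz + dz * half / 2),
                        half,
                    ))

root = OctreeNode((0.0, 0.0, 0.0), size=2.0, occupancy=0.4, semantics="chair")
if 0.05 < root.occupancy < 0.95:      # partially occupied: refine adaptively
    root.subdivide()
print(len(root.children))             # 8
```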
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- Situational Scene Graph for Structured Human-centric Situation Understanding [15.91717913059569]
We propose a graph-based representation called Situational Scene Graph (SSG) to encode both human-object relationships and the corresponding semantic properties.
The semantic details are represented as predefined roles and values inspired by situation frames, which were originally designed to represent a single action.
We will release the code and the dataset soon.
arXiv Detail & Related papers (2024-10-30T09:11:25Z)
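A minimal sketch of the data structure the SSG entry above describes: a human-object relationship edge annotated with situation-frame-style roles and values. The role names here are hypothetical, not the paper's schema.
```python
# A human-object relationship edge carrying semantic properties as
# predefined role/value pairs (roles below are illustrative only).
ssg_edge = {
    "subject": "person_1",
    "object": "cup_3",
    "predicate": "holding",
    "roles": {                 # situation-frame-style role/value pairs
        "tool": "right_hand",
        "place": "kitchen",
        "goal": "drinking",
    },
}
print(f'{ssg_edge["subject"]} -{ssg_edge["predicate"]}-> {ssg_edge["object"]}')
```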
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z)
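A schematic sketch of the two-branch pipeline described above, with hypothetical module names and dimensions: a variational branch maps scene-graph node features to per-object layout boxes, while a stand-in for the shape branch (a diffusion model in the paper) returns shape latents.
```python
import torch
import torch.nn as nn

class LayoutVAE(nn.Module):
    """Branch 1 (illustrative): encodes scene-graph nodes, decodes boxes."""
    def __init__(self, graph_dim: int = 64, latent_dim: int = 32):
        super().__init__()
        self.to_stats = nn.Linear(graph_dim, 2 * latent_dim)
        self.decode = nn.Linear(latent_dim, 7)   # box: xyz, whd, yaw

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.to_stats(node_feats).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.decode(z)

def shape_branch(node_feats: torch.Tensor) -> torch.Tensor:
    """Branch 2 stand-in: the paper's diffusion model would denoise latents
    into compatible shapes; here we just return placeholder shape latents."""
    return torch.randn(node_feats.shape[0], 128)

graph_nodes = torch.randn(5, 64)       # 5 objects from an edited scene graph
boxes = LayoutVAE()(graph_nodes)       # per-object layout
shapes = shape_branch(graph_nodes)     # per-object shape latents
print(boxes.shape, shapes.shape)       # torch.Size([5, 7]) torch.Size([5, 128])
```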
- Hyperbolic Contrastive Learning for Visual Representations beyond Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z)
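Such hierarchies are typically realized with hyperbolic embeddings. Assuming a Poincare-ball parameterization (a standard choice for hierarchy-preserving representations, not necessarily the paper's exact setup), the geodesic distance is d(u, v) = arccosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2))), as sketched below.
```python
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Geodesic distance in the Poincare ball: abstract "scene" points sit
    near the origin, concrete "object" points toward the boundary."""
    sq = torch.sum((u - v) ** 2, dim=-1)
    den = (1 - torch.sum(u ** 2, dim=-1)) * (1 - torch.sum(v ** 2, dim=-1))
    return torch.acosh(1 + 2 * sq / den)

scene = torch.tensor([0.05, 0.0])     # scene embedding near the origin
obj = torch.tensor([0.7, 0.1])        # object embedding nearer the rim
print(poincare_distance(scene, obj))
```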
- SIMONe: View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition [69.90530987240899]
We present an unsupervised variational approach to this problem.
Our model learns to infer two sets of latent representations from RGB video input alone.
It represents object attributes in an allocentric manner which does not depend on viewpoint.
arXiv Detail & Related papers (2021-06-07T17:59:23Z)
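A toy sketch of the factorization described above, with illustrative shapes and a simplified combination rule: K per-object latents shared across all frames (allocentric, so independent of viewpoint) and T per-frame latents carrying viewpoint information.
```python
import torch

# SIMONe-style factorization (shapes are illustrative): a video's latents
# split into K per-object codes shared across frames and T per-frame codes.
T, K, D = 8, 4, 32                    # frames, object slots, latent size
object_latents = torch.randn(K, D)    # one code per object, constant in time
frame_latents = torch.randn(T, D)     # one code per frame (viewpoint etc.)

# A decoder would combine every (object, frame) pair to render each frame;
# simple addition stands in for that conditioning here.
pairs = object_latents[None, :, :].expand(T, K, D) + frame_latents[:, None, :]
print(pairs.shape)                    # torch.Size([8, 4, 32])
```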
- Transformed ROIs for Capturing Visual Transformations in Videos [31.88528313257094]
We present TROI, a plug-and-play module for CNNs to reason between mid-level feature representations that are otherwise separated in space and time.
We achieve state-of-the-art action recognition results on the large-scale datasets Something-Something-V2 and Epic-Kitchens-100.
arXiv Detail & Related papers (2021-06-06T15:59:53Z)
- SceneGen: Generative Contextual Scene Augmentation using Scene Graph Priors [3.1969855247377827]
We introduce SceneGen, a generative contextual augmentation framework that predicts virtual object positions and orientations within existing scenes.
SceneGen takes a semantically segmented scene as input, and outputs positional and orientational probability maps for placing virtual content.
We formulate a novel spatial Scene Graph representation, which encapsulates explicit topological properties between objects, object groups, and the room.
To demonstrate our system in action, we develop an Augmented Reality application, in which objects can be contextually augmented in real-time.
arXiv Detail & Related papers (2020-09-25T18:36:27Z)
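A toy sketch of the probability-map interface described above: the model's positional map is faked with random numbers here, and the placement rule (argmax cell plus a sampled yaw) is an illustrative assumption rather than SceneGen's actual sampling scheme.
```python
import numpy as np

# Stand-in for SceneGen's output: for a given object category, a heat map
# over the floor plan scoring where the object plausibly belongs.
rng = np.random.default_rng(0)
position_prob = rng.random((64, 64))           # fake model output
position_prob /= position_prob.sum()           # normalize to a distribution

# Place the virtual object at the most probable cell and sample a yaw
# (a real system would use the predicted orientational map instead).
iy, ix = np.unravel_index(position_prob.argmax(), position_prob.shape)
yaw = rng.uniform(0, 2 * np.pi)
print(f"place object at cell ({ix}, {iy}) with yaw {yaw:.2f} rad")
```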
- ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
arXiv Detail & Related papers (2020-08-14T09:11:18Z)
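A minimal sketch of the visual-semantic joint embedding scoring described above; the projection layers, feature dimensions, and inputs are assumptions, not ConsNet's actual architecture.
```python
import torch
import torch.nn.functional as F

# Project visual features of a candidate human-object pair and the word
# embeddings of an HOI label (e.g. <human, ride, horse>) into one joint
# space, then score the candidate by similarity (dimensions are assumed).
visual_proj = torch.nn.Linear(1024, 256)   # pair appearance features -> joint
label_proj = torch.nn.Linear(300, 256)     # label word embeddings -> joint

pair_feat = torch.randn(1, 1024)
label_emb = torch.randn(1, 300)
score = F.cosine_similarity(visual_proj(pair_feat), label_proj(label_emb))
print(score)  # higher similarity => more likely this HOI instance
```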
- RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces [77.07767833443256]
We present RELATE, a model that learns to generate physically plausible scenes and videos of multiple interacting objects.
In contrast to state-of-the-art methods in object-centric generative modeling, RELATE also extends naturally to dynamic scenes and generates videos of high visual fidelity.
arXiv Detail & Related papers (2020-07-02T17:27:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.