Learning Object Permanence from Video
- URL: http://arxiv.org/abs/2003.10469v4
- Date: Thu, 16 Jul 2020 09:16:04 GMT
- Title: Learning Object Permanence from Video
- Authors: Aviv Shamsian, Ofri Kleinfeld, Amir Globerson, Gal Chechik
- Abstract summary: This paper introduces the setup of learning Object Permanence from data.
We explain why this learning problem should be dissected into four components, where objects are (1) visible, (2) occluded, (3) contained by another object, and (4) carried by a containing object.
We then present a unified deep architecture that learns to predict object location under these four scenarios.
- Score: 46.34427538905761
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object Permanence allows people to reason about the location of non-visible
objects, by understanding that they continue to exist even when not perceived
directly. Object Permanence is critical for building a model of the world,
since objects in natural visual scenes dynamically occlude and contain each
other. Intensive studies in developmental psychology suggest that object
permanence is a challenging task that is learned through extensive experience.
Here we introduce the setup of learning Object Permanence from data. We explain
why this learning problem should be dissected into four components, where
objects are (1) visible, (2) occluded, (3) contained by another object and (4)
carried by a containing object. The fourth subtask, where a target object is
carried by a containing object, is particularly challenging because it requires
a system to reason about a moving location of an invisible object. We then
present a unified deep architecture that learns to predict object location
under these four scenarios. We evaluate the architecture and system on a new
dataset based on CATER, and find that it outperforms previous localization
methods and various baselines.
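To make the prediction task concrete, the following is a minimal PyTorch sketch of one way such a unified architecture could be organized: a per-frame visual encoder feeds a recurrent module that regresses the target object's bounding box in every frame, including frames where the target is occluded, contained, or carried. The class and parameter names (PermanenceLocalizer, hidden_dim, the layer sizes) are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (not the paper's exact architecture): a per-frame CNN
# encoder feeds an LSTM that regresses the target's bounding box in every
# frame, including frames where the target is occluded, contained, or carried.
import torch
import torch.nn as nn

class PermanenceLocalizer(nn.Module):
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        # Per-frame encoder; per-object detector features could be used instead.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, hidden_dim), nn.ReLU(),
        )
        # The recurrent state carries the target through invisible frames.
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Predict (x, y, w, h) for the target object in each frame.
        self.box_head = nn.Linear(hidden_dim, 4)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) -> boxes: (batch, time, 4)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.lstm(feats)
        return self.box_head(hidden)

model = PermanenceLocalizer()
video = torch.randn(2, 16, 3, 64, 64)  # two clips of 16 frames each
boxes = model(video)                   # (2, 16, 4): one box per frame
```

Supervising the box in every frame, visible or not, is what would force the recurrent state to carry the target through occlusion and containment; the carried case additionally requires the state to follow the container's motion.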
Related papers
- AffordanceLLM: Grounding Affordance from Vision Language Models [36.97072698640563]
Affordance grounding refers to the task of finding the area of an object with which one can interact.
Much of the relevant knowledge is hidden, lying beyond the image content and the supervised labels of a limited training set.
We attempt to improve the generalization capability of current affordance grounding by taking advantage of rich world, abstract, and human-object-interaction knowledge.
arXiv Detail & Related papers (2024-01-12T03:21:02Z)
- The Background Also Matters: Background-Aware Motion-Guided Objects Discovery [2.6442319761949875]
We propose a Background-aware Motion-guided Objects Discovery method.
We leverage masks of moving objects extracted from optical flow and design a learning mechanism to extend them to the true foreground.
This enables joint learning of the object discovery task and object/non-object separation.
arXiv Detail & Related papers (2023-11-05T12:35:47Z)
- Finding Fallen Objects Via Asynchronous Audio-Visual Integration [89.75296559813437]
This paper introduces a setting in which to study multi-modal object localization in 3D virtual environments.
An embodied robot agent, equipped with a camera and microphone, must determine what object has been dropped -- and where -- by combining audio and visual signals with knowledge of the underlying physics.
The dataset uses the ThreeDWorld platform which can simulate physics-based impact sounds and complex physical interactions between objects in a photorealistic setting.
arXiv Detail & Related papers (2022-07-07T17:59:59Z)
- Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
- Bi-directional Object-context Prioritization Learning for Saliency Ranking [60.62461793691836]
Existing approaches focus on learning either object-object or object-scene relations.
We observe that spatial attention works concurrently with object-based attention in the human visual recognition system.
We propose a novel bi-directional method to unify spatial attention and object-based attention for saliency ranking.
arXiv Detail & Related papers (2022-03-17T16:16:03Z)
- SafePicking: Learning Safe Object Extraction via Object-Level Mapping [19.502587411252946]
We present a system, SafePicking, that integrates object-level mapping and learning-based motion planning.
Planning is done by a learned deep Q-network that receives observations of predicted object poses and a depth-based heightmap and outputs a motion trajectory.
Our results show that fusing pose observations with depth sensing improves both the performance and the robustness of the model; a minimal sketch of this fusion pattern appears after this list.
arXiv Detail & Related papers (2022-02-11T18:55:10Z)
- Contrastive Object Detection Using Knowledge Graph Embeddings [72.17159795485915]
We compare the error statistics of the class embeddings learned from a one-hot approach with semantically structured embeddings from natural language processing or knowledge graphs.
We propose a knowledge-embedded design for keypoint-based and transformer-based object detection architectures; a toy sketch of the embedding idea also follows this list.
arXiv Detail & Related papers (2021-12-21T17:10:21Z)
- Few-shot Object Grounding and Mapping for Natural Language Robot Instruction Following [15.896892723068932]
We study the problem of learning a robot policy to follow natural language instructions that can be easily extended to reason about new objects.
We introduce a few-shot language-conditioned object grounding method trained from augmented reality data.
We present a learned map representation that encodes object locations and their instructed use, and construct it from our few-shot grounding output.
arXiv Detail & Related papers (2020-11-14T20:35:20Z)
- Look-into-Object: Self-supervised Structure Modeling for Object Recognition [71.68524003173219]
We propose to "look into object" (explicitly yet intrinsically model the object structure) by incorporating self-supervision.
We show that the recognition backbone can be substantially enhanced for more robust representation learning.
Our approach achieves large performance gains on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft).
arXiv Detail & Related papers (2020-03-31T12:22:51Z)
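As referenced in the SafePicking entry above, here is a minimal sketch of the described fusion pattern: a Q-network that combines predicted object poses with a depth-based heightmap and scores a discrete set of candidate motions. The layer sizes, the pose encoding (position plus quaternion per object), and the discrete action space are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of the fusion pattern from the SafePicking summary: a
# Q-network that combines predicted object poses with a depth heightmap
# and outputs one Q-value per candidate motion. All sizes are illustrative.
import torch
import torch.nn as nn

class FusionQNetwork(nn.Module):
    def __init__(self, num_objects: int = 8, num_actions: int = 6):
        super().__init__()
        # Encode per-object poses: position (3) + quaternion (4) per object.
        self.pose_mlp = nn.Sequential(nn.Linear(num_objects * 7, 128), nn.ReLU())
        # Encode the depth-based heightmap of the scene.
        self.heightmap_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fuse both observation streams and score each candidate action.
        self.q_head = nn.Sequential(
            nn.Linear(128 + 32, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, poses: torch.Tensor, heightmap: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.pose_mlp(poses.flatten(1)),
                           self.heightmap_cnn(heightmap)], dim=1)
        return self.q_head(fused)  # (batch, num_actions)

q_net = FusionQNetwork()
q_values = q_net(torch.randn(1, 8, 7), torch.randn(1, 1, 64, 64))
action = q_values.argmax(dim=1)  # greedy action selection
```

Concatenating the two encoded streams before the Q-head is the simplest fusion choice; the summary's finding that pose and depth observations are complementary is what motivates feeding both.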
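Likewise, for the knowledge-graph entry above, here is a toy sketch of replacing one-hot class logits with scores against fixed, semantically structured class embeddings. The random embedding matrix is a placeholder for word-vector or knowledge-graph embeddings, and the head design is illustrative rather than the paper's.

```python
# Toy sketch of the semantic-embedding idea: project each detection feature
# into an embedding space and score it by cosine similarity against fixed
# class embeddings (in practice, word vectors or knowledge-graph embeddings).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticClassHead(nn.Module):
    def __init__(self, feat_dim: int, class_embeddings: torch.Tensor):
        super().__init__()
        # Fixed, semantically structured class embeddings (num_classes, emb_dim).
        self.register_buffer("class_emb", F.normalize(class_embeddings, dim=1))
        self.proj = nn.Linear(feat_dim, class_embeddings.shape[1])
        self.temperature = 0.07  # sharpness of the contrastive softmax

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between projected features and class embeddings.
        z = F.normalize(self.proj(feats), dim=1)
        return z @ self.class_emb.t() / self.temperature  # (batch, num_classes)

# Placeholder standing in for real word-vector / knowledge-graph embeddings.
emb = torch.randn(80, 300)
head = SemanticClassHead(feat_dim=256, class_embeddings=emb)
logits = head(torch.randn(4, 256))  # scores for 4 detections over 80 classes
```

Because the class embeddings are fixed and semantically structured, confusions tend to fall between semantically close classes, which is the kind of error statistic the entry compares against one-hot training.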
This list is automatically generated from the titles and abstracts of the papers on this site.