Learning Object Permanence from Video
- URL: http://arxiv.org/abs/2003.10469v4
- Date: Thu, 16 Jul 2020 09:16:04 GMT
- Title: Learning Object Permanence from Video
- Authors: Aviv Shamsian, Ofri Kleinfeld, Amir Globerson, Gal Chechik
- Abstract summary: This paper introduces the setup of learning Object Permanence from data.
We explain why this learning problem should be dissected into four components, where objects are (1) visible, (2) occluded, (3) contained by another object, and (4) carried by a containing object.
We then present a unified deep architecture that learns to predict object location under these four scenarios.
- Score: 46.34427538905761
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object Permanence allows people to reason about the location of non-visible
objects, by understanding that they continue to exist even when not perceived
directly. Object Permanence is critical for building a model of the world,
since objects in natural visual scenes dynamically occlude and contain
each other. Intensive studies in developmental psychology suggest that object
permanence is a challenging task that is learned through extensive experience.
Here we introduce the setup of learning Object Permanence from data. We explain
why this learning problem should be dissected into four components, where
objects are (1) visible, (2) occluded, (3) contained by another object and (4)
carried by a containing object. The fourth subtask, where a target object is
carried by a containing object, is particularly challenging because it requires
a system to reason about a moving location of an invisible object. We then
present a unified deep architecture that learns to predict object location
under these four scenarios. We evaluate the architecture and system on a new
dataset based on CATER, and find that it outperforms previous localization
methods and various baselines.
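The abstract describes the unified architecture only at a high level. As a rough illustration of one way such a localizer could be structured (not the authors' implementation), the sketch below feeds per-frame object detections and visibility flags into a recurrent network that regresses the target's bounding box at every frame, so it can keep producing a location while the target is occluded, contained, or carried. All module names, shapes, and dimensions are assumptions made for the example.

```python
# Hypothetical sketch (PyTorch): a recurrent localizer that predicts the
# target's bounding box from per-frame object detections, so it can keep
# estimating a location when the target is occluded, contained, or carried.
# Names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class RecurrentLocalizer(nn.Module):
    def __init__(self, num_objects=10, box_dim=4, hidden_dim=256):
        super().__init__()
        # Each frame is summarized by the detected boxes of all objects
        # (zeros for undetected ones) plus a per-object visibility flag.
        frame_dim = num_objects * (box_dim + 1)
        self.encoder = nn.Sequential(
            nn.Linear(frame_dim, hidden_dim), nn.ReLU(),
        )
        # The LSTM carries the target's state through frames where it is unseen.
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Regress the target bounding box (x, y, w, h) at every time step.
        self.box_head = nn.Linear(hidden_dim, box_dim)

    def forward(self, detections, visibility):
        # detections: (batch, time, num_objects, box_dim)
        # visibility: (batch, time, num_objects, 1)
        b, t = detections.shape[:2]
        frames = torch.cat([detections, visibility], dim=-1).reshape(b, t, -1)
        hidden, _ = self.lstm(self.encoder(frames))
        return self.box_head(hidden)  # (batch, time, box_dim)


if __name__ == "__main__":
    model = RecurrentLocalizer()
    dets = torch.rand(2, 30, 10, 4)            # 2 clips, 30 frames, 10 objects
    vis = torch.randint(0, 2, (2, 30, 10, 1)).float()
    print(model(dets, vis).shape)              # torch.Size([2, 30, 4])
```

Training such a model on the four scenarios would only require per-frame ground-truth boxes for the target, which a synthetic dataset such as the CATER-based one mentioned above can supply.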
Related papers
- Interacted Object Grounding in Spatio-Temporal Human-Object Interactions [70.8859442754261]
We introduce a new open-world benchmark: Grounding Interacted Objects (GIO).
An object grounding task is proposed in which vision systems are expected to discover interacted objects.
We propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos.
arXiv Detail & Related papers (2024-12-27T09:08:46Z)
- AffordanceLLM: Grounding Affordance from Vision Language Models [36.97072698640563]
Affordance grounding refers to the task of finding the area of an object with which one can interact.
Much of the needed knowledge is hidden and lies beyond the image content and the supervised labels of a limited training set.
We attempt to improve the generalization capability of current affordance grounding by taking advantage of rich world, abstract, and human-object-interaction knowledge.
arXiv Detail & Related papers (2024-01-12T03:21:02Z)
- The Background Also Matters: Background-Aware Motion-Guided Objects Discovery [2.6442319761949875]
We propose a Background-aware Motion-guided Objects Discovery method.
We leverage masks of moving objects extracted from optical flow and design a learning mechanism to extend them to the true foreground.
This enables joint learning of the object discovery task and the object/non-object separation.
arXiv Detail & Related papers (2023-11-05T12:35:47Z)
- Finding Fallen Objects Via Asynchronous Audio-Visual Integration [89.75296559813437]
This paper introduces a setting in which to study multi-modal object localization in 3D virtual environments.
An embodied robot agent, equipped with a camera and microphone, must determine what object has been dropped -- and where -- by combining audio and visual signals with knowledge of the underlying physics.
The dataset uses the ThreeDWorld platform which can simulate physics-based impact sounds and complex physical interactions between objects in a photorealistic setting.
arXiv Detail & Related papers (2022-07-07T17:59:59Z)
- Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
- Bi-directional Object-context Prioritization Learning for Saliency Ranking [60.62461793691836]
Existing approaches focus on learning either object-object or object-scene relations.
We observe that spatial attention works concurrently with object-based attention in the human visual recognition system.
We propose a novel bi-directional method to unify spatial attention and object-based attention for saliency ranking.
arXiv Detail & Related papers (2022-03-17T16:16:03Z)
- SafePicking: Learning Safe Object Extraction via Object-Level Mapping [19.502587411252946]
We present a system, SafePicking, that integrates object-level mapping and learning-based motion planning.
Planning is done by learning a deep Q-network that receives observations of predicted poses and a depth-based heightmap to output a motion trajectory.
Our results show that fusing pose observations with depth sensing improves both the performance and the robustness of the model; a rough sketch of such a Q-network is given after this list.
arXiv Detail & Related papers (2022-02-11T18:55:10Z)
- Contrastive Object Detection Using Knowledge Graph Embeddings [72.17159795485915]
We compare the error statistics of the class embeddings learned from a one-hot approach with semantically structured embeddings from natural language processing or knowledge graphs.
We propose a knowledge-embedded design for keypoint-based and transformer-based object detection architectures.
arXiv Detail & Related papers (2021-12-21T17:10:21Z)
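The SafePicking entry above mentions a deep Q-network that fuses predicted object poses with a depth-based heightmap to score candidate motions. The sketch below is a hypothetical illustration of such a network, not the authors' implementation; the input shapes, layer sizes, and the use of a discrete action set are all assumptions made for the example.

```python
# Hypothetical sketch (PyTorch) of a Q-network in the spirit of SafePicking:
# it fuses predicted object poses with a depth heightmap and scores a discrete
# set of candidate extraction motions. Shapes, layer sizes, and the action
# encoding are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class PoseHeightmapQNet(nn.Module):
    def __init__(self, num_objects=8, pose_dim=7, num_actions=6, hidden=256):
        super().__init__()
        # Convolutional branch for the depth heightmap (1 x 64 x 64 assumed).
        self.heightmap_net = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, hidden), nn.ReLU(),
        )
        # MLP branch for the predicted object poses (position + quaternion).
        self.pose_net = nn.Sequential(
            nn.Linear(num_objects * pose_dim, hidden), nn.ReLU(),
        )
        # Fuse both observations and output one Q-value per candidate motion.
        self.q_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, heightmap, poses):
        # heightmap: (batch, 1, 64, 64); poses: (batch, num_objects, pose_dim)
        h = self.heightmap_net(heightmap)
        p = self.pose_net(poses.flatten(start_dim=1))
        return self.q_head(torch.cat([h, p], dim=-1))  # (batch, num_actions)


if __name__ == "__main__":
    net = PoseHeightmapQNet()
    q = net(torch.rand(2, 1, 64, 64), torch.rand(2, 8, 7))
    print(q.shape)  # torch.Size([2, 6])
```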