Towards Embodied Scene Description
- URL: http://arxiv.org/abs/2004.14638v2
- Date: Thu, 7 May 2020 09:14:37 GMT
- Title: Towards Embodied Scene Description
- Authors: Sinan Tan, Huaping Liu, Di Guo, Xinyu Zhang, Fuchun Sun
- Abstract summary: Embodiment is an important characteristic for all intelligent agents (creatures and robots).
We propose Embodied Scene Description, which exploits the embodiment ability of the agent to find an optimal viewpoint in its environment for scene description tasks.
A learning framework combining the paradigms of imitation learning and reinforcement learning is established to teach the intelligent agent to generate the corresponding sensorimotor activities.
- Score: 36.17224570332247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embodiment is an important characteristic for all intelligent agents
(creatures and robots), yet existing scene description tasks mainly analyze
images passively, separating the semantic understanding of the scenario from
the interaction between the agent and the environment. In this work, we
propose Embodied Scene Description, which exploits the embodiment ability of
the agent to find an optimal viewpoint in its environment for scene
description tasks. A learning framework combining the paradigms of imitation
learning and reinforcement learning is established to teach the intelligent
agent to generate the corresponding sensorimotor activities. The proposed
framework is tested both in the AI2Thor simulated environment and on a
real-world robotic platform, demonstrating the effectiveness and extensibility
of the developed method.
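The abstract outlines a two-stage training recipe: imitation learning to bootstrap a viewpoint-selection policy from expert demonstrations, then reinforcement learning to refine it with a scene-description objective. The following is a minimal, hypothetical sketch of that recipe, not the authors' implementation; the network shape, action set, feature dimension, and reward_fn (standing in for a caption-quality score such as CIDEr) are all assumptions made for illustration.

```python
# Hypothetical sketch of imitation-learning pretraining followed by
# REINFORCE fine-tuning for viewpoint selection. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ACTIONS = 6    # assumed action set, e.g. move/turn/look commands
FEAT_DIM = 512   # assumed dimensionality of a visual feature vector

class ViewpointPolicy(nn.Module):
    """Maps a visual observation to log-probabilities over movement actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, N_ACTIONS))

    def forward(self, feat):
        return F.log_softmax(self.net(feat), dim=-1)

policy = ViewpointPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def imitation_step(feats, expert_actions):
    """Stage 1: behavior cloning -- cross-entropy against expert actions."""
    loss = F.nll_loss(policy(feats), expert_actions)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def reinforce_step(feats, reward_fn):
    """Stage 2: REINFORCE -- reward_fn scores the caption produced from the
    viewpoint each sampled action leads to (a caption-quality proxy)."""
    log_probs = policy(feats)
    actions = torch.multinomial(log_probs.exp(), 1).squeeze(1)
    rewards = reward_fn(actions)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Subtract the batch-mean reward as a simple variance-reduction baseline.
    loss = -((rewards - rewards.mean()) * chosen).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Dummy tensors illustrating the interfaces only.
feats = torch.randn(8, FEAT_DIM)
imitation_step(feats, torch.randint(N_ACTIONS, (8,)))
reinforce_step(feats, lambda a: torch.rand(a.shape[0]))
```

In practice, feats would come from a visual encoder over the agent's camera view, and reward_fn would score the caption generated from the viewpoint reached by each sampled action.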
Related papers
- Self-Explainable Affordance Learning with Embodied Caption [63.88435741872204]
We introduce Self-Explainable Affordance Learning (SEA) with embodied caption.
SEA enables robots to articulate their intentions and bridge the gap between explainable vision-language captioning and visual affordance learning.
We propose a novel model that combines affordance grounding with self-explanation in a simple yet efficient manner.
arXiv Detail & Related papers (2024-04-08T15:22:38Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Knowledge-enhanced Agents for Interactive Text Games [16.055119735473017]
We propose a knowledge-injection framework for improved functional grounding of agents in text-based games.
We consider two forms of domain knowledge that we inject into learning-based agents: memory of previous correct actions and affordances of relevant objects in the environment.
Our framework supports two representative model classes: reinforcement learning agents and language model agents; a minimal sketch of the injection step appears after this list.
arXiv Detail & Related papers (2023-05-08T23:31:39Z)
- Object-Centric Scene Representations using Active Inference [4.298360054690217]
Representing a scene and its constituent objects from raw sensory data is a core ability for enabling robots to interact with their environment.
We propose a novel approach for scene understanding, leveraging a hierarchical object-centric generative model that enables an agent to infer object categories.
For evaluating the behavior of an active vision agent, we also propose a new benchmark where, given a target viewpoint of a particular object, the agent needs to find the best matching viewpoint.
arXiv Detail & Related papers (2023-02-07T06:45:19Z)
- Embodied Agents for Efficient Exploration and Smart Scene Description [47.82947878753809]
We tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment.
We propose and evaluate an approach that combines recent advances in visual robotic exploration with image captioning.
Our approach can generate smart scene descriptions that maximize semantic knowledge of the environment and avoid repetitions.
arXiv Detail & Related papers (2023-01-17T19:28:01Z)
- Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z)
- A Dynamic Data Driven Approach for Explainable Scene Understanding [0.0]
Scene understanding is an important topic in computer vision.
We consider the active explanation-driven understanding and classification of scenes.
Our framework is entitled ACUMEN: Active Classification and Understanding Method by Explanation-driven Networks.
arXiv Detail & Related papers (2022-06-18T02:41:51Z)
- Stochastic Coherence Over Attention Trajectory For Continuous Learning In Video Streams [64.82800502603138]
This paper proposes a novel neural-network-based approach to progressively and autonomously develop pixel-wise representations in a video stream.
The proposed method is based on a human-like attention mechanism that allows the agent to learn by observing what is moving in the attended locations.
Our experiments leverage 3D virtual environments and show that the proposed agents can learn to distinguish objects just by observing the video stream.
arXiv Detail & Related papers (2022-04-26T09:52:31Z)
- Learning intuitive physics and one-shot imitation using state-action-prediction self-organizing maps [0.0]
Humans learn by exploration and imitation, build causal models of the world, and use both to flexibly solve new tasks.
We suggest a simple but effective unsupervised model which develops such characteristics.
We demonstrate its performance on a set of several related but distinct one-shot imitation tasks, which the agent flexibly solves in an active inference style.
arXiv Detail & Related papers (2020-07-03T12:29:11Z)
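Among the entries above, the knowledge-injection idea (memory of previously correct actions plus object affordances) is concrete enough to sketch. The toy example below is hypothetical, not the paper's code: the AFFORDANCES table and the class and method names are invented to illustrate folding both knowledge sources into the observation an agent conditions on.

```python
# Hypothetical sketch of knowledge injection for a text-game agent.
from dataclasses import dataclass, field

# Toy affordance table; a real system might query a knowledge base instead.
AFFORDANCES = {
    "key": ["take", "use"],
    "door": ["open", "unlock"],
    "lamp": ["turn on"],
}

@dataclass
class KnowledgeAugmentedAgent:
    # Memory of previously correct actions: the first injected knowledge form.
    memory: list = field(default_factory=list)

    def build_prompt(self, observation: str) -> str:
        # Affordances of objects mentioned in the scene: the second form.
        mentioned = [o for o in AFFORDANCES if o in observation]
        hints = "; ".join(f"{o}: {', '.join(AFFORDANCES[o])}" for o in mentioned)
        past = "; ".join(self.memory) or "none"
        return (f"Observation: {observation}\n"
                f"Affordances: {hints or 'none'}\n"
                f"Correct actions so far: {past}")

    def record_success(self, action: str) -> None:
        self.memory.append(action)

agent = KnowledgeAugmentedAgent()
print(agent.build_prompt("You see a locked door and a rusty key."))
agent.record_success("take key")
print(agent.build_prompt("You stand before the locked door."))
```

A reinforcement learning agent would encode the augmented observation as state features, while a language model agent could consume the string directly as part of its prompt.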
This list is automatically generated from the titles and abstracts of the papers on this site.