Related papers: Acquisition through My Eyes and Steps: A Joint Predictive Agent Model in Egocentric Worlds

Acquisition through My Eyes and Steps: A Joint Predictive Agent Model in Egocentric Worlds

URL: http://arxiv.org/abs/2502.05857v1
Date: Sun, 09 Feb 2025 11:28:57 GMT
Title: Acquisition through My Eyes and Steps: A Joint Predictive Agent Model in Egocentric Worlds
Authors: Lu Chen, Yizhou Wang, Shixiang Tang, Qianhong Ma, Tong He, Wanli Ouyang, Xiaowei Zhou, Hujun Bao, Sida Peng,
Abstract summary: This paper addresses the task of learning an agent model behaving like humans, which can jointly perceive, predict, and act in egocentric worlds.<n>We propose a joint predictive agent model, named EgoAgent, that simultaneously learns to represent the world, predict future states, and take reasonable actions with a single transformer.
Score: 107.62381002403814
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper addresses the task of learning an agent model behaving like humans, which can jointly perceive, predict, and act in egocentric worlds. Previous methods usually train separate models for these three abilities, leading to information silos among them, which prevents these abilities from learning from each other and collaborating effectively. In this paper, we propose a joint predictive agent model, named EgoAgent, that simultaneously learns to represent the world, predict future states, and take reasonable actions with a single transformer. EgoAgent unifies the representational spaces of the three abilities by mapping them all into a sequence of continuous tokens. Learnable query tokens are appended to obtain current states, future states, and next actions. With joint supervision, our agent model establishes the internal relationship among these three abilities and effectively mimics the human inference and learning processes. Comprehensive evaluations of EgoAgent covering image classification, egocentric future state prediction, and 3D human motion prediction tasks demonstrate the superiority of our method. The code and trained model will be released for reproducibility.

Related papers

Poly-Autoregressive Prediction for Modeling Interactions [42.51313085280179]
We propose Poly-Autoregressive (PAR) modeling, which forecasts an ego agent's future behavior. We show that PAR can be applied to three different problems: human action forecasting in social situations, trajectory prediction for autonomous vehicles, and object pose forecasting during hand-object interaction.
arXiv Detail & Related papers (2025-02-12T18:59:43Z)
DIRIGENt: End-To-End Robotic Imitation of Human Demonstrations Based on a Diffusion Model [16.26334759935617]
We introduce DIRIGENt, a novel end-to-end diffusion approach to generate joint values from observing human demonstrations.<n>We create a dataset in which humans imitate a robot and then use this collected data to train a diffusion model that enables a robot to imitate humans.
arXiv Detail & Related papers (2025-01-28T09:05:03Z)
Ego-Foresight: Agent Visuomotor Prediction as Regularization for RL [34.6883445484835]
Ego-Foresight is a self-supervised method for disentangling agent and environment based on motion and prediction. We show that visuomotor prediction of the agent provides regularization to the RL algorithm, by encouraging the actions to stay within predictable bounds. We integrate Ego-Foresight with a model-free RL algorithm to solve simulated robotic manipulation tasks, showing an average improvement of 23% in efficiency and 8% in performance.
arXiv Detail & Related papers (2024-05-27T13:32:43Z)
Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration. In this work, we tackle the task of reconstructing closely interactive humans from a monocular video. We propose to leverage knowledge from proxemic behavior and physics to compensate the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z)
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation [65.46592503910875]
We investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. We learn a compositional world model for multi-agent cooperation by factorizing the naturally composable joint actions of multiple agents. We evaluate our methods on three challenging benchmarks with 2-4 agents.
arXiv Detail & Related papers (2024-04-16T17:59:11Z)
Humanoid Locomotion as Next Token Prediction [84.21335675130021]
Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize commands not seen during training like walking backward.
arXiv Detail & Related papers (2024-02-29T18:57:37Z)
Interactive Autonomous Navigation with Internal State Inference and Interactivity Estimation [58.21683603243387]
We propose three auxiliary tasks with relational-temporal reasoning and integrate them into the standard Deep Learning framework. These auxiliary tasks provide additional supervision signals to infer the behavior patterns other interactive agents. Our approach achieves robust and state-of-the-art performance in terms of standard evaluation metrics.
arXiv Detail & Related papers (2023-11-27T18:57:42Z)
Improving Multimodal Interactive Agents with Reinforcement Learning from Human Feedback [16.268581985382433]
An important goal in artificial intelligence is to create agents that can both interact naturally with humans and learn from their feedback. Here we demonstrate how to use reinforcement learning from human feedback to improve upon simulated, embodied agents.
arXiv Detail & Related papers (2022-11-21T16:00:31Z)
Learning Theory of Mind via Dynamic Traits Attribution [59.9781556714202]
We propose a new neural ToM architecture that learns to generate a latent trait vector of an actor from the past trajectories. This trait vector then multiplicatively modulates the prediction mechanism via a fast weights' scheme in the prediction neural network. We empirically show that the fast weights provide a good inductive bias to model the character traits of agents and hence improves mindreading ability.
arXiv Detail & Related papers (2022-04-17T11:21:18Z)
A-ACT: Action Anticipation through Cycle Transformations [89.83027919085289]
We take a step back to analyze how the human capability to anticipate the future can be transferred to machine learning algorithms. A recent study on human psychology explains that, in anticipating an occurrence, the human brain counts on both systems. In this work, we study the impact of each system for the task of action anticipation and introduce a paradigm to integrate them in a learning framework.
arXiv Detail & Related papers (2022-04-02T21:50:45Z)
Test-time Collective Prediction [73.74982509510961]
Multiple parties in machine learning want to jointly make predictions on future test points. Agents wish to benefit from the collective expertise of the full set of agents, but may not be willing to release their data or model parameters. We explore a decentralized mechanism to make collective predictions at test time, leveraging each agent's pre-trained model.
arXiv Detail & Related papers (2021-06-22T18:29:58Z)
An active inference model of collective intelligence [0.0]
This paper posits a minimal agent-based model that simulates the relationship between local individual-level interaction and collective intelligence. Results show that stepwise cognitive transitions increase system performance by providing complementary mechanisms for alignment between agents' local and global optima.
arXiv Detail & Related papers (2021-04-02T14:32:01Z)
AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting [25.151713845738335]
We propose a new Transformer, AgentFormer, that jointly models the time and social dimensions. Based on AgentFormer, we propose a multi-agent trajectory prediction model that can attend to features of any agent at any previous timestep. Our method significantly improves the state of the art on well-established pedestrian and autonomous driving datasets.
arXiv Detail & Related papers (2021-03-25T17:59:01Z)
Imitating Interactive Intelligence [24.95842455898523]
We study how to design artificial agents that can interact naturally with humans using the simplification of a virtual environment. To build agents that can robustly interact with humans, we would ideally train them while they interact with humans. We use ideas from inverse reinforcement learning to reduce the disparities between human-human and agent-agent interactive behaviour.
arXiv Detail & Related papers (2020-12-10T13:55:47Z)
Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works. However, learning a model that captures the dynamics of complex skills represents a major challenge. We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.