GATSBI: Generative Agent-centric Spatio-temporal Object Interaction
- URL: http://arxiv.org/abs/2104.04275v1
- Date: Fri, 9 Apr 2021 09:45:00 GMT
- Title: GATSBI: Generative Agent-centric Spatio-temporal Object Interaction
- Authors: Cheol-Hui Min, Jinseok Bae, Junho Lee and Young Min Kim
- Abstract summary: GATSBI is a generative model that transforms a sequence of raw observations into a structured latent representation.
We show GATSBI achieves superior performance on scene decomposition and video prediction compared to its state-of-the-art counterparts.
- Score: 9.328991021103294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present GATSBI, a generative model that can transform a sequence of raw
observations into a structured latent representation that fully captures the
spatio-temporal context of the agent's actions. In vision-based decision-making
scenarios, an agent faces complex high-dimensional observations where multiple
entities interact with each other. The agent requires a good scene
representation of the visual observation that discerns essential components and
consistently propagates along the time horizon. Our method, GATSBI, utilizes
unsupervised object-centric scene representation learning to separate an active
agent, static background, and passive objects. GATSBI then models the
interactions reflecting the causal relationships among decomposed entities and
predicts physically plausible future states. Our model generalizes to a variety
of environments where different types of robots and objects dynamically
interact with each other. We show GATSBI achieves superior performance on scene
decomposition and video prediction compared to its state-of-the-art
counterparts.
Related papers
- Learning Collective Dynamics of Multi-Agent Systems using Event-based Vision [15.26086907502649]
This paper proposes a novel problem: vision-based perception to learn and predict the collective dynamics of multi-agent systems.
We focus on deep learning models to directly predict collective dynamics from visual data, captured as frames or events.
We empirically demonstrate the effectiveness of event-based representation over traditional frame-based methods in predicting these collective behaviors.
arXiv Detail & Related papers (2024-11-11T14:45:47Z)
- Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
- Scaling Up Dynamic Human-Scene Interaction Modeling [58.032368564071895]
TRUMANS is the most comprehensive motion-captured HSI dataset currently available.
It intricately captures whole-body human motions and part-level object dynamics.
We devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length.
arXiv Detail & Related papers (2024-03-13T15:45:04Z)
- Agent AI: Surveying the Horizons of Multimodal Interaction [83.18367129924997]
"Agent AI" is a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data.
We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
arXiv Detail & Related papers (2024-01-07T19:11:18Z)
- Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
- Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos [31.620555223890626]
We study the problem of Short-Term Object interaction anticipation (STA).
We propose NAOGAT, a multi-modal end-to-end transformer network, to guide the model to predict context-aware future actions.
Our model outperforms existing methods on two separate datasets.
arXiv Detail & Related papers (2023-08-16T12:07:02Z)
- Object-Centric Scene Representations using Active Inference [4.298360054690217]
Representing a scene and its constituent objects from raw sensory data is a core ability for enabling robots to interact with their environment.
We propose a novel approach for scene understanding, leveraging a hierarchical object-centric generative model that enables an agent to infer object category.
For evaluating the behavior of an active vision agent, we also propose a new benchmark where, given a target viewpoint of a particular object, the agent needs to find the best matching viewpoint.
arXiv Detail & Related papers (2023-02-07T06:45:19Z)
- Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Full use of appearance features, spatial location, and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
arXiv Detail & Related papers (2021-08-19T11:57:27Z)
- Multi-Agent Imitation Learning with Copulas [102.27052968901894]
Multi-agent imitation learning aims to train multiple agents to perform tasks from demonstrations by learning a mapping between observations and actions.
In this paper, we propose to use copula, a powerful statistical tool for capturing dependence among random variables, to explicitly model the correlation and coordination in multi-agent systems.
Our proposed model is able to separately learn marginals that capture the local behavioral patterns of each individual agent, as well as a copula function that solely and fully captures the dependence structure among agents.
arXiv Detail & Related papers (2021-07-10T03:49:41Z)
- SIMONe: View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition [69.90530987240899]
We present an unsupervised variational approach to this problem.
Our model learns to infer two sets of latent representations from RGB video input alone.
It represents object attributes in an allocentric manner which does not depend on viewpoint.
arXiv Detail & Related papers (2021-06-07T17:59:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.