Improving Generative Imagination in Object-Centric World Models
- URL: http://arxiv.org/abs/2010.02054v1
- Date: Mon, 5 Oct 2020 14:43:13 GMT
- Title: Improving Generative Imagination in Object-Centric World Models
- Authors: Zhixuan Lin, Yi-Fu Wu, Skand Peri, Bofeng Fu, Jindong Jiang, Sungjin
Ahn
- Abstract summary: We introduce Generative Structured World Models (G-SWM).
G-SWM unifies the key properties of previous models in a principled framework.
It achieves two crucial new abilities: multimodal uncertainty and situation-awareness.
- Score: 20.495475118576604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable recent advances in object-centric generative world models
raise a few questions. First, while many of the recent achievements are
indispensable for making a general and versatile world model, it is quite
unclear how these ingredients can be integrated into a unified framework.
Second, despite using generative objectives, prior work has mainly investigated object
detection and tracking, leaving the crucial ability of temporal imagination largely
unexamined. Third, a few key abilities for more faithful temporal imagination, such as
multimodal uncertainty and situation-awareness, are missing. In this paper, we introduce
Generative Structured World Models (G-SWM). G-SWM achieves versatile world modeling not
only by unifying the key properties of previous models in a principled framework but also
by providing two crucial new abilities: multimodal uncertainty and situation-awareness.
Our thorough investigation of temporal generation, in comparison to previous models,
demonstrates that G-SWM achieves this versatility with the best or comparable performance
across all experimental settings, including several complex settings that have not been
tested before.
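The two new abilities named in the abstract, multimodal uncertainty and situation-awareness, concern how each object's future is imagined. Below is a minimal, hypothetical sketch of a per-object rollout step that illustrates the two ideas with a mixture-of-Gaussians prior (multimodal futures) and attention over the other objects (interaction-aware dynamics); every class name, shape, and design choice here is an assumption for illustration, not the paper's actual G-SWM architecture.

```python
# Illustrative sketch only: per-object latent rollout with a mixture-of-Gaussians
# prior (multimodal uncertainty) and attention over objects (situation-awareness).
# Names and dimensions are assumptions, not the authors' G-SWM architecture.
import torch
import torch.nn as nn

class ObjectRolloutStep(nn.Module):
    def __init__(self, z_dim=16, hidden=64, n_modes=5):
        super().__init__()
        self.attn = nn.MultiheadAttention(z_dim, num_heads=1, batch_first=True)
        self.rnn = nn.GRUCell(z_dim, hidden)
        # Prior head outputs, per mode: a mixture logit, a mean, and a log-std.
        self.prior = nn.Linear(hidden, n_modes * (1 + 2 * z_dim))
        self.n_modes = n_modes

    def forward(self, z, h):
        # z: (B, K, z_dim) per-object latents; h: (B*K, hidden) per-object RNN state.
        B, K, D = z.shape
        ctx, _ = self.attn(z, z, z)               # each object attends to all objects
        h = self.rnn(ctx.reshape(B * K, D), h)    # propagate per-object temporal state
        p = self.prior(h).view(B * K, self.n_modes, 1 + 2 * D)
        logits, mu, log_std = p[..., 0], p[..., 1:1 + D], p[..., 1 + D:]
        mode = torch.distributions.Categorical(logits=logits).sample()   # pick a mode
        idx = mode.view(-1, 1, 1).expand(-1, 1, D)
        mu_m = mu.gather(1, idx).squeeze(1)
        std_m = log_std.gather(1, idx).squeeze(1).exp()
        z_next = mu_m + std_m * torch.randn_like(mu_m)  # sample the next per-object latent
        return z_next.view(B, K, D), h
```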
Related papers
- Eureka: Evaluating and Understanding Large Foundation Models [23.020996995362104]
We present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings.
We conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison.
arXiv Detail & Related papers (2024-09-13T18:01:49Z)
- Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities [5.22475289121031]
Multimodal models are expected to be a critical component to future advances in artificial intelligence.
This work provides a fresh perspective on generalist multimodal models via a novel taxonomy specific to architectures and training configurations.
arXiv Detail & Related papers (2024-06-08T15:30:46Z)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X.
SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks.
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z)
- On the Challenges and Opportunities in Generative AI [135.2754367149689]
We argue that current large-scale generative AI models do not sufficiently address several fundamental issues that hinder their widespread adoption across domains.
In this work, we aim to identify key unresolved challenges in modern generative AI paradigms that should be tackled to further enhance their capabilities, versatility, and reliability.
arXiv Detail & Related papers (2024-02-28T15:19:33Z)
- The Essential Role of Causality in Foundation World Models for Embodied AI [102.75402420915965]
Embodied AI agents will require the ability to perform new tasks in many different real-world environments.
Current foundation models fail to accurately model physical interactions and are therefore insufficient for Embodied AI.
The study of causality lends itself to the construction of veridical world models.
arXiv Detail & Related papers (2024-02-06T17:15:33Z)
- ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first- and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark several state-of-the-art language-only and multimodal models on our dataset, and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Stochastic Multi-Person 3D Motion Forecasting [21.915057426589744]
We address real-world complexities that prior work on human motion forecasting has ignored.
Our framework is general; we instantiate it with different generative models.
Our approach produces diverse and accurate multi-person predictions, significantly outperforming the state of the art.
arXiv Detail & Related papers (2023-06-08T17:59:09Z)
- Multiscale Generative Models: Improving Performance of a Generative Model Using Feedback from Other Dependent Generative Models [10.053377705165786]
We take a first step towards building interacting generative models (GANs) that reflect the interactions found in the real world.
We build and analyze a hierarchical set-up where a higher-level GAN is conditioned on the output of multiple lower-level GANs.
We present a technique for using feedback from the higher-level GAN to improve the performance of the lower-level GANs (a hypothetical sketch of this hierarchy appears after this list).
arXiv Detail & Related papers (2022-01-24T13:05:56Z)
- SyMetric: Measuring the Quality of Learnt Hamiltonian Dynamics Inferred from Vision [73.26414295633846]
A recently proposed class of models attempts to learn latent dynamics from high-dimensional observations.
Existing methods rely on image reconstruction quality, which does not always reflect the quality of the learnt latent dynamics.
We develop a set of new measures, including a binary indicator of whether the underlying Hamiltonian dynamics have been faithfully captured.
arXiv Detail & Related papers (2021-11-10T23:26:58Z)
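As referenced in the Multiscale Generative Models entry above, the following is a minimal, hypothetical sketch of a higher-level generator conditioned on the outputs of several lower-level generators. All class names, interfaces, and shapes are assumptions chosen for illustration; they are not the paper's actual models, and the feedback mechanism it describes is omitted here.

```python
# Hypothetical sketch: a higher-level generator conditioned on the concatenated
# outputs of several lower-level generators. Shapes and names are illustrative
# assumptions, not the architecture used in the paper.
import torch
import torch.nn as nn

class LowLevelGenerator(nn.Module):
    def __init__(self, z_dim=32, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, z):
        return self.net(z)

class HighLevelGenerator(nn.Module):
    """Generates a sample conditioned on all lower-level outputs."""
    def __init__(self, z_dim=32, cond_dim=3 * 64, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + cond_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, z, low_outputs):
        cond = torch.cat(low_outputs, dim=-1)        # condition on lower-level GAN outputs
        return self.net(torch.cat([z, cond], dim=-1))

# Usage: three lower-level generators feed one higher-level generator.
lows = [LowLevelGenerator() for _ in range(3)]
high = HighLevelGenerator()
z = torch.randn(4, 32)
low_outs = [g(torch.randn(4, 32)) for g in lows]
sample = high(z, low_outs)                            # shape (4, 128)
```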