GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control
- URL: http://arxiv.org/abs/2412.11198v1
- Date: Sun, 15 Dec 2024 14:21:19 GMT
- Title: GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control
- Authors: Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, Alexandre Alahi,
- Abstract summary: We present GEM, a Generalizable Ego-vision Multimodal world model.
It predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories.
Our dataset is comprised of 4000+ hours of multimodal data across domains like autonomous driving, egocentric human activities, and drone flights.
- Score: 122.65089441381741
- License:
- Abstract: We present GEM, a Generalizable Ego-vision Multimodal world model that predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. Hence, our model has precise control over object dynamics, ego-agent motion and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long-horizon generations. Our dataset is comprised of 4000+ hours of multimodal data across domains like autonomous driving, egocentric human activities, and drone flights. Pseudo-labels are used to get depth maps, ego-trajectories, and human poses. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show GEM excels at generating diverse, controllable scenarios and temporal consistency over long generations. Code, models, and datasets are fully open-sourced.
Related papers
- Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration [28.825612240280822]
We propose a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control.
Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions.
We then incorporate egocentric visual context through a parameter efficient video-conditioned fine-tuning, enabling context-aware motion generation.
arXiv Detail & Related papers (2025-02-20T18:17:11Z) - EgoLM: Multi-Modal Language Model of Egocentric Motions [42.36945117610459]
We present EgoLM, a versatile framework that tracks and understands egocentric motions from multi-modal inputs.
Our key insight is to model the joint distribution of egocentric motions and natural languages using large language models.
arXiv Detail & Related papers (2024-09-26T17:59:31Z) - EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z) - LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free
Environment [59.320414108383055]
We present LiveHPS, a novel single-LiDAR-based approach for scene-level human pose and shape estimation.
We propose a huge human motion dataset, named FreeMotion, which is collected in various scenarios with diverse human poses.
arXiv Detail & Related papers (2024-02-27T03:08:44Z) - Aria-NeRF: Multimodal Egocentric View Synthesis [17.0554791846124]
We seek to accelerate research in developing rich, multimodal scene models trained from egocentric data, based on differentiable volumetric ray-tracing inspired by Neural Radiance Fields (NeRFs)
This dataset offers a comprehensive collection of sensory data, featuring RGB images, eye-tracking camera footage, audio recordings from a microphone, atmospheric pressure readings from a barometer, positional coordinates from GPS, and information from dual-frequency IMU datasets (1kHz and 800Hz)
The diverse data modalities and the real-world context captured within this dataset serve as a robust foundation for furthering our understanding of human behavior and enabling more immersive and intelligent experiences in
arXiv Detail & Related papers (2023-11-11T01:56:35Z) - Towards Multimodal Multitask Scene Understanding Models for Indoor
Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z) - A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation of TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z) - HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical
VAE [37.23381308240617]
We propose Hierarchical Transformer Dynamical Variational Autoencoder, HiT-DVAE, which implements auto-regressive generation with transformer-like attention mechanisms.
We evaluate the proposed method on HumanEva-I and Human3.6M with various evaluation methods, and outperform the state-of-the-art methods on most of the metrics.
arXiv Detail & Related papers (2022-04-04T15:12:34Z) - TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.