Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input
- URL: http://arxiv.org/abs/2504.08449v1
- Date: Fri, 11 Apr 2025 11:18:57 GMT
- Title: Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input
- Authors: Jian Wang, Rishabh Dabral, Diogo Luvizon, Zhe Cao, Lingjie Liu, Thabo Beeler, Christian Theobalt
- Abstract summary: This work focuses on tracking and understanding human motion using consumer wearable devices, such as VR/AR headsets, smart glasses, cellphones, and smartwatches. We present Ego4o (o for omni), a new framework for simultaneous human motion capture and understanding from multi-modal egocentric inputs.
- Score: 62.51283548975632
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work focuses on tracking and understanding human motion using consumer wearable devices, such as VR/AR headsets, smart glasses, cellphones, and smartwatches. These devices provide diverse, multi-modal sensor inputs, including egocentric images, and 1-3 sparse IMU sensors in varied combinations. Motion descriptions can also accompany these signals. The diverse input modalities and their intermittent availability pose challenges for consistent motion capture and understanding. In this work, we present Ego4o (o for omni), a new framework for simultaneous human motion capture and understanding from multi-modal egocentric inputs. This method maintains performance with partial inputs while achieving better results when multiple modalities are combined. First, the IMU sensor inputs, the optional egocentric image, and text description of human motion are encoded into the latent space of a motion VQ-VAE. Next, the latent vectors are sent to the VQ-VAE decoder and optimized to track human motion. When motion descriptions are unavailable, the latent vectors can be input into a multi-modal LLM to generate human motion descriptions, which can further enhance motion capture accuracy. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in predicting accurate human motion and high-quality motion descriptions.
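The capture stage described in the abstract (multi-modal encoders feeding a motion VQ-VAE whose latent vectors are decoded and optimized against the sensor observations) can be illustrated with a minimal PyTorch-style sketch. All names (MotionVQVAE, track_motion), dimensions, and the simplified observation loss below are illustrative assumptions rather than the authors' implementation; the discrete VQ codebook and the multi-modal LLM captioning stage are omitted for brevity.

```python
import torch
import torch.nn as nn

class MotionVQVAE(nn.Module):
    """Toy motion autoencoder standing in for the paper's motion VQ-VAE.
    The discrete codebook/quantization step is omitted; latents stay continuous."""
    def __init__(self, motion_dim=66, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(motion_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, motion_dim))

    def decode(self, z):
        return self.decoder(z)

def track_motion(fused_latents, vqvae, imu_obs, steps=50, lr=1e-2):
    """Refine per-frame latents so the decoded motion agrees with the IMU observations.

    fused_latents: (T, latent_dim) latents produced by the IMU / image / text encoders.
    imu_obs:       (T, imu_dim) observed IMU signal; here a toy target that a slice of
                   the decoded pose vector should match.
    """
    for p in vqvae.parameters():          # keep the pretrained decoder frozen
        p.requires_grad_(False)
    z = fused_latents.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        motion = vqvae.decode(z)                               # (T, motion_dim)
        # Toy observation model: compare part of the decoded motion to the IMU target.
        loss = ((motion[:, : imu_obs.shape[-1]] - imu_obs) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vqvae.decode(z).detach()

# Usage with random placeholders for the fused multi-modal latents and the IMU signal.
vqvae = MotionVQVAE()
T = 30                                   # frames in the motion window
fused = torch.randn(T, 256)              # stand-in for encoded IMU + image + text features
imu_target = torch.randn(T, 18)          # stand-in for 3 IMUs x (orientation + acceleration)
tracked = track_motion(fused, vqvae, imu_target)
print(tracked.shape)                     # torch.Size([30, 66])
```

Optimizing the latents rather than the pose directly keeps the tracked motion on the manifold learned by the VQ-VAE, which is how the framework can stay robust when some modalities are missing.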
Related papers
- Mojito: LLM-Aided Motion Instructor with Jitter-Reduced Inertial Tokens [37.26990830273303]
Inertial measurement units (IMUs) offer lightweight, wearable, and privacy-conscious motion sensing. Processing of streaming IMU data faces challenges such as wireless transmission instability, sensor noise, and drift. We introduce Mojito, an intelligent motion agent that integrates inertial sensing with large language models for interactive motion capture and behavioral analysis.
arXiv Detail & Related papers (2025-02-22T10:31:58Z) - Generating Physically Realistic and Directable Human Motions from Multi-Modal Inputs [16.41735119504929]
This work focuses on generating realistic, physically-based human behaviors from multi-modal inputs, which may only partially specify the desired motion. The input may come from a VR controller providing arm motion and body velocity, partial key-point animation, computer vision applied to videos, or even higher-level motion goals. We introduce the Masked Humanoid Controller (MHC), a novel approach that applies multi-objective imitation learning on augmented and selectively masked motion demonstrations.
arXiv Detail & Related papers (2025-02-08T17:02:11Z) - Motion Capture from Inertial and Vision Sensors [60.5190090684795]
MINIONS is a large-scale Motion capture dataset collected from INertial and visION Sensors.
We conduct experiments on multi-modal motion capture using a monocular camera and very few IMUs.
arXiv Detail & Related papers (2024-07-23T09:41:10Z) - MotionLLM: Understanding Human Behaviors from Human Motions and Videos [40.132643319573205]
This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding.
We present MotionLLM, a framework for human motion understanding, captioning, and reasoning.
arXiv Detail & Related papers (2024-05-30T17:59:50Z) - Universal Humanoid Motion Representations for Physics-Based Control [71.46142106079292]
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control.
We first learn a motion imitator that can imitate all of human motion from a large, unstructured motion dataset.
We then create our motion representation by distilling skills directly from the imitator.
arXiv Detail & Related papers (2023-10-06T20:48:43Z) - Human MotionFormer: Transferring Human Motions with Vision Transformers [73.48118882676276]
Human motion transfer aims to transfer motions from a target dynamic person to a source static one for motion synthesis.
We propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perceptions to capture large and subtle motion matching.
Experiments show that our Human MotionFormer sets the new state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-02-22T11:42:44Z) - Executing your Commands via Motion Diffusion in Latent Space [51.64652463205012]
We propose a Motion Latent-based Diffusion model (MLD) to produce vivid motion sequences conforming to the given conditional inputs.
Our MLD achieves significant improvements over the state-of-the-art methods among extensive human motion generation tasks.
arXiv Detail & Related papers (2022-12-08T03:07:00Z) - QuestSim: Human Motion Tracking from Sparse Sensors with Simulated Avatars [80.05743236282564]
Real-time tracking of human body motion is crucial for immersive experiences in AR/VR.
We present a reinforcement learning framework that takes in sparse signals from an HMD and two controllers.
We show that a single policy can be robust to diverse locomotion styles, different body sizes, and novel environments.
arXiv Detail & Related papers (2022-09-20T00:25:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.