Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model
- URL: http://arxiv.org/abs/2508.07863v1
- Date: Mon, 11 Aug 2025 11:26:10 GMT
- Title: Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model
- Authors: Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, Zongqing Lu,
- Abstract summary: Being-M0.5 is the first real-time, controllable vision-language-motion model that achieves performance across multiple motion generation tasks.<n>Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date.<n>We introduce a novel part-aware residual quantization technique for motion tokenization that enables precise, granular control over individual body parts during generation.
- Score: 67.8026841949812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human motion generation has emerged as a critical technology with transformative potential for real-world applications. However, existing vision-language-motion models (VLMMs) face significant limitations that hinder their practical deployment. We identify controllability as a main bottleneck, manifesting in five key aspects: inadequate response to diverse human commands, limited pose initialization capabilities, poor performance on long-term sequences, insufficient handling of unseen scenarios, and lack of fine-grained control over individual body parts. To overcome these limitations, we present Being-M0.5, the first real-time, controllable VLMM that achieves state-of-the-art performance across multiple motion generation tasks. Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date, comprising over 5 million self-collected motion sequences, 100 million multi-task instructional instances, and detailed part-level annotations that address a critical gap in existing datasets. We introduce a novel part-aware residual quantization technique for motion tokenization that enables precise, granular control over individual body parts during generation. Extensive experimental validation demonstrates Being-M0.5's superior performance across diverse motion benchmarks, while comprehensive efficiency analysis confirms its real-time capabilities. Our contributions include design insights and detailed computational analysis to guide future development of practical motion generators. We believe that HuMo100M and Being-M0.5 represent significant advances that will accelerate the adoption of motion generation technologies in real-world applications. The project page is available at https://beingbeyond.github.io/Being-M0.5.
Related papers
- PMG: Parameterized Motion Generator for Human-like Locomotion Control [14.637220434597168]
We develop a real-time motion generator that produces human-like locomotion in a single integrated system.<n>We show that, within a single integrated system, PMG produces natural, human-like locomotion, responds precisely to high-dimensional control inputs.<n>These results establish a practical, experimentally validated pathway toward natural and deployable humanoid control.
arXiv Detail & Related papers (2026-02-13T06:38:04Z) - VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation [61.82502719679122]
We introduce VLNVerse, a benchmark for Versatile, Embodied, Realistic Simulation, and Evaluation.<n>VLNVerse redefines VLN as a scalable, full-stack embodied AI problem.<n>We propose a novel unified multi-task model capable of addressing all tasks within the benchmark.
arXiv Detail & Related papers (2025-12-22T04:27:26Z) - SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control [85.91101551600978]
We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements.<n>We build a foundation model for motion tracking by scaling along three axes: network size, dataset volume, and compute.<n>We show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces.
arXiv Detail & Related papers (2025-11-11T04:37:40Z) - Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data [26.595803661584032]
We push text-to-motion into a new era, that is, to achieve the generalization ability of zero-shot.<n>We introduce MotionMillion-the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences.<n>We propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation.
arXiv Detail & Related papers (2025-07-09T17:52:04Z) - PhysiInter: Integrating Physical Mapping for High-Fidelity Human Interaction Generation [35.563978243352764]
We introduce physical mapping, integrated throughout the human interaction generation pipeline.<n>Specifically, motion imitation within a physics-based simulation environment is used to project target motions into a physically valid space.<n>Experiments show our method achieves impressive results in generated human motion quality, with a 3%-89% improvement in physical fidelity.
arXiv Detail & Related papers (2025-06-09T06:04:49Z) - GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework.<n>Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals.<n>Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
arXiv Detail & Related papers (2025-05-02T17:59:55Z) - λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics [11.901933884058021]
We introduce the LAMBDA benchmark-Long-horizon Actions for Mobile-manipulation Benchmarking of Directed Activities.<n>Our benchmark includes 571 human-collected demonstrations that provide realism and diversity in simulated and real-world settings.<n>We find that learning methods, even when pretrained, yield lower success rates, while a neuro-symbolic method performs significantly better and requires less data.
arXiv Detail & Related papers (2024-11-28T19:31:50Z) - IMUDiffusion: A Diffusion Model for Multivariate Time Series Synthetisation for Inertial Motion Capturing Systems [0.0]
We propose IMUDiffusion, a probabilistic diffusion model specifically designed for time series generation.
Our approach enables the generation of high-quality time series sequences which accurately capture the dynamics of human activities.
In some cases, we are able to improve the macro F1-score by almost 30%.
arXiv Detail & Related papers (2024-11-05T09:53:52Z) - Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z) - Transformer Inertial Poser: Attention-based Real-time Human Motion
Reconstruction from Sparse IMUs [79.72586714047199]
We propose an attention-based deep learning method to reconstruct full-body motion from six IMU sensors in real-time.
Our method achieves new state-of-the-art results both quantitatively and qualitatively, while being simple to implement and smaller in size.
arXiv Detail & Related papers (2022-03-29T16:24:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.