ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
- URL: http://arxiv.org/abs/2512.16234v1
- Date: Thu, 18 Dec 2025 06:28:42 GMT
- Title: ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
- Authors: Zichen Geng, Zeeshan Hayder, Wei Liu, Hesheng Wang, Ajmal Mian
- Abstract summary: 3D human reaction generation faces three main challenges: high motion fidelity, real-time inference, and autoregressive adaptability for online scenarios. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
- Score: 48.716675019745885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, which encodes the generated history instead of the ground-truth history, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, which achieves state-of-the-art performance with the fastest inference among offline methods. ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
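The single-step autoregressive generation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `mean_velocity`, `one_step_sample`, and `rollout` are hypothetical names, the "predictor" is a closed-form stand-in rather than a trained causal encoder plus MLP, and the only piece taken from MeanFlow itself is the one-step sampling rule z_0 = z_1 - u(z_1, r=0, t=1).

```python
import random

def mean_velocity(z, r, t, context):
    """Hypothetical stand-in for a learned average-velocity predictor u(z, r, t | context).

    On the straight path z_t = (1 - t) * x + t * noise, the velocity (noise - x) is
    constant, so the average velocity over any [r, t] equals (z_t - x) / t.
    The toy target x is the per-dimension mean of the context frames; r is unused
    because the velocity is constant along this path.
    """
    target = [sum(col) / len(col) for col in zip(*context)]
    return [(zi - xi) / t for zi, xi in zip(z, target)]

def one_step_sample(context, dim, rng):
    """MeanFlow-style one-step sampling: z_0 = z_1 - (1 - 0) * u(z_1, 0, 1)."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # noise drawn from the prior
    u = mean_velocity(z1, 0.0, 1.0, context)
    return [zi - (1.0 - 0.0) * ui for zi, ui in zip(z1, u)]

def rollout(actor_frames, init_reactor, seed=0):
    """Generate one reactor frame per incoming actor frame, autoregressively.

    The context only contains the current actor frame and the previously
    GENERATED reactor frame, so no future information is used.
    """
    rng = random.Random(seed)
    history = [list(init_reactor)]
    for actor in actor_frames:
        context = [list(actor), history[-1]]
        history.append(one_step_sample(context, len(actor), rng))
    return history[1:]

if __name__ == "__main__":
    frames = rollout([[1.0, 1.0], [3.0, 3.0]], [0.0, 0.0])
    print(frames)
```

Each frame costs exactly one predictor call, which is where the low latency of single-step inference comes from. Because the rollout conditions on its own generated frames, any prediction error feeds into later steps; the abstract states that BSCE addresses this by encoding generated history during training as well.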
Related papers
- HybridFlow: A Two-Step Generative Policy for Robotic Manipulation [2.2200541495683996]
MeanFlow, as a one-step variant of flow matching, has shown strong potential in image generation.
HybridFlow balances inference speed and generation quality by leveraging the speed of MeanFlow's one-step generation.
We envision HybridFlow as a practical low-latency method to enhance real-world interaction capabilities of robotic manipulation policies.
arXiv Detail & Related papers (2026-02-14T10:50:23Z)
- ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge [11.016302257907936]
Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control.
Current VLA models operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding.
We introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms.
arXiv Detail & Related papers (2025-12-23T11:29:03Z)
- One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow [56.13949180229929]
We introduce a one-step generative policy for offline reinforcement learning that maps noise directly to actions via a residual reformulation of MeanFlow.
Our method achieves strong performance in both offline and offline-to-online reinforcement learning settings.
arXiv Detail & Related papers (2025-11-17T06:34:17Z)
- INC: An Indirect Neural Corrector for Auto-Regressive Hybrid PDE Solvers [61.84396402100827]
We propose the Indirect Neural Corrector ($\mathrm{INC}$), which integrates learned corrections into the governing equations.
$\mathrm{INC}$ reduces the error amplification on the order of $t^{-1} + L$, where $t$ is the timestep and $L$ the Lipschitz constant.
We test $\mathrm{INC}$ in extensive benchmarks, covering numerous differentiable solvers, neural backbones, and test cases ranging from a 1D chaotic system to 3D turbulence.
arXiv Detail & Related papers (2025-11-16T20:14:28Z)
- MeanFlowSE: one-step generative speech enhancement via conditional mean flow [13.437825847370442]
MeanFlowSE is a conditional generative model that learns the average velocity over finite intervals along a trajectory.
On VoiceBank-DEMAND, the single-step model achieves strong intelligibility, fidelity, and perceptual quality with substantially lower computational cost than multistep baselines.
arXiv Detail & Related papers (2025-09-18T11:24:47Z)
- SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs.
We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions.
With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z)
- Auto-Regressive Diffusion for Generating 3D Human-Object Interactions [5.587507490937267]
A key challenge in HOI generation is maintaining interaction consistency in long sequences.
We propose an autoregressive diffusion model (ARDHOI) that predicts the next continuous token.
Our model has been evaluated on the OMOMO and BEHAVE datasets.
arXiv Detail & Related papers (2025-03-21T02:25:59Z)
- AdaFlow: Imitation Learning with Variance-Adaptive Flow-Based Policies [21.024480978703288]
We propose AdaFlow, an imitation learning framework based on flow-based generative modeling.
AdaFlow represents the policy with state-conditioned ordinary differential equations (ODEs).
We show that AdaFlow achieves high performance with fast inference speed.
arXiv Detail & Related papers (2024-02-06T10:15:38Z)
- GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the rendering pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
arXiv Detail & Related papers (2023-05-01T12:24:09Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR accuracy under low-latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [74.04394069262108]
We propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously.
FastLR achieves a speedup of up to 10.97$\times$ compared with the state-of-the-art lipreading model.
arXiv Detail & Related papers (2020-08-06T08:28:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.