Related papers: ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation

URL: http://arxiv.org/abs/2512.16234v1
Date: Thu, 18 Dec 2025 06:28:42 GMT
Title: ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
Authors: Zichen Geng, Zeeshan Hayder, Wei Liu, Hesheng Wang, Ajmal Mian
Abstract summary: 3D human reaction generation faces three main challenges: high motion fidelity, real-time inference, and autoregressive adaptability for online scenarios. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
Score: 48.716675019745885
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, which encodes the generated history instead of the ground-truth history, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, which achieves state-of-the-art performance with the fastest inference among offline methods. ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
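
The single-step, autoregressive sampling idea in the abstract can be sketched roughly as follows. This is a toy illustration and not the authors' implementation. The names `mean_velocity`, `one_step_sample`, and `generate_online`, the random linear "network" `W`, and the choice of context (current actor frame plus the previously generated reactor frame) are all assumptions made for the example; the paper itself describes a causal context encoder and an MLP-based velocity predictor. The only part taken from the general MeanFlow formulation is the one-step update z_0 = z_1 - u(z_1, r=0, t=1), where z_1 is noise.

```python
import numpy as np

def mean_velocity(z_t, r, t, context, W):
    """Hypothetical stand-in for a learned average-velocity predictor u(z_t, r, t | context)."""
    features = np.concatenate([z_t, context, [r, t]])
    return np.tanh(W @ features)

def one_step_sample(context, W, dim, rng):
    """One-step MeanFlow-style sampling: z_0 = z_1 - (1 - 0) * u(z_1, r=0, t=1)."""
    z1 = rng.standard_normal(dim)  # noise sample at t = 1
    return z1 - mean_velocity(z1, 0.0, 1.0, context, W)

def generate_online(actor_frames, W, dim, rng):
    """Autoregressive rollout: each reactor frame is conditioned on the current
    actor frame and on the previously *generated* reactor frame (assumed context)."""
    prev = np.zeros(dim)
    out = []
    for actor in actor_frames:
        context = np.concatenate([actor, prev])
        prev = one_step_sample(context, W, dim, rng)
        out.append(prev)
    return np.stack(out)

rng = np.random.default_rng(0)
dim = 4                                          # toy pose dimension (assumption)
actor_frames = rng.standard_normal((6, dim))     # six toy actor frames
W = 0.1 * rng.standard_normal((dim, 3 * dim + 2))  # untrained toy weights
reactions = generate_online(actor_frames, W, dim, rng)
print(reactions.shape)  # (6, 4)
```

Because the rollout always feeds back its own outputs, a model trained only on ground-truth history would see a different input distribution at inference time; the abstract's BSCE is described as addressing that mismatch by encoding generated history during training.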

Related papers

HybridFlow: A Two-Step Generative Policy for Robotic Manipulation [2.2200541495683996]
MeanFlow, as a one-step variant of flow matching, has shown strong potential in image generation. HybridFlow balances inference speed and generation quality by leveraging the speed of MeanFlow's one-step generation. We envision HybridFlow as a practical low-latency method to enhance real-world interaction capabilities of robotic manipulation policies.
arXiv Detail & Related papers (2026-02-14T10:50:23Z)
ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge [11.016302257907936]
Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control. Current VLA models operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. We introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms.
arXiv Detail & Related papers (2025-12-23T11:29:03Z)
One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow [56.13949180229929]
We introduce a one-step generative policy for offline reinforcement learning that maps noise directly to actions via a residual reformulation of MeanFlow. Our method achieves strong performance in both offline and offline-to-online reinforcement learning settings.
arXiv Detail & Related papers (2025-11-17T06:34:17Z)
INC: An Indirect Neural Corrector for Auto-Regressive Hybrid PDE Solvers [61.84396402100827]
We propose the Indirect Neural Corrector ($\mathrm{INC}$), which integrates learned corrections into the governing equations. $\mathrm{INC}$ reduces the error amplification on the order of $t^{-1} + L$, where $t$ is the timestep and $L$ the Lipschitz constant. We test $\mathrm{INC}$ in extensive benchmarks, covering numerous differentiable solvers, neural backbones, and test cases ranging from a 1D chaotic system to 3D turbulence.
arXiv Detail & Related papers (2025-11-16T20:14:28Z)
MeanFlowSE: one-step generative speech enhancement via conditional mean flow [13.437825847370442]
MeanFlowSE is a conditional generative model that learns the average velocity over finite intervals along a trajectory. On VoiceBank-DEMAND, the single-step model achieves strong intelligibility, fidelity, and perceptual quality with substantially lower computational cost than multistep baselines.
arXiv Detail & Related papers (2025-09-18T11:24:47Z)
SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs. We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z)
Auto-Regressive Diffusion for Generating 3D Human-Object Interactions [5.587507490937267]
A key challenge in HOI generation is maintaining interaction consistency in long sequences. We propose an autoregressive diffusion model (ARDHOI) that predicts the next continuous token. Our model has been evaluated on the OMOMO and BEHAVE datasets.
arXiv Detail & Related papers (2025-03-21T02:25:59Z)
AdaFlow: Imitation Learning with Variance-Adaptive Flow-Based Policies [21.024480978703288]
We propose AdaFlow, an imitation learning framework based on flow-based generative modeling. AdaFlow represents the policy with state-conditioned ordinary differential equations (ODEs). We show that AdaFlow achieves high performance with fast inference speed.
arXiv Detail & Related papers (2024-02-06T10:15:38Z)
GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency. NeRF has become a popular technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video. We propose GeneFace++ to handle these challenges by utilizing the rendering pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
arXiv Detail & Related papers (2023-05-01T12:24:09Z)
Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing. We propose a novel end-to-end streaming NAR speech recognition system. We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [74.04394069262108]
We propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously. FastLR achieves a speedup of up to 10.97$\times$ compared with a state-of-the-art lipreading model.
arXiv Detail & Related papers (2020-08-06T08:28:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.