Free-T2M: Robust Text-to-Motion Generation for Humanoid Robots via Frequency-Domain
- URL: http://arxiv.org/abs/2501.18232v2
- Date: Mon, 10 Nov 2025 09:54:26 GMT
- Title: Free-T2M: Robust Text-to-Motion Generation for Humanoid Robots via Frequency-Domain
- Authors: Wenshuo Chen, Haozhe Jia, Songning Lai, Lei Wang, Yuqi Lin, Hongru Xiao, Lijie Hu, Yutao Yue,
- Abstract summary: This paper reframes the T2M problem from a frequency-domain perspective. We introduce Frequency enhanced text-to-motion (Free-T2M), a framework incorporating stage-specific frequency-domain consistency alignment. Extensive experiments show our method dramatically improves motion quality and semantic correctness.
- Score: 17.042533970366105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enabling humanoid robots to synthesize complex, physically coherent motions from natural language commands is a cornerstone of autonomous robotics and human-robot interaction. While diffusion models have shown promise in this text-to-motion (T2M) task, they often generate semantically flawed or unstable motions, limiting their applicability to real-world robots. This paper reframes the T2M problem from a frequency-domain perspective, revealing that the generative process mirrors a hierarchical control paradigm. We identify two critical phases: a semantic planning stage, where low-frequency components establish the global motion trajectory, and a fine-grained execution stage, where high-frequency details refine the movement. To address the distinct challenges of each phase, we introduce Frequency enhanced text-to-motion (Free-T2M), a framework incorporating stage-specific frequency-domain consistency alignment. We design a frequency-domain temporal-adaptive module to modulate the alignment effects of different frequency bands. These designs enforce robustness in the foundational semantic plan and enhance the accuracy of detailed execution. Extensive experiments show our method dramatically improves motion quality and semantic correctness. Notably, when applied to the StableMoFusion baseline, Free-T2M reduces the FID from 0.152 to 0.060, establishing a new state-of-the-art within diffusion architectures. These findings underscore the critical role of frequency-domain insights for generating robust and reliable motions, paving the way for more intuitive natural language control of robots.
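The abstract's two-stage view — low-frequency components carrying the global semantic plan, high-frequency components carrying fine-grained execution details — can be illustrated with a minimal frequency split of a motion trajectory. The sketch below is an illustrative NumPy example, not the paper's actual Free-T2M module; the cutoff value and the toy signal are arbitrary assumptions:

```python
import numpy as np

def split_motion_frequencies(motion: np.ndarray, cutoff: int):
    """Split a motion sequence (frames x joints) into low- and
    high-frequency components along the time axis via the real FFT.
    `cutoff` is the number of low-frequency bins kept in the low-pass
    reconstruction; everything above it becomes the residual."""
    spectrum = np.fft.rfft(motion, axis=0)
    low = spectrum.copy()
    low[cutoff:] = 0.0                                  # keep only slow components
    low_motion = np.fft.irfft(low, n=motion.shape[0], axis=0)
    high_motion = motion - low_motion                   # fine-grained residual
    return low_motion, high_motion

# Toy trajectory: a slow sinusoid (the "semantic plan") plus fast
# jitter (the "execution details").
t = np.linspace(0, 2 * np.pi, 128, endpoint=False)
motion = np.stack([np.sin(t) + 0.1 * np.sin(20 * t)], axis=1)
low, high = split_motion_frequencies(motion, cutoff=5)
# The low band recovers the slow sinusoid; the residual holds the jitter.
```

Because the split is a simple spectral mask, the two bands sum exactly back to the original sequence, which is what makes band-specific consistency losses well-defined.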
Related papers
- Bridging the Sim-to-Real Gap with multipanda ros2: A Real-Time ROS2 Framework for Multimanual Systems [22.26675117934127]
We present multipanda_ros2, a novel open-source ROS2 architecture for multi-robot control of Franka Robotics robots. Our core contributions address key challenges in real-time torque control, including interaction control and robot-environment modeling.
arXiv Detail & Related papers (2026-02-02T16:11:12Z) - T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation [3.6564162676635363]
Text-to-motion generation has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Existing models treat motion periodicity and saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. We propose T2M Mamba to address these limitations by (i) proposing a Periodicity-Saliency Aware Mamba.
arXiv Detail & Related papers (2026-02-01T17:42:53Z) - Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection [67.84730634802204]
Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management. Most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions. We observe that frequency-domain feature modeling, particularly in the wavelet domain, amplifies fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain.
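As a toy illustration of why wavelet detail bands expose subtle structural changes, the sketch below applies a generic one-level Haar decomposition to two made-up 8×8 "frames" that differ by a single thin line. This is not the paper's encoder; the frames, sizes, and the Haar filter choice are all assumptions for illustration:

```python
import numpy as np

def haar2d(img: np.ndarray):
    """One-level 2D Haar decomposition.
    Returns the (LL, LH, HL, HH) subbands: spatial average,
    vertical-variation detail, horizontal-variation detail,
    and diagonal detail, each at half resolution."""
    a = (img[:, 0::2] + img[:, 1::2]) / 2.0   # row-wise averages
    d = (img[:, 0::2] - img[:, 1::2]) / 2.0   # row-wise differences
    ll = (a[0::2] + a[1::2]) / 2.0
    lh = (a[0::2] - a[1::2]) / 2.0
    hl = (d[0::2] + d[1::2]) / 2.0
    hh = (d[0::2] - d[1::2]) / 2.0
    return ll, lh, hl, hh

# Two toy frames differing by a thin one-pixel-wide horizontal line.
before = np.zeros((8, 8))
after = before.copy()
after[4, :] = 1.0                              # subtle new linear structure
diff_bands = [np.abs(b - a).sum()
              for a, b in zip(haar2d(before), haar2d(after))]
# The change registers in the LH detail band, isolating its orientation;
# the purely diagonal and horizontal-variation bands stay silent.
```

The oriented detail bands localize both the position and the direction of the change, which is the kind of edge sensitivity the abstract attributes to wavelet-domain modeling.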
arXiv Detail & Related papers (2025-08-07T11:14:16Z) - WaMo: Wavelet-Enhanced Multi-Frequency Trajectory Analysis for Fine-Grained Text-Motion Retrieval [7.349030413222046]
Text-Motion Retrieval aims to retrieve 3D motion sequences semantically relevant to text descriptions. We propose WaMo, a novel wavelet-based multi-frequency feature extraction framework. WaMo captures part-specific and time-varying motion details across multiple resolutions on body joints.
arXiv Detail & Related papers (2025-08-05T11:44:26Z) - FreqPolicy: Frequency Autoregressive Visuomotor Policy with Continuous Tokens [47.735852718586216]
We propose a novel paradigm for visuomotor policy learning that progressively models hierarchical frequency components. To further enhance precision, we introduce continuous latent representations that maintain smoothness and continuity in the action space. Our approach outperforms existing methods in both accuracy and efficiency.
arXiv Detail & Related papers (2025-06-02T12:13:51Z) - FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models. We introduce FreSca, a novel framework that decomposes the noise difference into low- and high-frequency components. FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
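FreSca's core idea — scaling the low- and high-frequency bands of a diffusion guidance signal independently — can be sketched in one dimension. The function below is an illustrative sketch, not FreSca's actual implementation; the cutoff, the scale factors, and the toy signal are all assumptions:

```python
import numpy as np

def scale_frequency_bands(delta, low_scale=1.0, high_scale=1.5, cutoff=4):
    """Scale the low- and high-frequency bands of a 1D signal
    independently via the real FFT. In FreSca the signal would be the
    noise-prediction difference inside a latent diffusion sampler;
    here it is just a plain array."""
    spectrum = np.fft.rfft(delta)
    spectrum[:cutoff] *= low_scale    # coarse structure
    spectrum[cutoff:] *= high_scale   # fine detail
    return np.fft.irfft(spectrum, n=delta.shape[0])

# A pure high-frequency tone is amplified by exactly `high_scale`,
# while the low band is left untouched.
t = np.arange(64)
tone = np.sin(2 * np.pi * 10 * t / 64)        # lives in FFT bin 10
boosted = scale_frequency_bands(tone, low_scale=1.0, high_scale=2.0)
```

Because the operation is a per-band linear rescaling in the spectrum, it needs no retraining and can wrap any signal the sampler already produces, which is the "model- and task-agnostic" property the abstract claims.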
arXiv Detail & Related papers (2025-04-02T22:03:11Z) - ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z) - One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step.
To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration.
Our method achieves strong performance on both full and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z) - Region-Adaptive Sampling for Diffusion Transformers [23.404921023113324]
RAS dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model.
We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality.
arXiv Detail & Related papers (2025-02-14T18:59:36Z) - Accelerate High-Quality Diffusion Models with Inner Loop Feedback [50.00066451431194]
Inner Loop Feedback (ILF) is a novel approach to accelerate diffusion models' inference. ILF trains a lightweight module to predict future features in the denoising process. ILF achieves strong performance for both class-to-image generation with diffusion transformer (DiT) and text-to-image generation with DiT-based PixArt-alpha and PixArt-sigma.
arXiv Detail & Related papers (2025-01-22T18:59:58Z) - FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion [63.609399000712905]
Inference at a scaled resolution leads to repetitive patterns and structural distortions. We propose two simple modules that combine to solve these issues. Our method, coined FAM Diffusion, can seamlessly integrate into any latent diffusion model and requires no additional training.
arXiv Detail & Related papers (2024-11-27T17:51:44Z) - FTMoMamba: Motion Generation with Frequency and Text State Space Models [53.60865359814126]
We propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model and a Text State Space Model.
To learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components.
To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level.
arXiv Detail & Related papers (2024-11-26T15:48:12Z) - Frequency Decoupling for Motion Magnification via Multi-Level Isomorphic Architecture [42.51987004849891]
Video Motion Magnification aims to reveal subtle and imperceptible motion information of objects in the macroscopic world.
We present FD4MM, a new paradigm of Frequency Decoupling for Motion Magnification with a Multi-level Isomorphic Architecture.
We show that FD4MM reduces FLOPs by 1.63× and boosts inference speed by 1.68× compared to the latest method.
arXiv Detail & Related papers (2024-03-12T06:07:29Z) - MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation [19.999239668765885]
MotionMix is a weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences.
Our framework consistently achieves state-of-the-art performances on text-to-motion, action-to-motion, and music-to-dance tasks.
arXiv Detail & Related papers (2024-01-20T04:58:06Z) - DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions.
Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences.
We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
arXiv Detail & Related papers (2023-12-07T04:39:22Z) - Diffusion Models Without Attention [110.5623058129782]
Diffusion State Space Model (DiffuSSM) is an architecture that supplants attention mechanisms with a more scalable state space model backbone.
Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward.
arXiv Detail & Related papers (2023-11-30T05:15:35Z) - Fast Diffusion Model [122.36693015093041]
Diffusion models (DMs) have been adopted across diverse fields with their abilities in capturing intricate data distributions.
In this paper, we propose a Fast Diffusion Model (FDM) to significantly speed up DMs from a DM optimization perspective.
arXiv Detail & Related papers (2023-06-12T09:38:04Z) - A Cheaper and Better Diffusion Language Model with Soft-Masked Noise [62.719656543880596]
Masked-Diffuse LM is a novel diffusion model for language modeling, inspired by linguistic features of natural language.
Specifically, we design a linguistic-informed forward process which adds corruptions to the text through strategically soft-masking to better noise the textual data.
We demonstrate that our Masked-Diffuse LM can achieve better generation quality than the state-of-the-art diffusion models with better efficiency.
arXiv Detail & Related papers (2023-04-10T17:58:42Z) - ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2022-07-13T17:45:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.