LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model
- URL: http://arxiv.org/abs/2509.25304v1
- Date: Mon, 29 Sep 2025 17:58:28 GMT
- Title: LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model
- Authors: Haozhe Jia, Wenshuo Chen, Yuqi Lin, Yang Yang, Lei Wang, Mang Ning, Bowen Tian, Songning Lai, Nanqian Jia, Yifan Chen, Yutao Yue
- Abstract summary: We propose a text-to-motion diffusion model that incorporates dual-path anchoring to enhance semantic alignment. LUMA achieves state-of-the-art performance on HumanML3D and KIT-ML, with FID scores of 0.035 and 0.123, respectively.
- Score: 18.564067196226436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While current diffusion-based models, typically built on U-Net architectures, have shown promising results on the text-to-motion generation task, they still suffer from semantic misalignment and kinematic artifacts. Through analysis, we identify severe gradient attenuation in the deep layers of the network as a key bottleneck, leading to insufficient learning of high-level features. To address this issue, we propose \textbf{LUMA} (\textit{\textbf{L}ow-dimension \textbf{U}nified \textbf{M}otion \textbf{A}lignment}), a text-to-motion diffusion model that incorporates dual-path anchoring to enhance semantic alignment. The first path incorporates a lightweight MoCLIP model trained via contrastive learning without relying on external data, offering semantic supervision in the temporal domain. The second path introduces complementary alignment signals in the frequency domain, extracted from low-frequency DCT components known for their rich semantic content. These two anchors are adaptively fused through a temporal modulation mechanism, allowing the model to progressively transition from coarse alignment to fine-grained semantic refinement throughout the denoising process. Experimental results on HumanML3D and KIT-ML demonstrate that LUMA achieves state-of-the-art performance, with FID scores of 0.035 and 0.123, respectively. Furthermore, LUMA accelerates convergence by 1.4$\times$ compared to the baseline, making it an efficient and scalable solution for high-fidelity text-to-motion generation.
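The frequency-domain anchor described in the abstract rests on a standard signal-processing fact: the low-frequency DCT coefficients of a motion sequence summarize its coarse, sequence-level content, while high frequencies carry fine kinematic detail. A minimal NumPy sketch of that idea, with a linear fusion schedule standing in for the paper's temporal modulation mechanism (the function names, the orthonormal DCT-II construction, and the schedule are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def dct_basis(T: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix; row k is the k-th frequency."""
    n = np.arange(T)
    M = np.cos(np.pi * (n + 0.5) * np.arange(T)[:, None] / T) * np.sqrt(2.0 / T)
    M[0] /= np.sqrt(2.0)  # rescale the DC row so that M @ M.T == I
    return M

def low_freq_anchor(motion: np.ndarray, k: int) -> np.ndarray:
    """Keep the k lowest-frequency DCT coefficients along time.

    motion: (T, D) array of pose features over T frames.
    The truncated coefficients act as a compact semantic summary.
    """
    M = dct_basis(motion.shape[0])
    return (M @ motion)[:k]  # shape (k, D)

def reconstruct(anchor: np.ndarray, T: int) -> np.ndarray:
    """Invert the truncated transform back to a length-T sequence."""
    coeffs = np.zeros((T, anchor.shape[1]))
    coeffs[: anchor.shape[0]] = anchor
    return dct_basis(T).T @ coeffs  # orthonormal, so the transpose inverts

def fuse_anchors(temporal: np.ndarray, spectral: np.ndarray,
                 step: int, total_steps: int) -> np.ndarray:
    """Hypothetical temporal modulation: weight the frequency anchor
    early in denoising (coarse alignment) and the temporal anchor
    late (fine-grained refinement)."""
    alpha = step / total_steps  # 0 -> coarse stage, 1 -> fine stage
    return alpha * temporal + (1.0 - alpha) * spectral
```

A sequence that is constant over time is fully described by its DC component, so truncating to k = 1 reconstructs it exactly; this is the sense in which the low-frequency band retains semantic content while discarding fine-grained kinematic variation.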
Related papers
- Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers [31.67315012315044]
We introduce a unified normalized convex fusion framework equipped with lightweight gates to systematically organize multi-layer LLM hidden states. Experiments establish Depth-wise Semantic Routing as the superior conditioning strategy. We find that purely time-wise fusion can paradoxically degrade visual generation fidelity.
arXiv Detail & Related papers (2026-02-03T13:30:13Z) - T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation [3.6564162676635363]
Text-to-motion generation has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Existing models treat motion periodicity and saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. We propose T2M Mamba to address these limitations by (i) proposing a Periodicity-Saliency Aware Mamba.
arXiv Detail & Related papers (2026-02-01T17:42:53Z) - REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion [11.138412313646995]
We introduce REGLUE, a unified latent diffusion framework. A lightweight convolutional semantic module nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation. On ImageNet 256x256, REGLUE consistently improves FID and convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG.
arXiv Detail & Related papers (2025-12-18T15:10:42Z) - DynaPURLS: Dynamic Refinement of Part-aware Representations for Skeleton-based Zero-Shot Action Recognition [51.80782323686666]
We introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art.
arXiv Detail & Related papers (2025-12-12T10:39:10Z) - Generalizing WiFi Gesture Recognition via Large-Model-Aware Semantic Distillation and Alignment [6.124050993047708]
WiFi-based gesture recognition has emerged as a promising RF sensing paradigm for AIoT environments. We propose a novel generalization framework, termed Large-Model-Aware Semantic Distillation and Alignment. Our method offers a scalable and deployable solution for generalized RF-based gesture interfaces in real-world AIoT applications.
arXiv Detail & Related papers (2025-10-15T10:28:50Z) - WaMo: Wavelet-Enhanced Multi-Frequency Trajectory Analysis for Fine-Grained Text-Motion Retrieval [7.349030413222046]
Text-Motion Retrieval aims to retrieve 3D motion sequences semantically relevant to text descriptions. We propose WaMo, a novel wavelet-based multi-frequency feature extraction framework. WaMo captures part-specific and time-varying motion details across multiple resolutions on body joints.
arXiv Detail & Related papers (2025-08-05T11:44:26Z) - CEM-FBGTinyDet: Context-Enhanced Foreground Balance with Gradient Tuning for tiny Objects [2.321156185872456]
We propose E-FPN-BS, a novel architecture integrating multi-scale feature enhancement and adaptive optimization. First, our Context Enhancement Module (CEM) employs dual-branch processing to align and compress high-level features for effective global-local fusion. Second, the Foreground-Background Separation Module (FBSM) generates spatial gating masks that dynamically amplify discriminative regions.
arXiv Detail & Related papers (2025-06-11T16:13:38Z) - FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [92.4205087439928]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose the Self-supervised Transfer (PST) module and the Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches. This combined approach enables FUSE to construct a universal image-event representation that only requires lightweight decoder adaptation for target datasets.
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. We propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z) - GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling [32.47567372398872]
GestureLSM is a flow-matching-based approach for Co-Speech Gesture Generation with spatial-temporal modeling. It achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods.
arXiv Detail & Related papers (2025-01-31T05:34:59Z) - Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z) - Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework - Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
arXiv Detail & Related papers (2023-05-31T06:59:21Z) - Scale-Semantic Joint Decoupling Network for Image-text Retrieval in Remote Sensing [23.598273691455503]
We propose a novel Scale-Semantic Joint Decoupling Network (SSJDN) for remote sensing image-text retrieval.
Our proposed SSJDN outperforms state-of-the-art approaches in numerical experiments conducted on four benchmark remote sensing datasets.
arXiv Detail & Related papers (2022-12-12T08:02:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.