Related papers: PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis

URL: http://arxiv.org/abs/2506.17912v1
Date: Sun, 22 Jun 2025 06:24:53 GMT
Title: PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis
Authors: Chuhao Jin, Haosen Li, Bingzi Zhang, Che Liu, Xiting Wang, Ruihua Song, Wenbing Huang, Ying Qin, Fuzheng Zhang, Di Zhang
Abstract summary: PlanMoGPT is an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. It achieves state-of-the-art performance, improving FID scores by 63.8% (from 0.380 to 0.141) on long-sequence generation. The proposed framework successfully resolves the diversity-quality trade-off that plagues current non-LLM approaches.
Score: 40.338618060111116
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large language models (LLMs) have enabled breakthroughs in many multimodal generation tasks, but a significant performance gap still exists in text-to-motion generation, where LLM-based methods lag far behind non-LLM methods. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokenization induces local dependency issues, where LLMs overemphasize short-term coherence at the expense of global semantic alignment, while coarse-grained tokenization sacrifices motion details. To resolve this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. First, our progressive planning mechanism leverages LLMs' autoregressive capabilities to hierarchically generate motion tokens by starting from sparse global plans and iteratively refining them into full sequences. Second, our flow-enhanced tokenizer doubles the downsampling resolution and expands the codebook size by eight times, minimizing detail loss during discretization, while a flow-enhanced decoder recovers motion nuances. Extensive experiments on text-to-motion benchmarks demonstrate that it achieves state-of-the-art performance, improving FID scores by 63.8% (from 0.380 to 0.141) on long-sequence generation while enhancing motion diversity by 49.9% compared to existing methods. The proposed framework successfully resolves the diversity-quality trade-off that plagues current non-LLM approaches, establishing new standards for text-to-motion generation.
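The coarse-to-fine idea behind progressive planning can be illustrated with a toy sketch. The Python below is not the authors' implementation: the stride schedule (4, 2, 1), the function names, and the use of simple strided subsampling to form the sparse plan are all assumptions made for illustration. It only shows how a token sequence could be reordered so that sparse, globally spaced positions come first and the gaps are filled in later stages.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def progressive_order(length: int, strides: Sequence[int]) -> List[List[int]]:
    """Group positions 0..length-1 into stages, coarsest stride first.

    Hypothetical helper: each stage holds the positions reachable with the
    current stride that no earlier stage has already claimed.
    """
    seen = set()
    stages = []
    for stride in strides:
        stage = [i for i in range(0, length, stride) if i not in seen]
        seen.update(stage)
        stages.append(stage)
    return stages


def to_stages(tokens: Sequence[T], strides: Sequence[int]) -> List[List[T]]:
    """Reorder a token sequence into sparse-plan-first stages."""
    return [[tokens[i] for i in stage]
            for stage in progressive_order(len(tokens), strides)]


def from_stages(staged: List[List[T]], length: int,
                strides: Sequence[int]) -> List[T]:
    """Invert to_stages: place each staged token back at its position.

    Assumes the last stride is 1, so every position is covered.
    """
    out = [None] * length
    for positions, toks in zip(progressive_order(length, strides), staged):
        for pos, tok in zip(positions, toks):
            out[pos] = tok
    return out


if __name__ == "__main__":
    motion_tokens = list("abcdefgh")  # stand-in for discrete motion tokens
    strides = [4, 2, 1]               # assumed schedule, coarse to fine
    staged = to_stages(motion_tokens, strides)
    print(staged)
    print(from_stages(staged, len(motion_tokens), strides))
```

With an 8-token sequence and strides of 4, 2, and 1, the stages cover positions [0, 4], then [2, 6], then [1, 3, 5, 7]. An autoregressive model trained on such an ordering would see the sparse global plan before the fine-grained tokens, which is the intuition the abstract describes.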

Related papers

FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities [76.46448367752944]
Multimodal large language models (MLLMs) unify visual understanding and image generation within a single framework. Most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development. We introduce FUDOKI, a unified multimodal model purely based on discrete flow matching.
arXiv Detail & Related papers (2025-05-26T15:46:53Z)
LSR-MCTS: Alleviating Long Range Dependency in Code Generation [42.10272627826627]
Large language models (LLMs) have significantly promoted the development of code generation tasks. We propose the LSR-MCTS algorithm, which leverages MCTS to determine the code line-by-line and select the optimal path.
arXiv Detail & Related papers (2025-04-10T04:03:25Z)
Unlocking Pretrained LLMs for Motion-Related Multimodal Generation: A Fine-Tuning Approach to Unify Diffusion and Next-Token Prediction [2.997267793609097]
MoMug integrates diffusion-based continuous motion generation with the model's inherent autoregressive discrete text prediction capabilities. MoMug improves FID by 38% and mean accuracy across seven metrics by 16.61% on the text-to-motion task.
arXiv Detail & Related papers (2025-03-08T08:16:16Z)
LeMo: Enabling LEss Token Involvement for MOre Context Fine-tuning [38.35238373706948]
LeMo is a new LLM fine-tuning system that exploits a new token-level sparsity mechanism inherent in long-context scenarios. LeMo reduces memory consumption by up to 1.93x and achieves up to 1.36x speedups, outperforming state-of-the-art fine-tuning systems.
arXiv Detail & Related papers (2025-01-15T05:17:12Z)
MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation [52.739500459903724]
Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation. We propose a novel multi-agent LLM framework that distributes high-level planning and low-level control code generation across specialized LLM agents. We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero-shot setting.
arXiv Detail & Related papers (2024-11-26T17:53:44Z)
Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context. The typical autoregressive decoding method requires a separate forward pass through the model for each token generated. We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z)
FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation. To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed. We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
LLM3: Large Language Model-based Task and Motion Planning with Motion Failure Reasoning [78.2390460278551]
Conventional Task and Motion Planning (TAMP) approaches rely on manually crafted interfaces connecting symbolic task planning with continuous motion generation. Here, we present LLM3, a novel Large Language Model (LLM)-based TAMP framework featuring a domain-independent interface. Specifically, we leverage the powerful reasoning and planning capabilities of pre-trained LLMs to propose symbolic action sequences and select continuous action parameters for motion planning.
arXiv Detail & Related papers (2024-03-18T08:03:47Z)
DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions. We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
