DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
- URL: http://arxiv.org/abs/2411.19527v3
- Date: Fri, 18 Apr 2025 08:49:08 GMT
- Title: DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
- Authors: Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu
- Abstract summary: We introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method to decode discrete motion tokens in the continuous, raw motion space. Our core idea is to frame token decoding as a conditional generation task, ensuring that DisCoRD captures fine-grained dynamics and smoother, more natural motions.
- Score: 29.643549839940025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human motion is inherently continuous and dynamic, posing significant challenges for generative models. While discrete generation methods are widely used, they suffer from limited expressiveness and frame-wise noise artifacts. In contrast, continuous approaches produce smoother, more natural motion but often struggle to adhere to conditioning signals due to high-dimensional complexity and limited training data. To resolve this discord between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that leverages rectified flow to decode discrete motion tokens in the continuous, raw motion space. Our core idea is to frame token decoding as a conditional generation task, ensuring that DisCoRD captures fine-grained dynamics and achieves smoother, more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals in diverse settings. Our project page is available at: https://whwjdqls.github.io/discord.github.io/.
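The rectified-flow decoding idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 3-dimensional "motion" vector, and the closed-form `toy_velocity` field (standing in for a learned, token-conditioned network) are all assumptions made for illustration. It shows only the two generic ingredients of rectified flow, the straight-line interpolation with its constant velocity target, and Euler integration of the conditional ODE from noise to a sample.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Rectified-flow regression target: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def toy_velocity(x, t, cond):
    """Hypothetical stand-in for a learned network v(x, t | cond).
    Closed-form field for the toy case where the target motion equals cond."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]

def euler_decode(velocity_fn, cond, x0, num_steps=8):
    """Integrate dx/dt = v(x, t | cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # largest t used is 1 - dt, so (1 - t) never reaches zero
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
cond = [0.5, -1.0, 2.0]  # hypothetical feature derived from discrete tokens
noise = [rng.gauss(0.0, 1.0) for _ in cond]
decoded = euler_decode(toy_velocity, cond, noise)
```

In this toy setting the conditioning fully determines the target, so the integration ends at `cond`; a trained decoder would instead map noise to one of many plausible motions consistent with the tokens.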
Related papers
- Fast Autoregressive Models for Continuous Latent Generation [49.079819389916764]
Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP.
A recent work, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous spaces using a diffusion head.
We propose the Fast AutoRegressive model (FAR), a novel framework that replaces MAR's diffusion head with a lightweight shortcut head.
arXiv Detail & Related papers (2025-04-24T13:57:08Z) - Flow Intelligence: Robust Feature Matching via Temporal Signature Correlation [12.239059174851654]
Flow Intelligence is a paradigm-shifting approach that focuses on temporal motion patterns exclusively.
Our method extracts motion signatures from pixel blocks across consecutive frames and extracts temporal motion signatures between videos.
By leveraging motion rather than appearance, Flow Intelligence enables robust, real-time video feature matching in diverse environments.
arXiv Detail & Related papers (2025-04-16T10:25:20Z) - ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech.
The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture.
To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z) - Auto-Regressive Diffusion for Generating 3D Human-Object Interactions [5.587507490937267]
A key challenge in HOI generation is maintaining interaction consistency in long sequences.
We propose an autoregressive diffusion model (ARDHOI) that predicts the next continuous token.
Our model has been evaluated on the OMOMO and BEHAVE datasets.
arXiv Detail & Related papers (2025-03-21T02:25:59Z) - MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space [40.60429652169086]
Text-conditioned streaming motion generation requires us to predict the next-step human pose based on variable-length historical motions and incoming texts.
Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths.
We propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model.
arXiv Detail & Related papers (2025-03-19T17:32:24Z) - Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation [45.214169930573775]
We propose a conditional diffusion model to synthesize contextually smooth transition frames.
Our approach transforms the unsupervised problem of transition frame generation into a supervised training task.
Experiments on the PHOENIX14T, USTC-CSL100, and USTC-500 datasets demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2024-11-25T15:06:49Z) - Continuous Speculative Decoding for Autoregressive Image Generation [33.05392461723613]
Continuous-valued Autoregressive (AR) image generation models have demonstrated notable superiority over their discrete-token counterparts.
Speculative decoding has proven effective in accelerating Large Language Models (LLMs).
This work generalizes the speculative decoding algorithm from discrete tokens to continuous space.
arXiv Detail & Related papers (2024-11-18T09:19:15Z) - Efficient Text-driven Motion Generation via Latent Consistency Training [21.348658259929053]
We propose a motion latent consistency training framework (MLCT) to solve nonlinear reverse diffusion trajectories. By combining these enhancements, we achieve stable and consistent training in non-pixel modality and latent representation spaces.
arXiv Detail & Related papers (2024-05-05T02:11:57Z) - SMURF: Continuous Dynamics for Motion-Deblurring Radiance Fields [14.681688453270523]
We propose sequential motion understanding radiance fields (SMURF), a novel approach that employs a neural ordinary differential equation (Neural-ODE) to model continuous camera motion.
Our model, rigorously evaluated against benchmark datasets, demonstrates state-of-the-art performance both quantitatively and qualitatively.
arXiv Detail & Related papers (2024-03-12T11:32:57Z) - Seamless Human Motion Composition with Blended Positional Encodings [38.85158088021282]
We introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without postprocessing or redundant denoising steps.
We achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets.
arXiv Detail & Related papers (2024-02-23T18:59:40Z) - RoHM: Robust Human Motion Reconstruction via Diffusion [58.63706638272891]
RoHM is an approach for robust 3D human motion reconstruction from monocular RGB(-D) videos.
Conditioned on noisy and occluded input data, it reconstructs complete, plausible motions in consistent global coordinates.
Our method outperforms state-of-the-art approaches qualitatively and quantitatively, while being faster at test time.
arXiv Detail & Related papers (2024-01-16T18:57:50Z) - DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions.
Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences.
We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
arXiv Detail & Related papers (2023-12-07T04:39:22Z) - Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion [56.38386580040991]
The Consistency Trajectory Model (CTM) is a generalization of Consistency Models (CM).
CTM enables the efficient combination of adversarial training and denoising score matching loss to enhance performance.
Unlike CM, CTM's access to the score function can streamline the adoption of established controllable/conditional generation methods.
arXiv Detail & Related papers (2023-10-01T05:07:17Z) - A Cheaper and Better Diffusion Language Model with Soft-Masked Noise [62.719656543880596]
Masked-Diffuse LM is a novel diffusion model for language modeling, inspired by linguistic features in languages.
Specifically, we design a linguistically informed forward process that corrupts the text through strategic soft-masking to better noise the textual data.
We demonstrate that our Masked-Diffuse LM can achieve better generation quality than the state-of-the-art diffusion models with better efficiency.
arXiv Detail & Related papers (2023-04-10T17:58:42Z) - MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition [50.345327516891615]
We develop a Motion-augmented Long-short Contrastive Learning (MoLo) method that contains two crucial components, including a long-short contrastive objective and a motion autodecoder.
MoLo can simultaneously learn long-range temporal context and motion cues for comprehensive few-shot matching.
arXiv Detail & Related papers (2023-04-03T13:09:39Z) - Modelling Latent Dynamics of StyleGAN using Neural ODEs [52.03496093312985]
We learn the trajectory of independently inverted latent codes from GANs.
The learned continuous trajectory allows us to perform infinite frame interpolation and consistent video manipulation.
Our method achieves state-of-the-art performance but with much less computation.
arXiv Detail & Related papers (2022-08-23T21:20:38Z) - Value Iteration in Continuous Actions, States and Time [99.00362538261972]
We propose a continuous fitted value iteration (cFVI) algorithm for continuous states and actions.
The optimal policy can be derived for non-linear control-affine dynamics.
Videos of the physical system are available at https://sites.google.com/view/value-iteration.
arXiv Detail & Related papers (2021-05-10T21:40:56Z) - Continuity-Discrimination Convolutional Neural Network for Visual Object Tracking [150.51667609413312]
This paper proposes a novel model, named Continuity-Discrimination Convolutional Neural Network (CD-CNN) for visual object tracking.
To address this problem, CD-CNN models temporal appearance continuity based on the idea of temporal slowness.
In order to alleviate inaccurate target localization and drifting, we propose a novel notion, object-centroid.
arXiv Detail & Related papers (2021-04-18T06:35:03Z)