Related papers: Dance Your Latents: Consistent Dance Generation through Spatial-temporal Subspace Attention Guided by Motion Flow

URL: http://arxiv.org/abs/2310.14780v1
Date: Fri, 20 Oct 2023 12:53:08 GMT
Title: Dance Your Latents: Consistent Dance Generation through Spatial-temporal Subspace Attention Guided by Motion Flow
Authors: Haipeng Fang, Zhihao Sun, Ziyao Huang, Fan Tang, Juan Cao, Sheng Tang
Abstract summary: We present Dance-Your-Latents, a framework that makes latents dance coherently following motion flow to generate consistent dance videos. Experimental results on the TikTok dataset demonstrate that our approach significantly enhances the spatiotemporal consistency of the generated videos.
Score: 22.1733448870831
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The advancement of generative AI has extended to the realm of Human Dance Generation, demonstrating superior generative capacities. However, current methods still exhibit deficiencies in achieving spatiotemporal consistency, resulting in artifacts like ghosting, flickering, and incoherent motions. In this paper, we present Dance-Your-Latents, a framework that makes latents dance coherently following motion flow to generate consistent dance videos. Firstly, considering that each constituent element moves within a confined space, we introduce spatial-temporal subspace-attention blocks that decompose the global space into a combination of regular subspaces and efficiently model the spatiotemporal consistency within these subspaces. This module enables each patch to pay attention to adjacent areas, mitigating the excessive dispersion of long-range attention. Furthermore, observing that the movement of body parts is guided by pose control, we design motion flow guided subspace align & restore. This method enables the attention to be computed on the irregular subspace along the motion flow. Experimental results on the TikTok dataset demonstrate that our approach significantly enhances spatiotemporal consistency of the generated videos.
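
The idea of restricting attention to regular subspaces, as described in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (partition_subspaces, subspace_attention), the block sizes, and the use of identity query/key/value projections are all assumptions made for illustration, and the motion flow guided align & restore step is not modeled. The sketch only shows how a (frames, height, width) grid of latent patches could be split into non-overlapping blocks so that each patch attends only to patches in its own block.

```python
import math
from itertools import product


def partition_subspaces(T, H, W, t, h, w):
    """Group grid positions (frame, row, col) into regular, non-overlapping
    blocks of size t x h x w. Hypothetical helper, not from the paper."""
    blocks = {}
    for pos in product(range(T), range(H), range(W)):
        key = (pos[0] // t, pos[1] // h, pos[2] // w)
        blocks.setdefault(key, []).append(pos)
    return list(blocks.values())


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def subspace_attention(feats, T, H, W, t, h, w):
    """Scaled dot-product self-attention computed independently inside each block.

    feats maps (frame, row, col) -> feature vector (list of floats).
    Assumption: queries, keys, and values are the raw features
    (no learned projections), to keep the sketch short.
    """
    out = {}
    for block in partition_subspaces(T, H, W, t, h, w):
        for q in block:
            d = len(feats[q])
            # Similarity of the query patch to every patch in the same block.
            scores = [
                sum(a * b for a, b in zip(feats[q], feats[k])) / math.sqrt(d)
                for k in block
            ]
            weights = softmax(scores)
            # Weighted average of the block's features.
            out[q] = [
                sum(wt * feats[k][i] for wt, k in zip(weights, block))
                for i in range(d)
            ]
    return out
```

With a 2 x 4 x 4 grid and 2 x 2 x 2 blocks, partition_subspaces yields four blocks of eight patches each, so a change to one patch can only influence outputs inside its own block. Attention cost then scales with the block size rather than with the full grid, which is the efficiency argument the abstract makes for subspace attention.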

Related papers

RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion [64.49056527678606]
We propose a Token-wise Attention integrated into not only the U-Net diffusion model but also the radar-temporal encoder. Unlike prior approaches, our method integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion. Our experiments and evaluations demonstrate that the proposed method significantly outperforms state-of-the-art approaches, with superior local fidelity, generalization, and robustness in complex precipitation forecasting scenarios.
arXiv Detail & Related papers (2025-10-16T17:59:13Z)
UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling [53.199942923818206]
Point cloud videos capture 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity. We propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos.
arXiv Detail & Related papers (2025-08-20T10:46:01Z)
Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [66.97034863216892]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. Current end-to-end frameworks suffer a critical spatial-temporal trade-off. We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z)
PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation [51.2555550979386]
Plausibility-Aware Motion Diffusion (PAMD) is a framework for generating dances that are both musically aligned and physically realistic. To provide more effective guidance during generation, we incorporate Prior Motion Guidance (PMG). Experiments show that PAMD significantly improves musical alignment and enhances the physical plausibility of generated motions.
arXiv Detail & Related papers (2025-05-26T14:44:09Z)
ReactDance: Progressive-Granular Representation for Long-Term Coherent Reactive Dance Generation [2.1920014462753064]
Reactive dance generation (RDG) produces follower movements conditioned on a guiding dancer and music. We present ReactDance, a novel diffusion-based framework for high-fidelity RDG with long-term coherence and multi-scale controllability.
arXiv Detail & Related papers (2025-05-08T18:42:38Z)
Bridge Frame and Event: Common Spatiotemporal Fusion for High-Dynamic Scene Optical Flow [21.821959971338767]
We propose a novel common modality fusion between frame and event modalities for high-dynamic scene optical flow. In motion fusion, we discover that the frame-based motion possesses spatially dense but temporally discontinuous correlation, while the event-based motion has sparse but temporally continuous correlation.
arXiv Detail & Related papers (2025-03-10T07:16:32Z)
Lagrangian Motion Fields for Long-term Motion Generation [32.548139921363756]
We introduce the concept of Lagrangian Motion Fields, specifically designed for long-term motion generation. By treating each joint as a Lagrangian particle with uniform velocity over short intervals, our approach condenses motion representations into a series of "supermotions". Our solution is versatile and lightweight, eliminating the need for neural network preprocessing.
arXiv Detail & Related papers (2024-09-03T01:38:06Z)
Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives [50.37531720524434]
We propose Lodge, a network capable of generating extremely long dance sequences conditioned on given music. Our approach can parallelly generate dance sequences of extremely long length, striking a balance between global choreographic patterns and local motion quality and expressiveness.
arXiv Detail & Related papers (2024-03-15T17:59:33Z)
A Decoupled Spatio-Temporal Framework for Skeleton-based Action Segmentation [89.86345494602642]
Existing methods are limited by weak spatio-temporal modeling capability. We propose a Decoupled Spatio-Temporal Framework (DeST) to address the issues. DeST significantly outperforms current state-of-the-art methods with less computational complexity.
arXiv Detail & Related papers (2023-12-10T09:11:39Z)
Segmenting the motion components of a video: A long-term unsupervised model [5.801044612920816]
We want to provide a coherent and stable motion segmentation over the video sequence. We propose a novel long-term spatio-temporal model operating in a totally unsupervised way. We report experiments on four VOS benchmarks, demonstrating competitive quantitative results.
arXiv Detail & Related papers (2023-10-02T09:33:54Z)
Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation. M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse. We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
arXiv Detail & Related papers (2023-08-28T10:40:16Z)
LongDanceDiff: Long-term Dance Generation with Conditional Diffusion Model [3.036230795326545]
LongDanceDiff is a conditional diffusion model for sequence-to-sequence long-term dance generation. It addresses the challenges of temporal coherency and spatial constraint. We also address common visual quality issues in dance generation, such as foot sliding and unsmooth motion.
arXiv Detail & Related papers (2023-08-23T06:37:41Z)
STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond [78.129039340528]
We propose a spatiotemporal-aware unit (STAU) for video prediction and beyond. Our STAU can outperform other methods on all tasks in terms of performance and efficiency.
arXiv Detail & Related papers (2022-04-20T13:42:51Z)
Spatiotemporal Inconsistency Learning for DeepFake Video Detection [51.747219106855624]
We present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along with both horizontal and vertical directions. And the ISM simultaneously utilizes the spatial information from SIM and temporal information from TIM to establish a more comprehensive spatial-temporal representation.
arXiv Detail & Related papers (2021-09-04T13:05:37Z)
Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition [42.175450800733785]
We propose a rich motion representation based on video self-similarity (STSS). We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it. The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision.
arXiv Detail & Related papers (2021-02-14T07:32:55Z)
Exploring Rich and Efficient Spatial Temporal Interactions for Real Time Video Salient Object Detection [87.32774157186412]
Mainstream methods formulate their video saliency mainly from two independent venues, i.e., the spatial and temporal branches. In this paper, we propose a spatiotemporal network to achieve such improvement in a fully interactive fashion. Our method is easy to implement yet effective, achieving high-quality video saliency detection at real-time speed (50 FPS).
arXiv Detail & Related papers (2020-08-07T03:24:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.