Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations
- URL: http://arxiv.org/abs/2507.04705v1
- Date: Mon, 07 Jul 2025 06:54:44 GMT
- Title: Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations
- Authors: Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Han Feng, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma
- Abstract summary: Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. Current end-to-end frameworks suffer a critical spatial-temporal trade-off. We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
- Score: 66.97034863216892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Identity-preserving text-to-video (IPT2V) generation, which aims to create high-fidelity videos with consistent human identity, has become crucial for downstream applications. However, current end-to-end frameworks suffer a critical spatial-temporal trade-off: optimizing for spatially coherent layouts of key elements (e.g., character identity preservation) often compromises instruction-compliant temporal smoothness, while prioritizing dynamic realism risks disrupting the spatial coherence of visual structures. To tackle this issue, we propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics. Specifically, our paper proposes a semantic prompt optimization mechanism and a stage-wise decoupled generation paradigm. The former module decouples the prompt into spatial and temporal components. Aligned with the subsequent stage-wise decoupled approach, the spatial prompts guide the text-to-image (T2I) stage to generate coherent spatial features, while the temporal prompts direct the sequential image-to-video (I2V) stage to ensure motion consistency. Experimental results validate that our approach achieves excellent spatiotemporal consistency, demonstrating outstanding performance in identity preservation, text relevance, and video quality. By leveraging this simple yet robust mechanism, our algorithm secures the runner-up position in the 2025 ACM MultiMedia Challenge.
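The abstract describes a two-stage, prompt-decoupled pipeline: split the input prompt into spatial (layout/identity) and temporal (motion) parts, render an identity-consistent keyframe with a T2I model, then animate it with an I2V model. The sketch below is a minimal illustration of that control flow only; `split_prompt`, `T2IStage`, and `I2VStage` are hypothetical placeholders (not the authors' released code), and the keyword heuristic merely stands in for the paper's semantic prompt optimization.

```python
from dataclasses import dataclass

@dataclass
class DecoupledPrompt:
    spatial: str   # layout / appearance / identity phrases (T2I stage)
    temporal: str  # motion phrases (I2V stage)

def split_prompt(prompt: str) -> DecoupledPrompt:
    """Toy stand-in for semantic prompt optimization: route clauses with
    motion keywords to the temporal prompt, everything else to the spatial one."""
    motion_words = ("walk", "run", "turn", "wave", "dance", "jump")
    spatial_terms, temporal_terms = [], []
    for clause in prompt.split(","):
        bucket = temporal_terms if any(w in clause.lower() for w in motion_words) else spatial_terms
        bucket.append(clause.strip())
    return DecoupledPrompt(", ".join(spatial_terms), ", ".join(temporal_terms))

class T2IStage:
    """Placeholder for an identity-conditioned text-to-image model."""
    def generate(self, spatial_prompt: str, identity_image):
        raise NotImplementedError  # plug in any T2I model that accepts a reference face

class I2VStage:
    """Placeholder for an image-to-video model driven by a motion prompt."""
    def animate(self, keyframe, temporal_prompt: str, num_frames: int = 49):
        raise NotImplementedError  # plug in any prompt-conditioned I2V model

def generate_ipt2v(prompt: str, identity_image, t2i: T2IStage, i2v: I2VStage):
    parts = split_prompt(prompt)
    keyframe = t2i.generate(parts.spatial, identity_image)  # stage 1: spatial coherence + identity
    return i2v.animate(keyframe, parts.temporal)             # stage 2: motion consistency

# Example split: "a woman in a red coat in a park, walking toward the camera"
# -> spatial:  "a woman in a red coat in a park"
# -> temporal: "walking toward the camera"
```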
Related papers
- STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing [35.50656689789427]
STR-Match is a training-free video editing system that produces visually appealing and coherent videos. STR-Match consistently outperforms existing methods in both visual quality and temporal consistency.
arXiv Detail & Related papers (2025-06-28T12:36:19Z) - STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding [48.12128042470839]
We propose an integrated Spatial-TempOral dynamic Prompting (STOP) model. It consists of two complementary modules, intra-frame spatial prompting and inter-frame temporal prompting. STOP consistently achieves superior performance against state-of-the-art methods.
arXiv Detail & Related papers (2025-03-20T09:16:20Z) - Text2Story: Advancing Video Storytelling with Text Guidance [20.51001299249891]
We introduce a novel storytelling approach to enable seamless video generation with natural action transitions and structured narratives. Our approach bridges the gap between short clips and extended video to establish a new paradigm in GenAI-driven video synthesis from text.
arXiv Detail & Related papers (2025-03-08T19:04:36Z) - Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP [34.88916568947695]
We propose a novel CLIP-based framework to comprehend multi-modal spatiotemporal dynamics. For the vision side, we propose an efficient Dynamic Cross-shot Attention. For the semantic side, we conduct text augmentation by constructing an Action Knowledge Graph.
arXiv Detail & Related papers (2024-12-13T06:30:52Z) - Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, which increases the complexity of text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Spatiotemporal Inconsistency Learning for DeepFake Video Detection [51.747219106855624]
We present a novel temporal modeling paradigm in the TIM by exploiting the temporal difference over adjacent frames along both horizontal and vertical directions. The ISM simultaneously utilizes the spatial information from the SIM and the temporal information from the TIM to establish a more comprehensive spatial-temporal representation.
arXiv Detail & Related papers (2021-09-04T13:05:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.