SSG-DiT: A Spatial Signal Guided Framework for Controllable Video Generation
- URL: http://arxiv.org/abs/2508.17062v1
- Date: Sat, 23 Aug 2025 15:30:17 GMT
- Title: SSG-DiT: A Spatial Signal Guided Framework for Controllable Video Generation
- Authors: Peng Hu, Yu Gu, Liang Luo, Fuji Ren
- Abstract summary: Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions. Existing models often struggle to maintain strong semantic consistency. We propose SSG-DiT, a novel and efficient framework for high-fidelity controllable video generation.
- Score: 22.1310564466224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions, such as text descriptions and initial images. However, a significant challenge persists in this domain: existing models often struggle to maintain strong semantic consistency, frequently generating videos that deviate from the nuanced details specified in the prompts. To address this issue, we propose SSG-DiT (Spatial Signal Guided Diffusion Transformer), a novel and efficient framework for high-fidelity controllable video generation. Our approach introduces a decoupled two-stage process. The first stage, Spatial Signal Prompting, generates a spatially aware visual prompt by leveraging the rich internal representations of a pre-trained multi-modal model. This prompt, combined with the original text, forms a joint condition that is then injected into a frozen video DiT backbone via our lightweight and parameter-efficient SSG-Adapter. This unique design, featuring a dual-branch attention mechanism, allows the model to simultaneously harness its powerful generative priors while being precisely steered by external spatial signals. Extensive experiments demonstrate that SSG-DiT achieves state-of-the-art performance, outperforming existing models on multiple key metrics in the VBench benchmark, particularly in spatial relationship control and overall consistency.
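The injection mechanism described in the abstract, a frozen video DiT steered through a lightweight dual-branch adapter, lends itself to a compact illustration. Below is a minimal PyTorch sketch of one way such a block could be wired: one branch reuses a frozen DiT layer (the backbone's generative prior), while a small trainable cross-attention branch attends to the spatially aware visual-prompt tokens produced in the first stage. The class and argument names, the block interface, and the zero-initialized gate are illustrative assumptions, not the paper's exact SSG-Adapter design.

```python
# Minimal sketch only: module names, dimensions, and the gating scheme are
# assumptions for illustration, not the exact SSG-DiT design.
import torch
import torch.nn as nn


class DualBranchAdapterBlock(nn.Module):
    """Hypothetical dual-branch adapter wrapped around one frozen DiT block.

    Branch 1 reuses the frozen block, preserving the backbone's generative prior.
    Branch 2 is a small trainable cross-attention over the spatially aware
    visual-prompt tokens from the Spatial Signal Prompting stage.
    """

    def __init__(self, frozen_block: nn.Module, dim: int, num_heads: int = 8):
        super().__init__()
        self.frozen_block = frozen_block
        for p in self.frozen_block.parameters():
            p.requires_grad_(False)  # the video DiT backbone stays frozen

        # Only these adapter parameters are trained.
        self.spatial_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: start from the frozen behaviour

    def forward(self, video_tokens, text_tokens, spatial_prompt_tokens):
        # Branch 1: the frozen block, conditioned on text as in the base model.
        base = self.frozen_block(video_tokens, text_tokens)

        # Branch 2: video tokens attend to the external spatial-prompt tokens.
        guided, _ = self.spatial_cross_attn(
            query=self.norm(base),
            key=spatial_prompt_tokens,
            value=spatial_prompt_tokens,
        )

        # Blend the generative prior with the spatial guidance via a learned gate.
        return base + torch.tanh(self.gate) * guided


if __name__ == "__main__":
    # Stand-in frozen block, for shape checking only.
    class DummyDiTBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, 8, batch_first=True)

        def forward(self, video_tokens, text_tokens):
            ctx = torch.cat([video_tokens, text_tokens], dim=1)
            out, _ = self.attn(video_tokens, ctx, ctx)
            return video_tokens + out

    dim = 64
    block = DualBranchAdapterBlock(DummyDiTBlock(dim), dim)
    v = torch.randn(2, 16, dim)   # video latent tokens
    t = torch.randn(2, 8, dim)    # text tokens
    s = torch.randn(2, 32, dim)   # spatial-prompt tokens from stage one
    print(block(v, t, s).shape)   # torch.Size([2, 16, 64])
```

The zero-initialized gate keeps the adapted model identical to the frozen backbone at the start of training, a common choice for parameter-efficient adapters; the abstract does not specify whether SSG-DiT uses this exact blending scheme.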
Related papers
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation [14.141157176094737]
Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions. Existing I2V pipelines often suffer from appearance drift and geometric distortion. We propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views.
arXiv Detail & Related papers (2026-02-10T18:59:51Z)
- MultiShotMaster: A Controllable Multi-Shot Video Generation Framework [67.38203939500157]
Current generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos. We propose MultiShotMaster, a framework for highly controllable multi-shot video generation.
arXiv Detail & Related papers (2025-12-02T18:59:48Z)
- CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion [62.04833878126661]
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. We propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic). Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
arXiv Detail & Related papers (2025-11-26T07:27:11Z)
- BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration [56.98981194478512]
We propose a unified framework that handles a broad range of subject-to-video scenarios. We introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos.
arXiv Detail & Related papers (2025-10-01T02:41:11Z)
- Pushing the Boundaries of State Space Models for Image and Video Generation [26.358592737557956]
We build the largest-scale diffusion SSM-Transformer hybrid model to date (5B parameters) based on the sub-quadratic bi-directional Hydra and self-attention. Our results demonstrate that the model can produce faithful results aligned with complex text prompts and temporally consistent videos with high dynamics.
arXiv Detail & Related papers (2025-02-03T00:51:09Z)
- Identity-Preserving Text-to-Video Generation by Frequency Decomposition [52.19475797580653]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model that keeps human identity consistent in the generated video.
arXiv Detail & Related papers (2024-11-26T13:58:24Z)
- DiVE: DiT-based Video Generation with Enhanced Control [23.63288169762629]
We propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos.
Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency.
arXiv Detail & Related papers (2024-09-03T04:29:59Z)
- Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural spatio-temporal alignment learning method.
It consistently improves 13 existing strong-performing video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z)
- SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models [84.71887272654865]
We present SparseCtrl to enable flexible structure control with temporally sparse signals.
It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched.
The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images.
arXiv Detail & Related papers (2023-11-28T16:33:08Z)
- Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)