BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation
- URL: http://arxiv.org/abs/2505.06985v1
- Date: Sun, 11 May 2025 14:11:12 GMT
- Title: BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation
- Authors: Panwen Hu, Jiehui Huang, Qiang Sun, Xiaodan Liang
- Abstract summary: We propose an autoregressive structure and texture propagation module (STPM) for customized text-to-video (CT2V) generation. STPM extracts key structural and texture features from the reference subject and injects them autoregressively into each video frame to enhance consistency. We also introduce a test-time reward optimization (TTRO) method to further refine fine-grained details.
- Score: 47.21414443162965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Both zero-shot and tuning-based customized text-to-image (CT2I) generation have made significant progress for storytelling content creation. In contrast, research on customized text-to-video (CT2V) generation remains relatively limited. Existing zero-shot CT2V methods suffer from poor generalization, while another line of work directly combining tuning-based T2I models with temporal motion modules often leads to the loss of structural and texture information. To bridge this gap, we propose an autoregressive structure and texture propagation module (STPM), which extracts key structural and texture features from the reference subject and injects them autoregressively into each video frame to enhance consistency. Additionally, we introduce a test-time reward optimization (TTRO) method to further refine fine-grained details. Quantitative and qualitative experiments validate the effectiveness of STPM and TTRO, demonstrating improvements of 7.8 and 13.1 in CLIP-I and DINO consistency metrics over the baseline, respectively.
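To make the abstract's two components concrete, below is a minimal, self-contained PyTorch sketch of (i) propagating reference features autoregressively into each generated frame and (ii) a small test-time loop that ascends a reward measuring similarity to the reference. Every module name, shape, and the cosine reward are illustrative placeholders chosen for this sketch, not the paper's STPM/TTRO implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    """Stand-in encoder that produces per-frame structure/texture features."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )

    def forward(self, x):
        return self.net(x)

class FrameGenerator(nn.Module):
    """Stand-in generator conditioned on reference and previous-frame features."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.fuse = nn.Conv2d(2 * dim, dim, 1)
        self.decode = nn.Sequential(
            nn.Upsample(scale_factor=8), nn.Conv2d(dim, 3, 3, padding=1),
        )

    def forward(self, ref_feat, prev_feat):
        return torch.tanh(self.decode(self.fuse(torch.cat([ref_feat, prev_feat], 1))))

def generate_video(ref_image, num_frames=8, ttro_steps=3, lr=0.05):
    enc, gen = FeatureExtractor(), FrameGenerator()
    with torch.no_grad():
        ref_feat = enc(ref_image)            # identity features from the reference subject
    prev_feat, frames = ref_feat, []
    for _ in range(num_frames):
        with torch.no_grad():
            frame = gen(ref_feat, prev_feat) # inject reference + previous-frame features

        # Test-time reward step: nudge the frame toward the reference identity.
        frame = frame.detach().requires_grad_(True)
        opt = torch.optim.Adam([frame], lr=lr)
        for _ in range(ttro_steps):
            reward = F.cosine_similarity(enc(frame).flatten(1), ref_feat.flatten(1)).mean()
            opt.zero_grad()
            (-reward).backward()             # gradient ascent on the reward
            opt.step()

        frames.append(frame.detach())
        with torch.no_grad():
            prev_feat = enc(frames[-1])      # propagate features autoregressively
    return torch.stack(frames, dim=1)        # (B, T, 3, H, W)

video = generate_video(torch.rand(1, 3, 64, 64))
print(video.shape)                           # torch.Size([1, 8, 3, 64, 64])
```

In this sketch the reward plays the role of a consistency signal analogous to the CLIP-I/DINO image-similarity metrics reported above: each frame's features are pulled back toward the reference subject before the next frame is generated.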
Related papers
- STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing [35.50656689789427]
STR-Match is a training-free video editing system that produces visually appealing and coherent videos. STR-Match consistently outperforms existing methods in both visual quality and temporal consistency.
arXiv Detail & Related papers (2025-06-28T12:36:19Z) - DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation [63.781450025764904]
DynamiCtrl is a novel framework that explores different pose-guided structures in MM-DiT. We propose Pose-adaptive Layer Norm (PadaLN), which utilizes adaptive layer normalization to encode sparse pose features. By leveraging text, we not only enable fine-grained control over the generated content but also, for the first time, achieve simultaneous control over both background and motion.
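For intuition on the adaptive-layer-norm idea, here is a minimal sketch of layer normalization whose scale and shift are predicted from a pose embedding; the class name, shapes, and zero-initialized modulation are assumptions made for illustration, not the authors' PadaLN code.

```python
import torch
import torch.nn as nn

class PoseAdaptiveLayerNorm(nn.Module):
    """LayerNorm whose scale/shift are predicted from a pose embedding (illustrative)."""
    def __init__(self, hidden_dim: int, pose_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(pose_dim, 2 * hidden_dim)
        nn.init.zeros_(self.to_scale_shift.weight)   # start out as a plain LayerNorm
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, tokens: torch.Tensor, pose_emb: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, hidden_dim), pose_emb: (B, pose_dim)
        scale, shift = self.to_scale_shift(pose_emb).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Example usage
tokens = torch.randn(2, 16, 128)                 # video/text tokens
pose_emb = torch.randn(2, 32)                    # flattened sparse pose features
out = PoseAdaptiveLayerNorm(128, 32)(tokens, pose_emb)
print(out.shape)                                 # torch.Size([2, 16, 128])
```

Zero-initializing the modulation layer makes the block behave as a plain LayerNorm at the start of training, so the pose signal is introduced gradually.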
arXiv Detail & Related papers (2025-03-27T08:07:45Z) - Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion [26.706957163997043]
We propose a framework that integrates temporal-spatial and semantic consistency with bilateral DDIM inversion. Our method significantly improves perceptual quality, text-image alignment, and temporal coherence, as demonstrated on the MSR-VTT dataset.
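For background, the sketch below shows plain DDIM inversion: deterministically re-noising a clean latent by running the DDIM update in reverse with a frozen noise predictor. The `eps_model` interface and the alpha schedule are placeholders, and this is the generic procedure, not the bilateral, temporal-aware variant proposed in that paper.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, num_steps=50):
    # x0: clean latent; eps_model(x, t) -> predicted noise; alphas_cumprod: (T,) schedule.
    timesteps = torch.linspace(0, len(alphas_cumprod) - 1, num_steps).long()
    x = x0
    for t_prev, t in zip(timesteps[:-1], timesteps[1:]):
        a_prev, a_t = alphas_cumprod[t_prev], alphas_cumprod[t]
        eps = eps_model(x, t_prev)                              # noise prediction at the earlier step
        x0_pred = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        x = a_t.sqrt() * x0_pred + (1 - a_t).sqrt() * eps       # deterministic re-noising toward step t
    return x  # approximately the noise that regenerates x0 under DDIM sampling

# Toy usage with a dummy noise predictor and a linear alpha-bar schedule.
alphas_cumprod = torch.linspace(0.9999, 0.01, 1000)
dummy_eps = lambda x, t: torch.zeros_like(x)
noise_latent = ddim_invert(torch.randn(1, 4, 8, 8), dummy_eps, alphas_cumprod)
print(noise_latent.shape)                                       # torch.Size([1, 4, 8, 8])
```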
arXiv Detail & Related papers (2025-01-08T16:41:31Z) - Identity-Preserving Text-to-Video Generation by Frequency Decomposition [52.19475797580653]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model that keeps human identity consistent in the generated video.
arXiv Detail & Related papers (2024-11-26T13:58:24Z) - Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z) - TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation [97.96178992465511]
We argue that generated videos should incorporate the emergence of new concepts and their relation transitions over time, as real-world videos do.
To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics.
arXiv Detail & Related papers (2024-06-12T21:41:32Z) - Dynamic Relation Transformer for Contextual Text Block Detection [9.644204545582742]
Contextual Text Block Detection (CTBD) is the task of identifying coherent text blocks within complex natural scenes.
Previous methodologies have treated CTBD as either a visual relation extraction challenge within computer vision or as a sequence modeling problem.
We introduce a new framework that frames CTBD as a graph generation problem.
arXiv Detail & Related papers (2024-01-17T14:17:59Z) - Towards Consistent Video Editing with Text-to-Image Diffusion Models [10.340371518799444]
Existing works have advanced Text-to-Image (TTI) diffusion models for video editing in a one-shot learning manner.
These methods can produce results with unsatisfactory consistency with respect to both the text prompt and the temporal sequence.
We propose a novel EI$^2$ model towards Enhancing vIdeo Editing consIstency of TTI-based frameworks.
arXiv Detail & Related papers (2023-05-27T10:03:36Z) - DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces a new framework, dubbed DirecT2V, for text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
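Dual-softmax filtering is a generic technique (best known from feature matching): normalize a similarity matrix with a softmax along each axis, multiply the two, and keep only mutually confident pairs. The sketch below shows that generic form; the threshold, temperature, and how DirecT2V actually wires it into attention are assumptions.

```python
import torch
import torch.nn.functional as F

def dual_softmax_filter(sim: torch.Tensor, threshold: float = 0.2, temperature: float = 0.1):
    # sim: (N, M) similarity scores between two sets of features.
    probs = F.softmax(sim / temperature, dim=0) * F.softmax(sim / temperature, dim=1)
    return probs * (probs > threshold)          # zero out pairs not confident in both directions

scores = torch.randn(5, 7)
filtered = dual_softmax_filter(scores)
print(filtered.shape)                           # torch.Size([5, 7])
```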
arXiv Detail & Related papers (2023-05-23T17:57:09Z) - Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
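One way to read this consistency idea is as an equivariance constraint: when the video is transformed along the time axis, the predicted temporal boundary should transform the same way. The sketch below uses temporal reversal as the transform and a placeholder model interface; it is a hypothetical illustration, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def equivariant_consistency_loss(model, video, query):
    # model(video, query) -> (B, 2) tensor of [start, end] boundaries normalized to [0, 1]
    pred = model(video, query)                  # boundary predicted on the original clip

    video_rev = torch.flip(video, dims=[1])     # reverse the time axis of (B, T, C, H, W)
    pred_rev = model(video_rev, query)          # boundary predicted on the reversed clip

    # Reversing time maps a boundary (s, e) to (1 - e, 1 - s).
    expected = torch.stack([1 - pred[..., 1], 1 - pred[..., 0]], dim=-1)
    return F.smooth_l1_loss(pred_rev, expected)

# Toy usage with a dummy grounding model that ignores its inputs.
dummy_model = lambda video, query: torch.tensor([[0.2, 0.6]])
loss = equivariant_consistency_loss(dummy_model, torch.rand(1, 16, 3, 32, 32), "a person opens a door")
print(loss.item())
```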
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.