AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation
- URL: http://arxiv.org/abs/2512.01334v1
- Date: Mon, 01 Dec 2025 06:53:48 GMT
- Title: AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation
- Authors: Yexin Liu, Wen-Jie Shu, Zile Huang, Haoze Zheng, Yueze Wang, Manyuan Zhang, Ser-Nam Lim, Harry Yang
- Abstract summary: Text-guided image-to-video (TI2V) generation has recently achieved remarkable progress, particularly in maintaining subject consistency and temporal coherence. Existing methods still struggle to adhere to fine-grained prompt semantics, especially when prompts entail substantial transformations of the input image. We introduce AlignVid, a training-free framework with two components: Attention Scaling Modulation (ASM) and Guidance Scheduling (GS).
- Score: 48.47444428530136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-guided image-to-video (TI2V) generation has recently achieved remarkable progress, particularly in maintaining subject consistency and temporal coherence. However, existing methods still struggle to adhere to fine-grained prompt semantics, especially when prompts entail substantial transformations of the input image (e.g., object addition, deletion, or modification), a shortcoming we term semantic negligence. In a pilot study, we find that applying a Gaussian blur to the input image improves semantic adherence. Analyzing attention maps, we observe clearer foreground-background separation. From an energy perspective, this corresponds to a lower-entropy cross-attention distribution. Motivated by this, we introduce AlignVid, a training-free framework with two components: (i) Attention Scaling Modulation (ASM), which directly reweights attention via lightweight Q or K scaling, and (ii) Guidance Scheduling (GS), which applies ASM selectively across transformer blocks and denoising steps to reduce visual quality degradation. This minimal intervention improves prompt adherence while limiting aesthetic degradation. In addition, we introduce OmitI2V to evaluate semantic negligence in TI2V generation, comprising 367 human-annotated samples that span addition, deletion, and modification scenarios. Extensive experiments demonstrate that AlignVid can enhance semantic fidelity.
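The abstract describes ASM as reweighting attention through lightweight Q or K scaling, and links the resulting lower-entropy cross-attention distribution to clearer foreground-background separation. The following is a minimal sketch of that idea, assuming plain scaled-dot-product cross-attention; all names, shapes, and the scale value are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, scale=1.0):
    """Attention weights with an ASM-style scalar applied to Q.

    Scaling Q by s multiplies every logit by s, so scale > 1 sharpens the
    softmax, i.e. lowers the entropy of each attention row.
    """
    d = Q.shape[-1]
    logits = (scale * Q) @ K.T / np.sqrt(d)
    return softmax(logits, axis=-1)

def attention_entropy(A):
    """Mean Shannon entropy (in nats) of the attention rows."""
    return float(-(A * np.log(A + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 visual query tokens, head dim 8
K = rng.standard_normal((6, 8))   # 6 text tokens

A_base = cross_attention(Q, K, scale=1.0)
A_asm = cross_attention(Q, K, scale=1.5)
print(attention_entropy(A_asm) < attention_entropy(A_base))  # → True
```

Because multiplying Q by a scalar s > 1 is equivalent to softmax temperature 1/s, entropy decreases for any non-uniform logits; the paper's Guidance Scheduling would additionally gate where (which blocks and denoising steps) such a scale is applied.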
Related papers
- SemanticNVS: Improving Semantic Scene Understanding in Generative Novel View Synthesis [25.524477911101325]
We present SemanticNVS, a camera-conditioned multi-view diffusion model for novel view synthesis (NVS). Existing NVS methods generate semantically implausible and distorted images under long-range camera motion. We propose to integrate pre-trained semantic feature extractors to incorporate stronger scene semantics as conditioning, achieving high-quality generation even at distant viewpoints.
arXiv Detail & Related papers (2026-02-23T17:45:21Z)
- Attention Retention for Continual Learning with Vision Transformers [23.71599936772596]
Continual learning (CL) empowers AI systems to acquire knowledge from non-stationary data streams. We identify attention drift in Vision Transformers as a primary source of catastrophic forgetting. We propose a novel attention-retaining framework to mitigate forgetting in CL.
arXiv Detail & Related papers (2026-01-12T07:48:26Z)
- Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models [41.59364061354628]
Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. Existing I2V models prioritize visual consistency. How to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored.
arXiv Detail & Related papers (2025-10-15T04:09:00Z)
- OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment [8.625923727928752]
Text-image alignment is a foundational challenge in multimedia content understanding. We propose to use the open semantic knowledge of Large Language Models (LLMs) to fill the entropy gap. Comprehensive evaluations on the Flickr30K and MS-COCO benchmarks validate the superiority of our Open Semantic Hypergraph Adapter (OS-HGAdapter).
arXiv Detail & Related papers (2025-09-01T04:00:22Z)
- FICGen: Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation [16.628211648386454]
FICGen seeks to transfer frequency knowledge of degraded images into the latent diffusion space. FICGen consistently surpasses existing L2I methods in terms of generative fidelity, alignment, and downstream auxiliary trainability.
arXiv Detail & Related papers (2025-07-11T07:51:26Z)
- From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning [65.94580484237737]
Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors. We build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance.
arXiv Detail & Related papers (2025-06-16T04:56:58Z)
- Learning Event Completeness for Weakly Supervised Video Anomaly Detection [5.140169437190526]
We present Learning Event Completeness for Weakly Supervised Video Anomaly Detection (LEC-VAD). LEC-VAD encodes both category-aware and category-agnostic semantics between vision and language. We develop a memory bank-based prototype learning mechanism to enrich concise text descriptions associated with anomaly-event categories.
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-03-04T03:13:44Z)
- Empowering Sparse-Input Neural Radiance Fields with Dual-Level Semantic Guidance from Dense Novel Views [66.1245505423179]
We show that rendered semantics can be treated as a more robust form of augmented data than rendered RGB. Our method enhances NeRF's performance by incorporating guidance derived from the rendered semantics.
arXiv Detail & Related papers (2025-02-17T12:26:34Z)
- Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent [72.1517476116743]
Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets. Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this issue. We introduce a novel perspective leveraging effective rank to quantify the degradation of visual representations. We propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations.
arXiv Detail & Related papers (2025-02-17T12:26:34Z) - 2D Feature Distillation for Weakly- and Semi-Supervised 3D Semantic
Segmentation [92.17700318483745]
We propose an image-guidance network (IGNet) which builds upon the idea of distilling high level feature information from a domain adapted synthetically trained 2D semantic segmentation network.
IGNet achieves state-of-the-art results for weakly-supervised LiDAR semantic segmentation on ScribbleKITTI, reaching up to 98% of fully supervised performance with only 8% labeled points.
arXiv Detail & Related papers (2023-11-27T07:57:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.