Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models
- URL: http://arxiv.org/abs/2601.07287v1
- Date: Mon, 12 Jan 2026 07:48:26 GMT
- Title: Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models
- Authors: Yuanyang Yin, Yufan Deng, Shenghai Yuan, Kaipeng Zhang, Xiao Yang, Feng Zhao,
- Abstract summary: Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. Existing I2V models prioritize visual consistency. How to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored.
- Score: 41.59364061354628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. This requires diffusion models to reconcile high-frequency visual constraints and low-frequency textual guidance during the denoising process. However, while existing I2V models prioritize visual consistency, how to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored. In this work, we observe that in Diffusion Transformer (DiT)-based I2V models, certain intermediate layers exhibit weak semantic responses (termed Semantic-Weak Layers), as indicated by a measurable drop in text-visual similarity. We attribute this to a phenomenon called Condition Isolation, where attention to visual features becomes partially detached from text guidance and overly relies on learned visual priors. To address this, we propose Focal Guidance (FG), which enhances the controllability from Semantic-Weak Layers. FG comprises two mechanisms: (1) Fine-grained Semantic Guidance (FSG) leverages CLIP to identify key regions in the reference frame and uses them as anchors to guide Semantic-Weak Layers. (2) Attention Cache transfers attention maps from semantically responsive layers to Semantic-Weak Layers, injecting explicit semantic signals and alleviating their over-reliance on the model's learned visual priors, thereby enhancing adherence to textual instructions. To further validate our approach and address the lack of evaluation in this direction, we introduce a benchmark for assessing instruction following in I2V models. On this benchmark, Focal Guidance proves its effectiveness and generalizability, raising the total score on Wan2.1-I2V to 0.7250 (+3.97%) and boosting the MMDiT-based HunyuanVideo-I2V to 0.5571 (+7.44%).
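The abstract names three concrete levers: a per-layer text-visual similarity measurement that flags Semantic-Weak Layers, a CLIP-based selection of key reference-frame regions (FSG), and an Attention Cache that transfers attention maps from semantically responsive layers into weak ones. The paper's exact formulation is not reproduced here, so the following is a minimal PyTorch sketch under assumed tensor shapes; the function names, the `drop_ratio` threshold, the `top_k` region count, and the `alpha` blending weight are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def layer_text_visual_similarity(text_feats, visual_feats):
    # Mean cosine similarity between text tokens [T, D] and visual tokens [N, D]
    # from one DiT layer's hidden states (shapes are assumptions for this sketch).
    text = F.normalize(text_feats, dim=-1)
    vis = F.normalize(visual_feats, dim=-1)
    return (text @ vis.T).mean().item()

def find_semantic_weak_layers(per_layer_text, per_layer_visual, drop_ratio=0.7):
    # Flag layers whose text-visual similarity drops well below the average
    # across layers; drop_ratio is a hypothetical threshold, not from the paper.
    sims = [layer_text_visual_similarity(t, v)
            for t, v in zip(per_layer_text, per_layer_visual)]
    mean_sim = sum(sims) / len(sims)
    weak = [i for i, s in enumerate(sims) if s < drop_ratio * mean_sim]
    return weak, sims

def select_key_regions(clip_patch_feats, clip_text_feat, top_k=8):
    # FSG-style anchor selection: rank reference-frame patch features [N, D]
    # by CLIP similarity to the prompt embedding [D]; top_k is illustrative.
    sims = F.normalize(clip_patch_feats, dim=-1) @ F.normalize(clip_text_feat, dim=-1)
    return sims.topk(top_k).indices

def transfer_attention(attn_maps, weak_layers, source_layer, alpha=0.5):
    # Attention Cache sketch: blend a semantically responsive layer's attention
    # map into each Semantic-Weak Layer; alpha is an assumed blending weight.
    cached = attn_maps[source_layer]
    patched = dict(attn_maps)
    for layer in weak_layers:
        patched[layer] = alpha * cached + (1 - alpha) * attn_maps[layer]
    return patched
```

In an actual pipeline, `per_layer_text`/`per_layer_visual` would come from forward hooks on the DiT blocks and `attn_maps` from the model's attention modules (e.g. in Wan2.1-I2V or HunyuanVideo-I2V); those integration details are omitted here.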
Related papers
- Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs [28.766303423132722]
Video Large Language Models (Vid-LLMs) typically fall into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences.
arXiv Detail & Related papers (2026-02-17T02:51:36Z)
- AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation [48.47444428530136]
Text-guided image-to-video (TI2V) generation has recently achieved remarkable progress, particularly in maintaining subject consistency and temporal coherence. Existing methods still struggle to adhere to fine-grained prompt semantics, especially when prompts entail substantial transformations of the input image. We introduce AlignVid, a training-free framework with two components: Attention Scaling Modulation (ASM) and Guidance Scheduling (GS).
arXiv Detail & Related papers (2025-12-01T06:53:48Z)
- Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding [30.223279362023337]
Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. Existing approaches commonly treat all text tokens uniformly during cross-modal attention, disregarding their distinct semantic roles. We propose DualGround, a dual-branch architecture that explicitly separates global and local semantics.
arXiv Detail & Related papers (2025-10-23T05:53:01Z)
- LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model [18.564067196226436]
We propose a text-to-motion diffusion model that incorporates dual-path anchoring to enhance semantic alignment. LUMA achieves state-of-the-art performance, with FID scores of 0.035 and 0.123, respectively.
arXiv Detail & Related papers (2025-09-29T17:58:28Z)
- Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation [120.23172120151821]
We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models. We introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences. We propose a new metric, Visual Semantic Matching, that quantifies visual inconsistencies in subject-driven image generation.
arXiv Detail & Related papers (2025-09-26T07:11:55Z)
- TEn-CATG: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph [28.536724593429398]
TEn-CATG is a text-enriched AVVP framework that combines semantic calibration with category-aware temporal reasoning. We show that TEn-CATG achieves robustness and superior ability to capture complex temporal and semantic dependencies in weakly supervised AVVP tasks.
arXiv Detail & Related papers (2025-09-04T10:32:40Z)
- Empowering Sparse-Input Neural Radiance Fields with Dual-Level Semantic Guidance from Dense Novel Views [66.1245505423179]
We show that rendered semantics can be treated as a more robust form of augmented data than rendered RGB. Our method enhances NeRF's performance by incorporating guidance derived from the rendered semantics.
arXiv Detail & Related papers (2025-03-04T03:13:44Z)
- Is Your Text-to-Image Model Robust to Caption Noise? [38.19377765665836]
In text-to-image (T2I) generation, a prevalent training technique involves utilizing Vision Language Models (VLMs) for image re-captioning. Even though VLMs are known to exhibit hallucination, generating descriptive content that deviates from the visual reality, the ramifications of such caption hallucinations on T2I generation performance remain under-explored.
arXiv Detail & Related papers (2024-12-27T08:53:37Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with purpose-designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that the model can be adapted to downstream visual perception tasks faster using the proposed VPD.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.