Object Isolated Attention for Consistent Story Visualization
- URL: http://arxiv.org/abs/2503.23353v1
- Date: Sun, 30 Mar 2025 08:16:52 GMT
- Title: Object Isolated Attention for Consistent Story Visualization
- Authors: Xiangyang Luo, Junhao Cheng, Yifan Xie, Xin Zhang, Tao Feng, Zhou Liu, Fei Ma, Fei Yu,
- Abstract summary: Open-ended story visualization is a challenging task that involves generating coherent image sequences from a given storyline. One of the main difficulties is maintaining character consistency while creating natural and contextually fitting scenes. We propose an enhanced Transformer module that uses separate self-attention and cross-attention mechanisms.
- Score: 16.721634474902036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-ended story visualization is a challenging task that involves generating coherent image sequences from a given storyline. One of the main difficulties is maintaining character consistency while creating natural and contextually fitting scenes--an area where many existing methods struggle. In this paper, we propose an enhanced Transformer module that uses separate self-attention and cross-attention mechanisms, leveraging prior knowledge from pre-trained diffusion models to ensure logical scene creation. The isolated self-attention mechanism improves character consistency by refining attention maps to reduce focus on irrelevant areas and highlight key features of the same character. Meanwhile, the isolated cross-attention mechanism independently processes each character's features, avoiding feature fusion and further strengthening consistency. Notably, our method is training-free, allowing the continuous generation of new characters and storylines without re-tuning. Both qualitative and quantitative evaluations show that our approach outperforms current methods, demonstrating its effectiveness.
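To make the mechanism concrete, here is a minimal PyTorch sketch of the two isolated attention variants as the abstract describes them; the per-character masks, function names, and masking policy are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of isolated self-/cross-attention, assuming per-character
# spatial masks (e.g. derived from cross-attention maps) are available.
# All names and the masking policy are illustrative, not the authors' code.
import torch
import torch.nn.functional as F

def isolated_self_attention(q, k, v, char_masks):
    """q, k, v: (tokens, dim); char_masks: list of (tokens,) bools marking
    which image tokens belong to each character."""
    logits = q @ k.transpose(-2, -1) * q.shape[-1] ** -0.5
    for m in char_masks:
        # Suppress attention from a character's tokens to regions outside
        # that character, highlighting its own features instead.
        logits = logits.masked_fill(m[:, None] & ~m[None, :], float("-inf"))
    return F.softmax(logits, dim=-1) @ v

def isolated_cross_attention(q, char_text_kv, char_masks):
    """Each character's text features (k, v) are attended to only from its
    own region, so features of different characters never fuse."""
    out = torch.zeros_like(q)
    for m, (k, v) in zip(char_masks, char_text_kv):
        attn = F.softmax(q[m] @ k.transpose(-2, -1) * q.shape[-1] ** -0.5, dim=-1)
        out[m] = attn @ v
    return out
```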
Related papers
- Storybooth: Training-free Multi-Subject Consistency for Improved Visual Storytelling [5.713041172936274]
Cross-frame self-attention improves subject consistency by allowing tokens in each frame to attend to tokens in other frames during self-attention computation (sketched below).
Our exploration reveals that self-attention leakage is exacerbated when trying to ensure consistency across multiple characters.
Motivated by these findings, we propose StoryBooth: a training-free approach for improving multi-character consistency.
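The cross-frame variant can be sketched as follows: each frame's queries attend to keys and values gathered from all frames. Shapes and projection names are assumptions; this only illustrates the mechanism the summary describes.

```python
# Hedged sketch of cross-frame self-attention; shapes and names are assumed.
import torch
import torch.nn.functional as F

def cross_frame_self_attention(x, to_q, to_k, to_v):
    """x: (frames, tokens, dim). Every frame attends over all frames' tokens."""
    f, t, d = x.shape
    q = to_q(x)                                         # (f, t, d)
    k = to_k(x).reshape(1, f * t, d).expand(f, -1, -1)  # keys from all frames
    v = to_v(x).reshape(1, f * t, d).expand(f, -1, -1)
    attn = F.softmax(q @ k.transpose(-2, -1) * d ** -0.5, dim=-1)
    return attn @ v                                     # (f, t, d)

d = 64
to_q, to_k, to_v = (torch.nn.Linear(d, d) for _ in range(3))
out = cross_frame_self_attention(torch.randn(4, 256, d), to_q, to_k, to_v)
```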
arXiv Detail & Related papers (2025-04-08T08:30:55Z)
- Enhanced Multi-Scale Cross-Attention for Person Image Generation [140.90068397518655]
We propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. We introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively.
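As an illustration of that correlation matrix, here is a minimal sketch of cross-attention between two flattened feature maps; the names and the residual fusion are assumptions, not the paper's API.

```python
# Illustrative cross-attention fusion between two modality feature maps
# (e.g. appearance and shape); the correlation matrix is computed between
# their flattened spatial features. Names and residual are assumptions.
import torch
import torch.nn.functional as F

def cross_attention_fuse(feat_a, feat_b):
    """feat_a, feat_b: (channels, height, width) feature maps."""
    c, h, w = feat_a.shape
    a = feat_a.reshape(c, h * w).T                 # (hw, c) queries from modality A
    b = feat_b.reshape(c, h * w).T                 # (hw, c) keys/values from B
    corr = F.softmax(a @ b.T * c ** -0.5, dim=-1)  # (hw, hw) correlation matrix
    fused = corr @ b                               # B features aligned to A's layout
    return (a + fused).T.reshape(c, h, w)          # residual fusion
```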
arXiv Detail & Related papers (2025-01-15T16:08:25Z)
- Nested Attention: Semantic-aware Attention Values for Concept Personalization [78.90196530697897]
We introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image.
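One plausible reading of query-dependent subject values, sketched under assumptions (the inner attention and all shapes are illustrative, not the paper's exact design):

```python
# Rough sketch: an inner ("nested") attention over subject tokens yields a
# per-query value for each image region. Names/shapes are assumptions.
import torch
import torch.nn.functional as F

def nested_subject_values(q_img, subj_k, subj_v):
    """q_img: (tokens, dim) image queries; subj_k, subj_v: (subj_tokens, dim)
    features of the personalized subject. Returns one value per image query."""
    inner = F.softmax(q_img @ subj_k.T * q_img.shape[-1] ** -0.5, dim=-1)
    return inner @ subj_v    # (tokens, dim): a value tailored to each region
```

In the abstract's terms, these per-query values would then be injected into the model's existing cross-attention layers in place of a single fixed subject value.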
arXiv Detail & Related papers (2025-01-02T18:52:11Z)
- CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis [8.386261591495103]
We introduce CoCoNO, a new algorithm that optimizes the initial latent by leveraging the complementary information within self-attention and cross-attention maps.
Our method introduces two new loss functions: the attention contrast loss, which minimizes undesirable overlap by ensuring each self-attention segment is exclusively linked to a specific subject's cross-attention map, and the attention complete loss, which maximizes the activation within these segments to guarantee that each subject is fully and distinctly represented.
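A hedged sketch of the two losses, assuming binary self-attention segments have already been paired one-to-one with per-subject cross-attention maps; the pairing and normalization here are assumptions.

```python
# Sketch of the contrast and complete losses; pairing segment i with subject
# i and the normalization scheme are assumptions for illustration.
import torch

def attention_contrast_complete(segments, cross_maps, eps=1e-8):
    """segments, cross_maps: (num_subjects, H, W); cross_maps non-negative."""
    n = segments.shape[0]
    contrast = cross_maps.new_zeros(())
    complete = cross_maps.new_zeros(())
    for i in range(n):
        total = cross_maps[i].sum() + eps
        inside = (cross_maps[i] * segments[i]).sum() / total
        # Contrast: penalize subject i's attention leaking into other segments.
        for j in range(n):
            if j != i:
                contrast = contrast + (cross_maps[i] * segments[j]).sum() / total
        # Complete: push subject i's activation to fill its own segment.
        complete = complete + (1.0 - inside)
    return contrast / n, complete / n
```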
arXiv Detail & Related papers (2024-11-25T08:20:14Z)
- Foundation Cures Personalization: Improving Personalized Models' Prompt Consistency via Hidden Foundation Knowledge [33.35678923549471]
FreeCure is a framework that improves the prompt consistency of personalization models. We introduce a novel foundation-aware self-attention module, coupled with an inversion-based process to bring well-aligned attribute information to the personalization process. FreeCure has demonstrated significant improvements in prompt consistency across a diverse set of state-of-the-art facial personalization models.
arXiv Detail & Related papers (2024-11-22T15:21:38Z)
- DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
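The two steps can be sketched as follows; the masking policy and the interpolation schedule are assumptions for illustration, not the authors' exact choices.

```python
# Sketch of attention masking for disentanglement and early-step feature
# interpolation between source and target; schedule/policy are assumed.
import torch

def blend_attention_features(feat_src, feat_tgt, step, n_steps, early_frac=0.3):
    """Fade from the source image's attention features to the target's during
    the early denoising steps, then use the target's features only."""
    if step >= early_frac * n_steps:
        return feat_tgt
    alpha = step / (early_frac * n_steps)          # 0 -> source, 1 -> target
    return (1.0 - alpha) * feat_src + alpha * feat_tgt

def mask_object_attention(logits, obj_mask):
    """Block attention between an object's tokens and everything else so
    generation stays disentangled. obj_mask: (tokens,) bool for one object."""
    block = obj_mask[:, None] ^ obj_mask[None, :]  # cross-object token pairs
    return logits.masked_fill(block, float("-inf"))
```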
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
- Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any model with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
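A toy version of such a penalty, using simple moment matching purely as a stand-in; the paper's actual divergence differs, so this only illustrates the idea of pulling the per-head key and query distributions together.

```python
# Toy distribution-matching penalty between queries and keys within each
# head; mean/variance matching is an assumed stand-in, not the paper's loss.
import torch

def alignment_penalty(q, k):
    """q, k: (heads, tokens, dim). Encourage the empirical distributions of
    queries and keys to agree head-by-head."""
    mq, mk = q.mean(dim=1), k.mean(dim=1)          # (heads, dim) means
    vq, vk = q.var(dim=1), k.var(dim=1)            # (heads, dim) variances
    return ((mq - mk) ** 2).mean() + ((vq - vk) ** 2).mean()
```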
arXiv Detail & Related papers (2021-10-25T00:54:57Z)
- Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos [78.34818195786846]
We introduce the task of spatially localizing narrated interactions in videos.
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
We propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.
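A minimal sketch of a contrastive objective over cross-modal attention, with the attention pooling and all names assumed for illustration:

```python
# Sketch: narration embeddings attend over video regions; matched
# video-narration pairs are pulled together with InfoNCE. Names assumed.
import torch
import torch.nn.functional as F

def grounded_contrastive_loss(video_feats, text_embs, temperature=0.07):
    """video_feats: (batch, regions, dim); text_embs: (batch, dim)."""
    attn = F.softmax(torch.einsum("bd,brd->br", text_embs, video_feats), dim=-1)
    pooled = torch.einsum("br,brd->bd", attn, video_feats)  # attended video feature
    pooled = F.normalize(pooled, dim=-1)
    text = F.normalize(text_embs, dim=-1)
    logits = pooled @ text.T / temperature         # (batch, batch) similarities
    labels = torch.arange(len(logits))             # matched pairs on the diagonal
    return F.cross_entropy(logits, labels)
```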
arXiv Detail & Related papers (2021-10-20T14:45:13Z)
- Deep Collaborative Multi-Modal Learning for Unsupervised Kinship Estimation [53.62256887837659]
Kinship verification is a long-standing research challenge in computer vision.
We propose a novel deep collaborative multi-modal learning (DCML) method to integrate the underlying information present in facial properties.
Our DCML method consistently outperforms state-of-the-art kinship verification methods.
arXiv Detail & Related papers (2021-09-07T01:34:51Z)
- Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [34.32609892928909]
We propose a novel attention mechanism which we call external attention, based on two external, small, learnable, and shared memories.
Our method provides comparable or superior performance to the self-attention mechanism and some of its variants, with much lower computational and memory costs.
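The mechanism is compact enough to sketch directly, following the double normalization the paper describes; the memory size here is illustrative.

```python
# External attention: two small learnable memories replace self-attention's
# keys/values, with double normalization. Memory size is illustrative.
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    def __init__(self, dim, mem_size=64):
        super().__init__()
        self.mk = nn.Linear(dim, mem_size, bias=False)   # external key memory
        self.mv = nn.Linear(mem_size, dim, bias=False)   # external value memory

    def forward(self, x):                  # x: (batch, tokens, dim)
        attn = self.mk(x)                  # (batch, tokens, mem_size)
        attn = attn.softmax(dim=1)         # normalize over tokens ...
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # ... then memory
        return self.mv(attn)               # cost is linear in the token count
```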
arXiv Detail & Related papers (2021-05-05T22:29:52Z)