LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation
- URL: http://arxiv.org/abs/2404.13558v3
- Date: Sun, 23 Mar 2025 08:01:44 GMT
- Title: LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation
- Authors: Haoyu Zheng, Wenqiao Zhang, Yaoke Wang, Juncheng Li, Zheqi Lv, Xin Min, Mengze Li, Dongping Zhang, Siliang Tang, Yueting Zhuang
- Abstract summary: LASER is a tuning-free LLM-driven attention control framework. We propose a Text-conditioned Image-to-Animation Benchmark to validate the effectiveness and efficacy of LASER.
- Score: 52.16008431411513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Revolutionary advancements in text-to-image models have unlocked new dimensions for sophisticated content creation, such as text-conditioned image editing, enabling the modification of existing images based on textual guidance. This capability allows for the generation of diverse images that convey highly complex visual concepts. However, existing methods primarily focus on generating new images from text-image pairs and struggle to produce fine-grained animations from existing images and textual guidance without fine-tuning. In this paper, we introduce LASER, a tuning-free LLM-driven attention control framework that follows a progressive process: LLM planning, feature-attention injection, and stable animation generation. LASER leverages a large language model (LLM) to refine general descriptions into fine-grained prompts, guiding pre-trained text-to-image models to generate aligned keyframes with subtle variations. The LLM also generates control signals for feature and attention injections, enabling seamless text-guided image morphing for various transformations without additional fine-tuning. By using the same initial noise inversion from the input image, LASER receives LLM-controlled injections during denoising and leverages interpolated text embeddings to produce a series of coherent animation frames. We propose a Text-conditioned Image-to-Animation Benchmark to validate the effectiveness and efficacy of LASER. Extensive experiments demonstrate that LASER achieves impressive results in consistent and efficient animation generation, establishing it as a powerful tool for producing detailed animations and opening new avenues in digital content creation.
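The abstract describes a concrete three-stage recipe: invert the input image once, interpolate between the text embeddings of the LLM-refined source and target prompts, and denoise every frame from the same initial noise under LLM-chosen feature and attention injections. The sketch below illustrates that loop under stated assumptions: `ddim_invert`, `encode_prompt`, and `denoise` are hypothetical stand-ins for a pre-trained text-to-image model's inversion, text encoder, and injection-aware sampler, and the fixed injection cutoff is illustrative rather than the paper's LLM-generated control signals.

```python
import torch

def lerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    # Linearly interpolate between two prompt embeddings.
    return (1.0 - t) * a + t * b

def animate(image, src_prompt: str, dst_prompt: str, n_frames: int = 8):
    # Invert the input image once so every frame denoises from the
    # same initial noise (hypothetical DDIM-inversion helper).
    z_T = ddim_invert(image, src_prompt)

    # Embeddings of the LLM-refined source/target prompts
    # (hypothetical text-encoder helper).
    e_src = encode_prompt(src_prompt)
    e_dst = encode_prompt(dst_prompt)

    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)
        # The interpolated text embedding steers this frame's content.
        e_t = lerp(e_src, e_dst, t)
        # `inject_until` mimics feature/attention injection from the
        # source pass during early denoising steps; in LASER this
        # schedule comes from the LLM's control signals.
        frames.append(denoise(z_T, e_t, inject_until=0.5))
    return frames
```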
Related papers
- ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement [68.05833403672274]
Existing unified models have struggled to handle three fundamental capabilities in a single model: understanding, generation, and editing.
ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves fine-grained textures and text-aligned semantics.
We also employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution.
arXiv Detail & Related papers (2025-04-02T17:45:00Z)
- Dynamic Typography: Bringing Text to Life via Video Diffusion Prior [73.72522617586593]
We present an automated text animation scheme, termed "Dynamic Typography".
It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts.
Our technique harnesses vector graphics representations and an end-to-end optimization-based framework.
arXiv Detail & Related papers (2024-04-17T17:59:55Z)
- Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex [4.57590454144072]
Recently, there has been a surge in the popularity of pre-trained large language models (LLMs).
This paper proposes a new multi-modal training paradigm that aligns with LLMs to encode fMRI activity in the visual cortex.
arXiv Detail & Related papers (2024-01-08T12:30:23Z)
- Self-correcting LLM-controlled Diffusion Models [83.26605445217334]
We introduce Self-correcting LLM-controlled Diffusion (SLD).
SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image.
Our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships.
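A minimal sketch of the generate-assess-correct loop this summary describes, assuming three hypothetical placeholders: `generate` for the diffusion model, `find_mismatches` for the LLM-driven checker, and `apply_corrections` for the targeted edits that avoid regenerating from scratch.

```python
def self_correcting_generation(prompt: str, max_rounds: int = 3):
    # Initial generation from the input prompt (hypothetical helper).
    image = generate(prompt)
    for _ in range(max_rounds):
        # Ask the LLM-driven checker for prompt/image mismatches:
        # wrong object counts, attributes, or spatial relations.
        errors = find_mismatches(image, prompt)
        if not errors:
            break  # the image already matches the prompt
        # Correct only the flagged regions instead of resampling
        # the whole image (hypothetical editing operators).
        image = apply_corrections(image, errors)
    return image
```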
arXiv Detail & Related papers (2023-11-27T18:56:37Z)
- FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax [72.89879499617858]
FlowZero is a framework that combines Large Language Models (LLMs) with image diffusion models to generate temporally coherent videos.
FlowZero improves zero-shot video synthesis, generating coherent videos with vivid motion.
arXiv Detail & Related papers (2023-11-27T13:39:44Z)
- Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator [59.589919015669274]
This study focuses on zero-shot text-to-video generation with data and cost efficiency in mind.
We propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantically coherent prompt sequence.
We also propose a series of annotative modifications to adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation.
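As a rough illustration of the joint noise sampling idea, the sketch below correlates the frames' initial latents by mixing one shared Gaussian with per-frame noise; the mixing rule and the `alpha` weight are assumptions made for illustration, not Free-Bloom's exact formulation.

```python
import torch

def joint_noise(n_frames: int, shape: tuple, alpha: float = 0.8) -> torch.Tensor:
    # One noise tensor shared by all frames keeps them correlated.
    shared = torch.randn(shape)
    noises = []
    for _ in range(n_frames):
        eps = torch.randn(shape)  # frame-specific component
        # The sqrt weights keep each latent unit-variance Gaussian.
        noises.append(alpha ** 0.5 * shared + (1 - alpha) ** 0.5 * eps)
    return torch.stack(noises)

# Example: 8 correlated initial latents for a 4x64x64 latent space.
latents = joint_noise(8, (4, 64, 64))
```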
arXiv Detail & Related papers (2023-09-25T19:42:16Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
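Compressed to its essence, the plug-and-play recipe records the source image's internal diffusion features and re-injects them while denoising under the target text. Both helpers below, `record_features` and `denoise_with_features`, are hypothetical wrappers around a pre-trained diffusion model, and the injection cutoff is illustrative.

```python
def translate(image, target_prompt: str, inject_until: float = 0.8):
    # Invert the source image and cache its spatial features and
    # self-attention maps along the diffusion trajectory
    # (hypothetical helper).
    z_T, source_feats = record_features(image)
    # Early steps reuse the cached features, preserving the source
    # layout; later steps follow the target prompt freely
    # (hypothetical injection-aware sampler).
    return denoise_with_features(
        z_T, target_prompt,
        features=source_feats,
        inject_until=inject_until,
    )
```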
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- Language Models Can See: Plugging Visual Controls in Text Generation [48.05127160095048]
We propose a training-free framework, called MAGIC, for plugging in visual controls in the generation process.
MAGIC is a plug-and-play framework that directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation.
On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup.
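Because MAGIC only plugs together off-the-shelf checkpoints, one greedy decoding step can be sketched directly with Hugging Face models; the additive log-space combination and the `beta` weight below are illustrative simplifications of the paper's actual scoring, which also includes a degeneration penalty.

```python
import torch
from transformers import (CLIPModel, CLIPProcessor,
                          GPT2LMHeadModel, GPT2Tokenizer)

# Off-the-shelf components, as in the paper: a plain LM plus CLIP.
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm_tok = GPT2Tokenizer.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def magic_step(prefix: str, image, k: int = 5, beta: float = 2.0) -> str:
    # Language-model log-probabilities for the next token.
    ids = lm_tok(prefix, return_tensors="pt").input_ids
    logp = lm(ids).logits[0, -1].log_softmax(-1)
    top = logp.topk(k)
    # Candidate continuations, one per top-k token.
    texts = [prefix + lm_tok.decode(int(i)) for i in top.indices]
    # CLIP similarity between the image and each candidate text.
    inputs = clip_proc(text=texts, images=image,
                       return_tensors="pt", padding=True)
    sim = clip(**inputs).logits_per_image[0]
    # Re-rank: LM confidence plus image grounding.
    best = (top.values + beta * sim.log_softmax(-1)).argmax()
    return texts[int(best)]
```

Called in a loop from a short prefix with a PIL image, this re-ranking biases decoding toward tokens CLIP judges as visually grounded, with no training of either model.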
arXiv Detail & Related papers (2022-05-05T13:56:18Z)