Related papers: Text-Animator: Controllable Visual Text Video Generation

Text-Animator: Controllable Visual Text Video Generation

URL: http://arxiv.org/abs/2406.17777v1
Date: Tue, 25 Jun 2024 17:59:41 GMT
Title: Text-Animator: Controllable Visual Text Video Generation
Authors: Lin Liu, Quande Liu, Shengju Qian, Yuan Zhou, Wengang Zhou, Houqiang Li, Lingxi Xie, Qi Tian,
Abstract summary: We propose an innovative approach termed Text-Animator for visual text video generation. Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos. We also develop a camera control module and a text refinement module to improve the stability of generated visual text.
Score: 149.940821790235
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect within T2V is the effective visualization of text within generated videos. Despite the progress achieved in Text-to-Video~(T2V) generation, current methods still cannot effectively visualize texts in videos directly, as they mainly focus on summarizing semantic scene information, understanding, and depicting actions. While recent advances in image-level visual text generation show promise, transitioning these techniques into the video domain faces problems, notably in preserving textual fidelity and motion coherence. In this paper, we propose an innovative approach termed Text-Animator for visual text video generation. Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos. Besides, we develop a camera control module and a text refinement module to improve the stability of generated visual text by controlling the camera movement as well as the motion of visualized text. Quantitative and qualitative experimental results demonstrate the superiority of our approach to the accuracy of generated visual text over state-of-the-art video generation methods. The project page can be found at https://laulampaul.github.io/text-animator.html.

Related papers

T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models [12.120541052871486]
T2VTextBench is the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models.<n>We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text.
arXiv Detail & Related papers (2025-05-08T04:49:52Z)
Generating Animated Layouts as Structured Text Representations [5.4694418946955]
We introduce Animated Layout Generation to extend static graphic layouts with temporal dynamics.<n>We propose a Structured Text Representation for fine-grained video control through hierarchical visual elements.<n>We present VAKER, a text-to-video advertisement generation pipeline.
arXiv Detail & Related papers (2025-05-02T03:37:09Z)
TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes [30.947323186487885]
Complex Visual Text Generation (CVTG) centers on generating intricate textual content distributed across diverse regions within visual images. InG, image generation models often rendering distorted and blurred visual text or missing some visual text. We propose TextCrafter, a novel multi-visual text rendering method.
arXiv Detail & Related papers (2025-03-30T14:36:55Z)
VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation [48.318567065609216]
VAST (Video As Storyboard from Text) is a framework to generate high-quality videos from textual descriptions. By decoupling text understanding from video generation, VAST enables precise control over subject dynamics and scene composition. Experiments on the VBench benchmark demonstrate that VAST outperforms existing methods in both visual quality and semantic expression.
arXiv Detail & Related papers (2024-12-21T15:59:07Z)
Dynamic Typography: Bringing Text to Life via Video Diffusion Prior [73.72522617586593]
We present an automated text animation scheme, termed "Dynamic Typography" It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. Our technique harnesses vector graphics representations and an end-to-end optimization-based framework.
arXiv Detail & Related papers (2024-04-17T17:59:55Z)
LivePhoto: Real Image Animation with Text-guided Motion Control [51.31418077586208]
This work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions. We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input. We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions.
arXiv Detail & Related papers (2023-12-05T17:59:52Z)
Text2Performer: Text-Driven Human Video Generation [97.3849869893433]
Text-driven content creation has evolved to be a transformative technique that revolutionizes creativity. Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer. In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts.
arXiv Detail & Related papers (2023-04-17T17:59:02Z)
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods. Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V) We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.