PTTA: A Pure Text-to-Animation Framework for High-Quality Creation
- URL: http://arxiv.org/abs/2512.18614v1
- Date: Sun, 21 Dec 2025 06:17:28 GMT
- Title: PTTA: A Pure Text-to-Animation Framework for High-Quality Creation
- Authors: Ruiqi Chen, Kaitong Cai, Yijia Fan, Keze Wang
- Abstract summary: We present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building upon the pretrained text-to-video model HunyuanVideo, we perform fine-tuning to adapt it to animation-style generation.
- Score: 11.264791177658203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional animation production involves complex pipelines and significant manual labor cost. While recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis, they exhibit notable limitations when applied to animation generation. Recent efforts, such as AniSora, demonstrate promising performance by fine-tuning image-to-video models for animation styles, yet analogous exploration in the text-to-video setting remains limited. In this work, we present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building upon the pretrained text-to-video model HunyuanVideo, we perform fine-tuning to adapt it to animation-style generation. Extensive visual evaluations across multiple dimensions show that the proposed approach consistently outperforms comparable baselines in animation video synthesis.
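The abstract only names the ingredients (a small paired animation dataset and fine-tuning of a pretrained text-to-video backbone), so the sketch below is a minimal, hypothetical illustration of such an adaptation loop. None of the objects, names, or hyperparameters come from the paper: `model` is assumed to be a pretrained text-to-video denoiser (e.g., a HunyuanVideo-style DiT) with only adapter/LoRA parameters left trainable, `dataset` is assumed to yield (video_latents, caption) pairs, and a rectified-flow objective is assumed for the loss.

```python
# Hypothetical sketch, not the authors' code: adapter-only fine-tuning of a
# pretrained text-to-video denoiser on a small animation dataset.
from itertools import cycle, islice

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader


def finetune_on_animation(model, text_encoder, dataset, steps=10_000, lr=1e-4):
    trainable = [p for p in model.parameters() if p.requires_grad]  # adapters only
    opt = torch.optim.AdamW(trainable, lr=lr)
    loader = DataLoader(dataset, batch_size=1, shuffle=True)

    model.train()
    for step, (latents, captions) in enumerate(islice(cycle(loader), steps)):
        with torch.no_grad():
            text_emb = text_encoder(captions)            # caption conditioning

        noise = torch.randn_like(latents)                # x_1 ~ N(0, I)
        t = torch.rand(latents.shape[0], device=latents.device)
        t_ = t.view(-1, 1, 1, 1, 1)                      # broadcast over (C, F, H, W)
        x_t = (1.0 - t_) * latents + t_ * noise          # linear interpolation path

        target = noise - latents                         # rectified-flow velocity target
        pred = model(x_t, t, text_emb)                   # predicted velocity
        loss = F.mse_loss(pred, target)

        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

The loop assumes video latents are precomputed with the backbone's VAE; with a small, curated dataset, freezing the backbone and training only adapters is a common way to shift style toward animation without discarding the pretrained motion prior.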
Related papers
- MVAnimate: Enhancing Character Animation with Multi-View Optimization [55.4217617472079]
We introduce MVAnimate, a novel framework that synthesizes both 2D and 3D information of dynamic figures based on multi-view prior information. Our approach leverages multi-view prior information to produce temporally consistent and spatially coherent animation outputs.
arXiv Detail & Related papers (2026-02-09T14:55:21Z) - Animate-X++: Universal Character Image Animation with Dynamic Backgrounds [32.04255747303296]
Animate-X++ is a universal animation framework based on DiT for various character types, including anthropomorphic characters. To enhance motion representation, we introduce the Pose Indicator, which captures comprehensive motion patterns from the driving video in both implicit and explicit manners. For the second challenge, we introduce a multi-task training strategy that jointly trains the animation and TI2V tasks.
arXiv Detail & Related papers (2025-08-13T03:11:28Z) - DreamDance: Animating Character Art via Inpainting Stable Gaussian Worlds [64.53681498600065]
DreamDance is an animation framework capable of producing stable, consistent character and scene motion conditioned on precise camera trajectories. We train a pose-aware video inpainting model that injects the dynamic character into the scene video while enhancing background quality.
arXiv Detail & Related papers (2025-05-30T15:54:34Z) - PhysAnimator: Physics-Guided Generative Cartoon Animation [19.124321553546242]
PhysAnimator is a novel approach for generating anime-stylized animation from static anime illustrations. To capture the fluidity and exaggeration characteristic of anime, we perform image-space deformable body simulations on extracted mesh geometries. We extract and warp sketches from the simulation sequence, generating a texture-agnostic representation, and employ a sketch-guided video diffusion model to synthesize high-quality animation frames.
arXiv Detail & Related papers (2025-01-27T22:48:36Z) - AniSora: Exploring the Frontiers of Animation Video Generation in the Sora Era [20.670217061810614]
We present a comprehensive system, AniSora, designed for animation video generation. It is supported by a data processing pipeline with over 10M high-quality data samples. We also collect an evaluation benchmark of various animation videos, with specifically developed metrics for animation video generation.
arXiv Detail & Related papers (2024-12-13T16:24:58Z) - UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [53.16986875759286]
We present a UniAnimate framework to enable efficient and long-term human video generation.
We map the reference image along with the posture guidance and noise video into a common feature space.
We also propose a unified noise input that supports random noised input as well as first frame conditioned input.
arXiv Detail & Related papers (2024-06-03T10:51:10Z) - AnimateZero: Video Diffusion Models are Zero-Shot Image Animators [63.938509879469024]
We propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff.
For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation.
For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention (a generic windowed temporal-attention sketch appears after this list).
arXiv Detail & Related papers (2023-12-06T13:39:35Z) - Regenerating Arbitrary Video Sequences with Distillation Path-Finding [6.687073794084539]
This paper presents an interactive framework to generate new sequences according to the users' preference on the starting frame.
To achieve this effectively, we first learn the feature correlation on the frameset of the given video through a proposed network called RSFNet.
Then, we develop a novel path-finding algorithm, SDPF, which leverages knowledge of the motion directions in the source video to estimate smooth and plausible sequences.
arXiv Detail & Related papers (2023-11-13T09:05:30Z) - DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors [63.43133768897087]
We propose a method to convert open-domain images into animated videos.
The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance.
Our proposed method can produce visually convincing and more logical & natural motions, as well as higher conformity to the input image.
arXiv Detail & Related papers (2023-10-18T14:42:16Z) - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z) - Deep Animation Video Interpolation in the Wild [115.24454577119432]
In this work, we formally define and study the animation video interpolation problem for the first time.
We propose an effective framework, AnimeInterp, with two dedicated modules in a coarse-to-fine manner.
Notably, AnimeInterp shows favorable perceptual quality and robustness for animation scenarios in the wild.
arXiv Detail & Related papers (2021-04-06T13:26:49Z) - Going beyond Free Viewpoint: Creating Animatable Volumetric Video of Human Performances [7.7824496657259665]
We present an end-to-end pipeline for the creation of high-quality animatable volumetric video content of human performances.
Semantic enrichment and geometric animation ability are achieved by establishing temporal consistency in the 3D data.
For pose editing, we exploit the captured data as much as possible and kinematically deform the captured frames to fit a desired pose.
arXiv Detail & Related papers (2020-09-02T09:46:12Z)
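Several of the listed works adapt the temporal attention of a pretrained text-to-video backbone. Purely as an illustration (not code from any of these papers), the sketch below shows a generic windowed temporal self-attention over video latents, loosely in the spirit of the AnimateZero entry above; the positional correction and exact windowing rules from that paper are not reproduced, and all shapes and names are assumptions.

```python
# Illustrative sketch only: generic windowed temporal self-attention over video latents.
import torch
import torch.nn.functional as F


def windowed_temporal_attention(x, window=4):
    """x: latents of shape (batch, frames, tokens, dim). Each frame attends only to
    frames in its causal window [t - window + 1, t] instead of the whole clip."""
    b, f, n, d = x.shape
    qkv = x.permute(0, 2, 1, 3).reshape(b * n, f, d)   # fold spatial tokens into batch

    # Frame i may attend to frame j only if i - window < j <= i.
    idx = torch.arange(f, device=x.device)
    allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    mask = torch.zeros(f, f, device=x.device).masked_fill(~allowed, float("-inf"))

    out = F.scaled_dot_product_attention(qkv, qkv, qkv, attn_mask=mask)
    return out.reshape(b, n, f, d).permute(0, 2, 1, 3)
```

Restricting attention to a short temporal window bounds the per-frame cost and limits how far appearance can drift across the clip, which is the kind of trade-off these temporal-attention modifications target.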