MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation
- URL: http://arxiv.org/abs/2507.20368v1
- Date: Sun, 27 Jul 2025 17:53:00 GMT
- Title: MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation
- Authors: Shuolin Xu, Bingyuan Wang, Zeyu Cai, Fangteng Fu, Yue Ma, Tongyi Lee, Hongchuan Yu, Zeyu Wang
- Abstract summary: Generating cartoon animation under multimodal control is challenging due to the complexity of non-human characters, stylistically diverse motions, and fine-grained emotions. We propose the MagicAnime dataset, a large-scale, hierarchically annotated, and multimodal dataset designed to support multiple video generation tasks. We also build a set of multi-modal cartoon animation benchmarks, called MagicAnime-Bench, to support comparisons of different methods on these tasks.
- Score: 2.700983545680755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating high-quality cartoon animations under multimodal control is challenging due to the complexity of non-human characters, stylistically diverse motions, and fine-grained emotions. There is a huge domain gap between real-world videos and cartoon animation, as cartoon animation is usually abstract and has exaggerated motion. Meanwhile, public multimodal cartoon data are extremely scarce because large-scale automatic annotation is far harder than in real-life scenarios. To bridge this gap, we propose the MagicAnime dataset, a large-scale, hierarchically annotated, and multimodal dataset designed to support multiple video generation tasks, along with the benchmarks it includes. The dataset contains 400k video clips for image-to-video generation; 50k pairs of video clips and keypoints for whole-body annotation; 12k pairs of video clips for video-to-video face animation; and 2.9k pairs of video and audio clips for audio-driven face animation. We also build a set of multi-modal cartoon animation benchmarks, called MagicAnime-Bench, to support comparisons of different methods on the tasks above. Comprehensive experiments on four tasks, including video-driven face animation, audio-driven face animation, image-to-video animation, and pose-driven character animation, validate its effectiveness in supporting high-fidelity, fine-grained, and controllable generation.
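To make the dataset composition above concrete, here is a minimal, hypothetical loading sketch in Python. The subset names, the manifest layout (one JSON-lines file per subset), and the load_manifest helper are illustrative assumptions rather than part of the actual MagicAnime release; only the four subset sizes come from the abstract.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Hypothetical subset layout inferred from the abstract; the real
# MagicAnime release may organize its files differently.
SUBSETS = {
    "image_to_video": 400_000,        # video clips
    "whole_body_keypoints": 50_000,   # (video clip, keypoints) pairs
    "face_video_to_video": 12_000,    # (driving video, target video) pairs
    "face_audio_driven": 2_900,       # (video clip, audio clip) pairs
}

@dataclass
class Sample:
    subset: str
    clip_path: str
    condition_path: Optional[str]  # keypoints, driving video, or audio

def load_manifest(root: Path, subset: str) -> list[Sample]:
    """Read one subset from a hypothetical JSON-lines manifest."""
    samples = []
    with open(root / f"{subset}.jsonl") as f:
        for line in f:
            record = json.loads(line)
            samples.append(
                Sample(subset, record["clip"], record.get("condition"))
            )
    return samples

if __name__ == "__main__":
    for name, size in SUBSETS.items():
        print(f"{name}: ~{size:,} items (per the abstract)")
```

Each of the four MagicAnime-Bench tasks would then draw paired clips and conditioning signals from the corresponding subset.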
Related papers
- AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation [52.655400705690155]
AnimeShooter is a reference-guided multi-shot animation dataset. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images. Shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions. A separate subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources.
arXiv Detail & Related papers (2025-06-03T17:55:18Z) - Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models [71.78723353724493]
Animation of humanoid characters is essential in various graphics applications. We propose an approach to synthesize 4D animated sequences of input static 3D humanoid meshes.
arXiv Detail & Related papers (2025-03-20T10:00:22Z) - Learning to Animate Images from A Few Videos to Portray Delicate Human Actions [80.61838364885482]
Video generative models still struggle to animate static images into videos that portray delicate human actions. In this paper, we explore the task of learning to animate images to portray delicate human actions using a small number of videos. We propose FLASH, which learns generalizable motion patterns by forcing the model to reconstruct a video using the motion features and cross-frame correspondences of another video.
arXiv Detail & Related papers (2025-03-01T01:09:45Z) - AniSora: Exploring the Frontiers of Animation Video Generation in the Sora Era [20.670217061810614]
We present a comprehensive system, AniSora, designed for animation video generation, supported by a data processing pipeline with over 10M high-quality samples. We also collect an evaluation benchmark of various animation videos, with specifically developed metrics for animation video generation.
arXiv Detail & Related papers (2024-12-13T16:24:58Z) - Towards Multiple Character Image Animation Through Enhancing Implicit Decoupling [77.08568533331206]
We propose a novel multi-condition guided framework for character image animation. We employ several well-designed input modules to enhance the implicit decoupling capability of the model. Our method excels in generating high-quality character animations, especially in scenarios with complex backgrounds and multiple characters.
arXiv Detail & Related papers (2024-06-05T08:03:18Z) - AnimateZoo: Zero-shot Video Generation of Cross-Species Animation via Subject Alignment [64.02822911038848]
We present AnimateZoo, a zero-shot diffusion-based video generator to produce animal animations.
The key technique used in AnimateZoo is subject alignment, which includes two steps.
Our model is capable of generating videos characterized by accurate movements, consistent appearance, and high-fidelity frames.
arXiv Detail & Related papers (2024-04-07T12:57:41Z) - AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies [98.65469430034246]
Existing datasets for two-dimensional (2D) cartoons suffer from simple frame composition and monotonic movements.
We present a new 2D animation visual correspondence dataset, AnimeRun, by converting open source 3D movies to full scenes in 2D style.
Our analyses show that the proposed dataset not only resembles real anime more in image composition, but also possesses richer and more complex motion patterns compared to existing datasets.
arXiv Detail & Related papers (2022-11-10T17:26:21Z) - SketchBetween: Video-to-Video Synthesis for Sprite Animation via Sketches [0.9645196221785693]
2D animation is a common factor in game development, used for characters, effects and background art.
Automated animation approaches exist, but are designed without animators in mind.
We propose a problem formulation that adheres more closely to the standard workflow of animation.
arXiv Detail & Related papers (2022-09-01T02:43:19Z) - CAST: Character labeling in Animation using Self-supervision by Tracking [6.57697269659615]
Videos in the cartoon and animation domain have very different characteristics from real-life images and videos.
Current computer vision and deep-learning solutions often fail on animated content because they were trained on natural images.
We present a method to refine a semantic representation suitable for specific animated content.
arXiv Detail & Related papers (2022-01-19T14:21:43Z) - Deep Animation Video Interpolation in the Wild [115.24454577119432]
In this work, we formally define and study the animation video interpolation problem for the first time.
We propose an effective framework, AnimeInterp, with two dedicated modules in a coarse-to-fine manner.
Notably, AnimeInterp shows favorable perceptual quality and robustness for animation scenarios in the wild.
arXiv Detail & Related papers (2021-04-06T13:26:49Z)