Aligning Anime Video Generation with Human Feedback
- URL: http://arxiv.org/abs/2504.10044v1
- Date: Mon, 14 Apr 2025 09:49:34 GMT
- Title: Aligning Anime Video Generation with Human Feedback
- Authors: Bingwen Zhu, Yudong Jiang, Baohan Xu, Siqian Yang, Mingyu Yin, Yidi Wu, Huyang Sun, Zuxuan Wu
- Abstract summary: Anime video generation faces significant challenges due to the scarcity of anime data and unusual motion patterns. Existing reward models, designed primarily for real-world videos, fail to capture the unique appearance and consistency requirements of anime. We propose a pipeline to enhance anime video generation by leveraging human feedback for better alignment.
- Score: 31.701968335565393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Anime video generation faces significant challenges due to the scarcity of anime data and unusual motion patterns, leading to issues such as motion distortion and flickering artifacts, which result in misalignment with human preferences. Existing reward models, designed primarily for real-world videos, fail to capture the unique appearance and consistency requirements of anime. In this work, we propose a pipeline to enhance anime video generation by leveraging human feedback for better alignment. Specifically, we construct the first multi-dimensional reward dataset for anime videos, comprising 30k human-annotated samples that incorporate human preferences for both visual appearance and visual consistency. Based on this, we develop AnimeReward, a powerful reward model that employs specialized vision-language models for different evaluation dimensions to guide preference alignment. Furthermore, we introduce Gap-Aware Preference Optimization (GAPO), a novel training method that explicitly incorporates preference gaps into the optimization process, enhancing alignment performance and efficiency. Extensive experimental results show that AnimeReward outperforms existing reward models, and the inclusion of GAPO leads to superior alignment in both quantitative benchmarks and human evaluations, demonstrating the effectiveness of our pipeline in enhancing anime video quality. Our dataset and code will be publicly available.
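The snippet below is a minimal, hypothetical sketch of how a gap-aware preference loss could look: a DPO-style objective whose per-pair contribution is weighted by the reward gap between the preferred and rejected videos. The function name, the sigmoid form of the gap weighting, and the default beta are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a gap-aware DPO-style loss (not the paper's code).
# Idea: scale the standard DPO objective for each preference pair by the
# reward-model gap between the preferred ("win") and rejected ("lose") videos,
# so pairs with larger preference gaps drive larger updates.
import torch
import torch.nn.functional as F

def gap_aware_dpo_loss(policy_logp_win, policy_logp_lose,
                       ref_logp_win, ref_logp_lose,
                       reward_win, reward_lose, beta=0.1):
    """All inputs are per-sample tensors of shape (batch,)."""
    # Standard DPO implicit reward margin between policy and reference model.
    logits = beta * ((policy_logp_win - ref_logp_win)
                     - (policy_logp_lose - ref_logp_lose))
    # Preference gap from the reward model (e.g., AnimeReward-style scores),
    # squashed to (0, 1); the exact weighting form is an assumption.
    gap = torch.sigmoid(reward_win - reward_lose)
    # Gap-weighted negative log-sigmoid preference loss.
    return -(gap * F.logsigmoid(logits)).mean()
```

Under this sketch, a pair whose preferred video scores far above its rejected counterpart contributes close to the full DPO loss, while near-tied pairs are down-weighted.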
Related papers
- AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation [52.655400705690155]
AnimeShooter is a reference-guided multi-shot animation dataset. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images. Shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions. A separate subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources.
arXiv Detail & Related papers (2025-06-03T17:55:18Z) - Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search [23.3627657867351]
The alignment problem has attracted huge attention, where we steer the output of diffusion models based on some quantity measuring the goodness of the content. In this paper, we propose diffusion latent beam search with a lookahead estimator, which can select a better diffusion latent to maximize a given alignment reward. We demonstrate that our method improves the perceptual quality evaluated on the calibrated reward, VLMs, and human assessment, without updating model parameters.
arXiv Detail & Related papers (2025-01-31T16:09:30Z) - Improving Video Generation with Human Feedback [81.48120703718774]
Video generation has achieved significant advances, but issues like unsmooth motion and misalignment between videos and prompts persist. We develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. We introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy.
arXiv Detail & Related papers (2025-01-23T18:55:41Z) - Multi-subject Open-set Personalization in Video Generation [110.02124633005516]
We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities.
Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt.
Our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-01-10T18:59:54Z) - VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation [70.68566282567207]
We present VisionReward, a framework for learning human visual preferences in both image and video generation. VisionReward can significantly outperform existing image and video reward models on both machine metrics and human evaluation.
arXiv Detail & Related papers (2024-12-30T16:24:09Z) - VideoDPO: Omni-Preference Alignment for Video Diffusion Generation [48.36302380755874]
Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation.
We propose a VideoDPO pipeline by making several key adjustments.
Our experiments demonstrate substantial improvements in both visual quality and semantic alignment.
arXiv Detail & Related papers (2024-12-18T18:59:49Z) - SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner [46.75063691424628]
We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR is capable of generating videos for the subject and aligning the generation with arbitrary visual attributes.
arXiv Detail & Related papers (2024-12-13T20:01:51Z) - VideoSAVi: Self-Aligned Video Language Models without Human Supervision [0.6854849895338531]
VideoSAVi is a self-training pipeline that enables Video-LLMs to reason over video content without external supervision. VideoSAVi achieves state-of-the-art performance on MVBench (74.0%) and delivers significant improvements. Our model-agnostic approach is computationally efficient, requiring only 32 frames.
arXiv Detail & Related papers (2024-12-01T00:33:05Z) - OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation [27.516068877910254]
We introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs.
arXiv Detail & Related papers (2024-11-28T07:01:06Z) - GaussianStyle: Gaussian Head Avatar via StyleGAN [64.85782838199427]
We propose a novel framework that integrates the volumetric strengths of 3DGS with the powerful implicit representation of StyleGAN.
We show that our method achieves state-of-the-art performance in reenactment, novel view synthesis, and animation.
arXiv Detail & Related papers (2024-02-01T18:14:42Z) - InstructVideo: Instructing Video Diffusion Models with Human Feedback [65.9590462317474]
We propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning.
InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing.
arXiv Detail & Related papers (2023-12-19T17:55:16Z) - Learning Data-Driven Vector-Quantized Degradation Model for Animation Video Super-Resolution [59.71387128485845]
We explore the characteristics of animation videos and leverage the rich priors in real-world animation data for a more practical animation VSR model.
We propose a multi-scale Vector-Quantized Degradation model for animation video Super-Resolution (VQD-SR) to decompose the local details from global structures.
A rich-content Real Animation Low-quality (RAL) video dataset is collected for extracting the priors.
arXiv Detail & Related papers (2023-03-17T08:11:14Z)