FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation
- URL: http://arxiv.org/abs/2603.00159v1
- Date: Wed, 25 Feb 2026 22:08:15 GMT
- Title: FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation
- Authors: Weiting Tan, Andy T. Liu, Ming Tu, Xinghua Qu, Philipp Koehn, Lu Lu
- Abstract summary: FlowPortrait is a reinforcement-learning framework for audio-driven portrait animation. It produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
- Score: 23.08428760363473
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Generating realistic talking-head videos remains challenging due to persistent issues such as imperfect lip synchronization, unnatural motion, and evaluation metrics that correlate poorly with human perception. We propose FlowPortrait, a reinforcement-learning framework for audio-driven portrait animation built on a multimodal backbone for autoregressive audio-to-video generation. FlowPortrait introduces a human-aligned evaluation system based on Multimodal Large Language Models (MLLMs) to assess lip-sync accuracy, expressiveness, and motion quality. These signals are combined with perceptual and temporal consistency regularizers to form a stable composite reward, which is used to post-train the generator via Group Relative Policy Optimization (GRPO). Extensive experiments, including both automatic evaluations and human preference studies, demonstrate that FlowPortrait consistently produces higher-quality talking-head videos, highlighting the effectiveness of reinforcement learning for portrait animation.
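To make the training recipe concrete, below is a minimal sketch of the composite-reward GRPO step the abstract describes. It is not the paper's released code: the reward hooks, their weights, and the function names (`mllm_score`, `perceptual_score`, `temporal_score`) are illustrative assumptions; only the group-relative normalization itself is standard GRPO.

```python
import numpy as np

# Stub reward hooks. FlowPortrait's actual reward models are not public, so
# these only illustrate the interface: each maps a sampled video (and its
# driving audio) to a scalar score.
def mllm_score(video, audio):
    """MLLM judge of lip-sync accuracy, expressiveness, and motion quality."""
    return 0.0  # placeholder

def perceptual_score(video):
    """Perceptual-quality regularizer."""
    return 0.0  # placeholder

def temporal_score(video):
    """Temporal-consistency regularizer."""
    return 0.0  # placeholder

def composite_reward(video, audio, w=(1.0, 0.3, 0.3)):
    """Composite reward as a weighted sum; the weights are illustrative."""
    return (w[0] * mllm_score(video, audio)
            + w[1] * perceptual_score(video)
            + w[2] * temporal_score(video))

def grpo_advantages(group_rewards, eps=1e-8):
    """Core GRPO step: normalize rewards within the group of videos sampled
    for the same (portrait, audio) prompt, so no learned value critic is needed."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Per prompt: sample a group of G videos, score each, convert the scores to
# group-relative advantages, and feed them to the clipped policy-gradient update.
# advantages = grpo_advantages([composite_reward(v, audio) for v in videos])
```

Normalizing rewards within a group of rollouts for the same prompt is what lets GRPO dispense with a value critic, which matters here because each "rollout" is an expensive video sample.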
Related papers
- PersonaLive! Expressive Portrait Image Animation for Live Streaming [53.63615310186964]
PersonaLive is a novel diffusion-based framework for streaming real-time portrait animation. We first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Experiments demonstrate that PersonaLive achieves state-of-the-art performance with a 7-22x speedup over prior diffusion-based portrait animation models.
arXiv Detail & Related papers (2025-12-12T03:24:40Z)
- Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback [9.569613635896026]
We propose a diffusion transformer (DiT)-based framework for generating talking videos of arbitrary length. We also introduce a training-free method for multi-character audio-driven animation. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches.
arXiv Detail & Related papers (2025-10-14T02:50:05Z)
- StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing [63.72095377128904]
The visual dubbing task aims to generate mouth movements synchronized with the driving audio. Audio-only driving paradigms inadequately capture speaker-specific lip habits, and blind-inpainting approaches produce visual artifacts when handling obstructions.
arXiv Detail & Related papers (2025-09-26T05:23:31Z)
- Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning [73.7808110878037]
This paper proposes a novel dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer (MDST++). By converting RGB images to events, our method captures motion information more accurately and mitigates background scene biases. Our experiments validate the effectiveness of MDST++, demonstrating its consistent superiority over state-of-the-art methods on mainstream benchmarks.
arXiv Detail & Related papers (2025-05-26T13:06:01Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL uses the answer accuracy of a downstream model as a reward signal to train a frame selector through trial and error. It consistently delivers superior temporal-grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
- AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [71.90109867684025]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans. We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis. AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z)
- FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait [11.670159942656129]
FLOAT is an audio-driven talking portrait video generation method based on a flow-matching generative model (a minimal sketch of the flow-matching objective follows this list). Our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions.
arXiv Detail & Related papers (2024-12-02T02:50:07Z)
- Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z)
- Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation [51.92522679353731]
We propose utilizing an audio-visual speech representation expert (AV-HuBERT) to compute a lip-synchronization loss during training (see the sync-loss sketch after this list).
We also introduce three novel lip-synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip-sync performance.
arXiv Detail & Related papers (2024-05-07T13:55:50Z)
- A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation [16.033455552126348]
We propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN. We train a stack of syncer models on multimodal input pyramids and use these models as guidance in a multi-scale generator network. Experiments show significant improvements over the state-of-the-art in head motion dynamics quality.
arXiv Detail & Related papers (2023-07-04T08:29:59Z)
- Multi Modal Adaptive Normalization for Audio to Video Generation [18.812696623555855]
We propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking-person video of arbitrary length from an audio signal and a single image of a person.
The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor, and class activation map [58] based layers to learn the movements of expressive facial components (see the adaptive-normalization sketch after this list).
arXiv Detail & Related papers (2020-12-14T07:39:45Z)
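For the FLOAT entry above, here is a minimal sketch of a conditional flow-matching objective of the kind its summary refers to. The `model(xt, t, cond)` signature and the choice of a straight-line (rectified-flow) path are assumptions, since the summary does not specify them.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Conditional flow matching on motion latents: regress the model's
    velocity field onto the straight-line velocity (x1 - x0) along the
    interpolation x_t = (1 - t) * x0 + t * x1."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # per-sample time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over latent dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the probability path
    v_pred = model(xt, t, cond)                    # cond: audio (and emotion) features
    return torch.mean((v_pred - (x1 - x0)) ** 2)
```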
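For the AV-HuBERT entry, the expert-based sync loss can be sketched as a distance between time-aligned embeddings from the frozen expert. The cosine-distance form below is a common choice and an assumption, not necessarily that paper's exact formulation.

```python
import torch.nn.functional as F

def expert_sync_loss(audio_emb, video_emb):
    """Lip-sync loss from a frozen audio-visual speech expert: penalize the
    cosine distance between time-aligned audio and lip-region embeddings.
    audio_emb, video_emb: (batch, time, dim) features from the expert."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    return (1.0 - (a * v).sum(dim=-1)).mean()  # mean cosine distance
```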
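For the multi-modal adaptive normalization entry, the layer family it names can be sketched as follows: activations are normalized, then modulated by a scale and shift predicted from a fused audio-image conditioning feature. The class name and the instance-norm choice are assumptions in the AdaIN style; that paper's exact layer may differ.

```python
import torch.nn as nn

class MultimodalAdaptiveNorm(nn.Module):
    """Normalize generator activations, then modulate them with a scale and
    shift predicted from a fused multimodal conditioning vector."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale = nn.Linear(cond_dim, channels)
        self.to_shift = nn.Linear(cond_dim, channels)

    def forward(self, x, cond):  # x: (B, C, H, W); cond: (B, cond_dim) audio+image
        gamma = self.to_scale(cond).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_shift(cond).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta
```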