FREAK: Frequency-modulated High-fidelity and Real-time Audio-driven Talking Portrait Synthesis
- URL: http://arxiv.org/abs/2503.04067v2
- Date: Wed, 23 Apr 2025 10:10:57 GMT
- Title: FREAK: Frequency-modulated High-fidelity and Real-time Audio-driven Talking Portrait Synthesis
- Authors: Ziqi Ni, Ao Fu, Yi Zhou
- Abstract summary: We propose a FREquency-modulated, high-fidelity, and real-time Audio-driven talKing portrait synthesis framework, named FREAK. FREAK models talking portraits from the frequency domain perspective, enhancing the fidelity and naturalness of synthesized portraits. Experiments demonstrate that our method synthesizes high-fidelity talking portraits with detailed facial textures and precise lip synchronization in real time, outperforming state-of-the-art methods.
- Score: 4.03322932416974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving high-fidelity lip-speech synchronization in audio-driven talking portrait synthesis remains challenging. While multi-stage pipelines or diffusion models yield high-quality results, they suffer from high computational costs. Some approaches perform well on specific individuals with low resource requirements, yet still exhibit mismatched lip movements. All of the aforementioned methods operate in the pixel domain. We observed noticeable discrepancies in the frequency domain between synthesized talking videos and natural videos, an aspect that no prior research on talking portrait synthesis has considered. To address this, we propose a FREquency-modulated, high-fidelity, and real-time Audio-driven talKing portrait synthesis framework, named FREAK, which models talking portraits from the frequency domain perspective, enhancing the fidelity and naturalness of the synthesized portraits. FREAK introduces two novel frequency-based modules: 1) the Visual Encoding Frequency Modulator (VEFM), which couples multi-scale visual features in the frequency domain, better preserving visual frequency information and reducing the spectral gap between synthesized and natural frames; and 2) the Audio Visual Frequency Modulator (AVFM), which helps the model learn the talking pattern in the frequency domain and improves audio-visual synchronization. Additionally, we optimize the model jointly in both the pixel domain and the frequency domain. Furthermore, FREAK supports seamless switching between one-shot and video dubbing settings, offering enhanced flexibility. Owing to its efficiency, it can support high-resolution video output and real-time inference simultaneously. Extensive experiments demonstrate that our method synthesizes high-fidelity talking portraits with detailed facial textures and precise lip synchronization in real time, outperforming state-of-the-art methods.
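The abstract states that the model is optimized jointly in the pixel and frequency domains, but the formulation is not given here. A minimal sketch of one plausible combination, assuming PyTorch, an L1 pixel term, and an FFT-spectrum term (the helper name and the `freq_weight` balance are illustrative assumptions, not the paper's actual objective):

```python
import torch
import torch.nn.functional as F

def joint_pixel_frequency_loss(pred, target, freq_weight=1.0):
    """Hypothetical sketch: supervise frames in the pixel domain (L1)
    and in the frequency domain (distance between 2D FFT spectra).
    pred, target: (B, C, H, W) float tensors."""
    # Pixel-domain term: plain L1 distance between frames.
    pixel_loss = F.l1_loss(pred, target)

    # Frequency-domain term: distance between the complex 2D spectra,
    # which directly penalizes the kind of spectral gap noted above.
    pred_spec = torch.fft.fft2(pred, norm="ortho")
    target_spec = torch.fft.fft2(target, norm="ortho")
    freq_loss = torch.mean(torch.abs(pred_spec - target_spec))

    return pixel_loss + freq_weight * freq_loss
```

How the frequency term is weighted against the pixel term is a tuning choice; the paper's actual loss may differ.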
Related papers
- MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z)
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z)
- Learning Spatiotemporal Frequency-Transformer for Low-Quality Video Super-Resolution [47.5883522564362]
Video Super-Resolution (VSR) aims to restore high-resolution (HR) videos from low-resolution (LR) videos.
Existing VSR techniques usually recover HR frames by extracting textures from nearby frames with known degradation processes.
We propose a novel Frequency-Transformer (FTVSR) for handling low-quality videos that carries out self-attention in a combined space-time-frequency domain.
arXiv Detail & Related papers (2022-12-27T16:26:15Z)
- Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition [61.6677901687009]
We propose an efficient NeRF-based framework that enables real-time synthesizing of talking portraits.
Our method can generate realistic and audio-lip synchronized talking portrait videos.
arXiv Detail & Related papers (2022-11-22T16:03:11Z)
- Delving into the Frequency: Temporally Consistent Human Motion Transfer in the Fourier Space [34.353035276767336]
Human motion transfer refers to synthesizing photo-realistic and temporally coherent videos in which one person imitates the motion of another.
Current synthetic videos suffer from temporal inconsistency across sequential frames, which significantly degrades video quality.
We propose a novel Frequency-based human MOtion TRansfer framework, named FreMOTR, which can effectively mitigate the spatial artifacts and the temporal inconsistency of the synthesized videos.
arXiv Detail & Related papers (2022-09-01T05:30:23Z)
- FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
- Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning [36.352159541825095]
We propose FreqAug, a frequency-domain data augmentation for video representation learning.
FreqAug pushes the model to focus more on dynamic features in the video by dropping spatial or temporal low-frequency components (see the sketch after this list).
To verify the generality of the proposed method, we experiment with FreqAug on multiple self-supervised learning frameworks along with standard augmentations.
arXiv Detail & Related papers (2022-04-08T06:19:32Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48 kHz audio signals while running 20 times faster than real time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- Focal Frequency Loss for Image Reconstruction and Synthesis [125.7135706352493]
We show that narrowing gaps in the frequency domain can further improve image reconstruction and synthesis quality.
We propose a novel focal frequency loss, which allows a model to adaptively focus on frequency components that are hard to synthesize (see the sketch after this list).
arXiv Detail & Related papers (2020-12-23T17:32:04Z)
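Forward-referenced from the FreqAug entry above: a minimal sketch of the spatial variant of dropping low-frequency components, assuming PyTorch and a simple centered square mask in the shifted spectrum (the function name and `cutoff` fraction are illustrative, not FreqAug's actual implementation):

```python
import torch

def drop_spatial_low_freq(frames, cutoff=0.1):
    """Illustrative sketch: high-pass filter video frames so a model must
    rely on higher-frequency (detail and motion) cues during training.
    frames: (T, C, H, W) float tensor; cutoff: fraction of each spatial
    axis around DC to zero out."""
    H, W = frames.shape[-2:]
    spec = torch.fft.fft2(frames, norm="ortho")
    spec = torch.fft.fftshift(spec, dim=(-2, -1))  # move DC to the center

    # Zero a centered rectangle of low frequencies around DC.
    h_cut, w_cut = int(H * cutoff), int(W * cutoff)
    cy, cx = H // 2, W // 2
    spec[..., cy - h_cut:cy + h_cut, cx - w_cut:cx + w_cut] = 0

    spec = torch.fft.ifftshift(spec, dim=(-2, -1))
    return torch.fft.ifft2(spec, norm="ortho").real
```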
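Also forward-referenced from the focal frequency loss entry: a rough PyTorch sketch in the spirit of that loss, where frequency components with larger current errors receive larger weights (the exact weighting and normalization in the cited paper may differ):

```python
import torch

def focal_frequency_loss(pred, target, alpha=1.0):
    """Illustrative sketch: emphasize frequency components that are
    currently hard to synthesize. pred, target: (B, C, H, W) tensors."""
    pred_spec = torch.fft.fft2(pred, norm="ortho")
    target_spec = torch.fft.fft2(target, norm="ortho")

    # Squared distance between complex spectra at every frequency.
    dist = torch.abs(pred_spec - target_spec) ** 2

    # Per-frequency focusing weight: larger where the error is larger.
    # Detached so it only re-weights, without receiving gradients itself.
    weight = (torch.abs(pred_spec - target_spec) ** alpha).detach()
    weight = weight / (weight.max() + 1e-8)

    return torch.mean(weight * dist)
```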