Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in
Transformers
- URL: http://arxiv.org/abs/2212.04970v1
- Date: Fri, 9 Dec 2022 16:32:46 GMT
- Title: Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in
Transformers
- Authors: Yasheng Sun, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Zhibin Hong,
Jingtuo Liu, Errui Ding, Jingdong Wang, Ziwei Liu, Hideki Koike
- Abstract summary: Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
- Score: 91.00397473678088
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous studies have explored generating accurately lip-synced talking faces
for arbitrary targets given audio conditions. However, most of them deform or
generate the whole facial area, leading to non-realistic results. In this work,
we delve into the formulation of altering only the mouth shapes of the target
person. This requires masking a large percentage of the original image and
seamlessly inpainting it with the aid of audio and reference frames. To this
end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework,
which produces accurate lip-sync with photo-realistic quality by predicting the
masked mouth shapes. Our key insight is to exploit desired contextual
information provided in audio and visual modalities thoroughly with delicately
designed Transformers. Specifically, we propose a convolution-Transformer
hybrid backbone and design an attention-based fusion strategy for filling the
masked parts. It uniformly attends to the textural information on the unmasked
regions and the reference frame. Then the semantic audio information is
involved in enhancing the self-attention computation. Additionally, a
refinement network with audio injection improves both image and lip-sync
quality. Extensive experiments validate that our model can generate
high-fidelity lip-synced results for arbitrary subjects.
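The abstract describes an attention-based fusion step: masked mouth-region tokens attend uniformly over textural tokens from the unmasked regions and a reference frame, with semantic audio information involved in the attention computation. The paper does not give equations here, so the following is only a minimal NumPy sketch under stated assumptions: token dimensions, the concatenation of the two visual contexts, and adding the audio feature to the queries are all illustrative choices, not the authors' actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_aware_fusion(masked_q, unmasked_kv, reference_kv, audio_feat):
    """Sketch of audio-conditioned attention over visual context.

    masked_q     : (M, d) query tokens for the masked mouth region
    unmasked_kv  : (U, d) tokens from unmasked image regions
    reference_kv : (R, d) tokens from the reference frame
    audio_feat   : (d,)   audio feature for the current frame (assumed shape)
    """
    d = masked_q.shape[-1]
    # One context set: unmasked-region and reference-frame tokens,
    # attended to uniformly (no per-source weighting), as the abstract suggests.
    context = np.concatenate([unmasked_kv, reference_kv], axis=0)
    # Illustrative audio conditioning: bias every query with the audio feature.
    q = masked_q + audio_feat
    scores = q @ context.T / np.sqrt(d)          # (M, U+R)
    weights = softmax(scores, axis=-1)
    return weights @ context                     # (M, d) fused mouth tokens

# Usage: fill 4 masked tokens from 10 unmasked + 6 reference tokens (d=32).
rng = np.random.default_rng(0)
out = audio_aware_fusion(rng.normal(size=(4, 32)),
                         rng.normal(size=(10, 32)),
                         rng.normal(size=(6, 32)),
                         rng.normal(size=32))
```

In the paper this fusion feeds a refinement network with further audio injection; the sketch stops at the fused tokens.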
Related papers
- LPIPS-AttnWav2Lip: Generic Audio-Driven lip synchronization for Talking Head Generation in the Wild [9.682333912273906]
This paper proposes a generic method, LPIPS-AttnWav2Lip, for reconstructing face images of any speaker based on audio.
The proposed method achieves outstanding performance in terms of lip synchronization accuracy and visual quality.
arXiv Detail & Related papers (2026-01-30T08:02:49Z) - SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild [16.692450893925148]
SyncAnyone is a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously.
We develop a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video.
We further tune the stage 2 model on this synthetic data, achieving precise lip editing and better background consistency.
arXiv Detail & Related papers (2025-12-25T16:49:40Z) - Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework [56.30142869506262]
Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion.
This mechanism can introduce lip leaking, where generated lips are influenced by the reference image rather than solely by the driving audio.
We propose a systematic evaluation methodology to analyze and quantify lip leakage.
arXiv Detail & Related papers (2025-11-05T17:11:53Z) - Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation [54.52905471078152]
We propose a mask-free talking face generation approach while maintaining the 2D-based face editing task.
We transform the input images to have closed mouths, using a two-step landmark-based approach trained in an unpaired manner.
arXiv Detail & Related papers (2025-07-28T16:03:36Z) - SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing [19.245228801339007]
We propose a novel framework called SegTalker to decouple lip movements and image textures.
We disentangle semantic regions of the image into style codes using a mask-guided encoder.
Ultimately, we inject the previously generated talking segmentation and style codes into a mask-guided StyleGAN to synthesize the video frames.
arXiv Detail & Related papers (2024-09-05T15:11:40Z) - RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z) - GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained
3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3d face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - Identity-Preserving Talking Face Generation with Landmark and Appearance
Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z) - SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via
Audio-Lip Memory [27.255990661166614]
The challenge of talking face generation from speech lies in aligning two different modal information, audio and video, such that the mouth region corresponds to input audio.
Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models.
We propose Audio-Lip Memory that brings in visual information of the mouth region corresponding to input audio and enforces fine-grained audio-visual coherence.
arXiv Detail & Related papers (2022-11-02T07:17:49Z) - Pose-Controllable Talking Face Generation by Implicitly Modularized
Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z) - Everybody's Talkin': Let Me Talk as You Want [134.65914135774605]
We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video.
It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output.
arXiv Detail & Related papers (2020-01-15T09:54:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.