Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in
Transformers
- URL: http://arxiv.org/abs/2212.04970v1
- Date: Fri, 9 Dec 2022 16:32:46 GMT
- Title: Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in
Transformers
- Authors: Yasheng Sun, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Zhibin Hong,
Jingtuo Liu, Errui Ding, Jingdong Wang, Ziwei Liu, Hideki Koike
- Abstract summary: Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions.
We propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality.
Our model can generate high-fidelity lip-synced results for arbitrary subjects.
- Score: 91.00397473678088
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous studies have explored generating accurately lip-synced talking faces
for arbitrary targets given audio conditions. However, most of them deform or
generate the whole facial area, leading to non-realistic results. In this work,
we delve into the formulation of altering only the mouth shapes of the target
person. This requires masking a large percentage of the original image and
seamlessly inpainting it with the aid of audio and reference frames. To this
end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework,
which produces accurate lip-sync with photo-realistic quality by predicting the
masked mouth shapes. Our key insight is to thoroughly exploit the contextual
information provided by the audio and visual modalities with delicately
designed Transformers. Specifically, we propose a convolution-Transformer
hybrid backbone and design an attention-based fusion strategy for filling in
the masked parts, which uniformly attends to the textural information of the
unmasked regions and the reference frame. The semantic audio information is
then incorporated to enhance the self-attention computation. Additionally, a
refinement network with audio injection improves both image and lip-sync
quality. Extensive experiments validate that our model can generate
high-fidelity lip-synced results for arbitrary subjects.
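To make the attention-based fusion described in the abstract more concrete, the snippet below is a minimal PyTorch sketch, not the authors' code, of one plausible fusion block: visual tokens from the masked target frame and the reference frame are processed with joint self-attention, and audio features are injected via cross-attention. All class names, dimensions, and the exact audio-injection mechanism are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch of an audio-visual fusion block in the spirit of AV-CAT.
# Not the authors' implementation; names, shapes, and hyper-parameters are assumed.
import torch
import torch.nn as nn


class AudioVisualFusionBlock(nn.Module):
    """Attends jointly over tokens from the masked target frame and a reference
    frame, then injects audio context via cross-attention (one possible reading
    of 'semantic audio information enhances the self-attention computation')."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, vis_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # Joint self-attention over masked-frame and reference-frame tokens,
        # so masked positions can borrow texture from unmasked/reference regions.
        x = vis_tokens
        n = self.norm1(x)
        h, _ = self.self_attn(n, n, n)
        x = x + h
        # Audio conditioning: visual queries attend to audio keys/values.
        h, _ = self.audio_attn(self.norm2(x), audio_tokens, audio_tokens)
        x = x + h
        # Position-wise feed-forward.
        return x + self.ffn(self.norm3(x))


if __name__ == "__main__":
    B, N_VIS, N_AUD, D = 2, 2 * 14 * 14, 20, 256   # hypothetical token counts
    vis = torch.randn(B, N_VIS, D)    # masked target frame + reference frame tokens
    aud = torch.randn(B, N_AUD, D)    # audio features projected to the same width
    out = AudioVisualFusionBlock(dim=D)(vis, aud)
    print(out.shape)                  # torch.Size([2, 392, 256])
```

In the paper, the backbone is a convolution-Transformer hybrid and a separate audio-injected refinement network further improves image and lip-sync quality; the cross-attention above is only one simple way to express the audio conditioning.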
Related papers
- SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing [19.245228801339007]
We propose a novel framework called SegTalker to decouple lip movements from image textures.
We disentangle semantic regions of the image into style codes using a mask-guided encoder.
Ultimately, we inject the previously generated talking segmentation and style codes into a mask-guided StyleGAN to synthesize video frames (a generic sketch of this mask-guided compositing idea appears after this list).
arXiv Detail & Related papers (2024-09-05T15:11:40Z)
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory [27.255990661166614]
The challenge of talking face generation from speech lies in aligning two different modal information, audio and video, such that the mouth region corresponds to input audio.
Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models.
We propose Audio-Lip Memory that brings in visual information of the mouth region corresponding to input audio and enforces fine-grained audio-visual coherence.
arXiv Detail & Related papers (2022-11-02T07:17:49Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
- Everybody's Talkin': Let Me Talk as You Want [134.65914135774605]
We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video.
It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output.
arXiv Detail & Related papers (2020-01-15T09:54:23Z)
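Several of the methods above, such as AV-CAT's masked-mouth inpainting and SegTalker's mask-guided local editing, ultimately modify only a masked region of each frame and keep the surrounding pixels untouched. Below is a minimal, generic sketch of that mask-guided compositing step; the tensor shapes and the lower-half mask are illustrative assumptions, not code from any of the listed papers.

```python
# Generic mask-guided compositing: replace only the masked (mouth) region with
# generated pixels and keep the rest of the original frame exactly as it was.
# Purely illustrative; not taken from any of the papers above.
import torch


def composite(original: torch.Tensor, generated: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """original, generated: (B, 3, H, W) images in [0, 1];
    mask: (B, 1, H, W), 1 inside the edited region, 0 elsewhere."""
    return mask * generated + (1.0 - mask) * original


if __name__ == "__main__":
    B, H, W = 1, 256, 256
    original = torch.rand(B, 3, H, W)
    generated = torch.rand(B, 3, H, W)
    mask = torch.zeros(B, 1, H, W)
    mask[:, :, H // 2:, :] = 1.0      # e.g. the lower half of the face crop
    print(composite(original, generated, mask).shape)  # torch.Size([1, 3, 256, 256])
```

Compositing outside the generator is a common way to guarantee that identity and background pixels beyond the edited region are preserved exactly.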