Three-Dimensional Lip Motion Network for Text-Independent Speaker
Recognition
- URL: http://arxiv.org/abs/2010.06363v1
- Date: Tue, 13 Oct 2020 13:18:33 GMT
- Title: Three-Dimensional Lip Motion Network for Text-Independent Speaker
Recognition
- Authors: Jianrong Wang and Tong Wu and Shanyu Wang and Mei Yu and Qiang Fang
and Ju Zhang and Li Liu
- Abstract summary: Lip motion reflects the behavioral characteristics of a speaker and can be used as a new kind of biometric for speaker recognition.
We present a novel end-to-end 3D Lip Motion Network (3LMNet) that utilizes sentence-level 3D lip motion.
A new regional feedback module (RFM) is proposed to obtain attention over different lip regions.
- Score: 24.433021731098474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip motion reflects the behavioral characteristics of a speaker and
thus can be used as a new kind of biometric for speaker recognition. In the
literature, many works have used two-dimensional (2D) lip images to recognize
speakers in a text-dependent context. However, 2D lip images are sensitive to
variations in face orientation. To address this, we present a novel end-to-end
3D Lip Motion Network (3LMNet) that utilizes sentence-level 3D lip motion
(S3DLM) to recognize speakers in both text-independent and text-dependent
contexts. A new regional feedback module (RFM) is proposed to obtain attention
over different lip regions. In addition, prior knowledge of lip motion is
investigated to complement the RFM, where landmark-level and frame-level
features are merged to form a better feature representation. Moreover, we
present two pre-processing methods, coordinate transformation and face posture
correction, for the LSD-AV dataset, which contains 68 speakers and 146
sentences per speaker. Evaluation results on this dataset demonstrate that the
proposed 3LMNet is superior to the baseline models, i.e., LSTM, VGG-16 and
ResNet-34, and outperforms the state of the art using 2D lip images as well as
the 3D face. The code of this work is released at
https://github.com/wutong18/Three-Dimensional-Lip-Motion-Network-for-Text-Independent-Speaker-Recognition.
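To make the pipeline described in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch of how region-wise attention over 3D lip landmarks and frame-level temporal pooling could fit together. It is not the authors' released implementation (see the GitHub repository above for that); the region partition, feature sizes, the GRU used for sentence-level pooling, and all identifier names are assumptions made only for illustration.
```python
# Minimal sketch (assumptions, not the authors' implementation): regional
# attention over 3D lip landmarks plus frame-level temporal pooling.
import torch
import torch.nn as nn


class RegionalFeedback(nn.Module):
    """Scores each lip-region feature and fuses them with soft attention."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.ReLU(),
            nn.Linear(feat_dim // 2, 1),
        )

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, num_regions, feat_dim)
        attn = torch.softmax(self.score(region_feats), dim=1)  # (B, R, 1)
        return (attn * region_feats).sum(dim=1)                # (B, feat_dim)


class Lip3DSketch(nn.Module):
    """Per-region frame encoding -> regional attention -> temporal pooling."""

    def __init__(self, landmarks_per_region=(5, 5, 5, 5), feat_dim=128,
                 num_speakers=68):
        super().__init__()
        self.regions = list(landmarks_per_region)
        self.encoders = nn.ModuleList(
            [nn.Linear(n * 3, feat_dim) for n in self.regions])
        self.rfm = RegionalFeedback(feat_dim)
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_speakers)

    def forward(self, lips: torch.Tensor) -> torch.Tensor:
        # lips: (batch, frames, total_landmarks, 3) 3D lip landmark sequence
        b, t = lips.shape[:2]
        chunks = torch.split(lips, self.regions, dim=2)  # per-region landmarks
        feats = torch.stack(
            [enc(c.flatten(2)) for enc, c in zip(self.encoders, chunks)],
            dim=2)                                       # (B, T, R, D)
        fused = self.rfm(feats.reshape(b * t, len(self.regions), -1))
        fused = fused.reshape(b, t, -1)                  # frame-level features
        _, hidden = self.temporal(fused)                 # sentence-level summary
        return self.classifier(hidden[-1])               # speaker logits
```
Under these assumptions, `Lip3DSketch()(torch.randn(2, 75, 20, 3))` returns speaker logits for a batch of two 75-frame sequences of 20 lip landmarks.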
Related papers
- LASER: Lip Landmark Assisted Speaker Detection for Robustness [30.82311863795508]
We propose Lip landmark Assisted Speaker dEtection for Robustness (LASER).
LASER aims to identify speaking individuals in complex visual scenes by matching lip movements to audio.
Experiments show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals.
arXiv Detail & Related papers (2025-01-21T05:29:34Z)
- S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis [14.437741528053504]
We design a Single-Shot Speech-Driven Radiance Field (S3D-NeRF) method to tackle three difficulties: learning a representative appearance feature for each identity, modeling the motion of different face regions with audio, and keeping the temporal consistency of the lip area.
Our S3D-NeRF surpasses previous arts on both video fidelity and audio-lip synchronization.
arXiv Detail & Related papers (2024-08-18T03:59:57Z)
- Learn2Talk: 3D Talking Face Learns from 2D Talking Face [15.99315075587735]
We propose a learning framework named Learn2Talk, which constructs a stronger 3D talking face network by learning from 2D talking face methods.
Inspired by the audio-video sync network, a 3D sync-lip expert model is devised for the pursuit of lip-sync.
A teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D motions regression network.
arXiv Detail & Related papers (2024-04-19T13:45:14Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be portrayed well by a few facial images, or even a single image, using shallow networks.
In contrast, the fine-grained dynamic features associated with the speech content of a talking face require deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving a 26% word error rate (WER).
We believe that reprogramming VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces [28.40393487247833]
Speech-driven 3D face animation techniques are extending their applications to various multimedia fields.
Previous research has generated promising realistic lip movements and facial expressions from audio signals.
We propose a novel framework, SelfTalk, which incorporates self-supervision into a cross-modal network system to learn 3D talking faces.
arXiv Detail & Related papers (2023-06-19T09:39:10Z)
- Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation [58.72068260933836]
We propose the Context-Aware LipSync framework (CALS).
CALS comprises an Audio-to-Lip map module and a Lip-to-Face module.
arXiv Detail & Related papers (2023-05-31T04:50:32Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the visual and audio encoders learn to extract features directly from raw pixels and audio waveforms, respectively.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization [73.62550438861942]
This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end neural network manner, called directional automatic speech recognition (D-ASR).
In D-ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance.
arXiv Detail & Related papers (2020-10-30T20:26:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.