Exploring Deep Learning for Joint Audio-Visual Lip Biometrics
- URL: http://arxiv.org/abs/2104.08510v1
- Date: Sat, 17 Apr 2021 10:51:55 GMT
- Title: Exploring Deep Learning for Joint Audio-Visual Lip Biometrics
- Authors: Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Chang Zeng, Jianwu Dang
- Abstract summary: Audio-visual (AV) lip biometrics is a promising authentication technique that leverages the benefits of both the audio and visual modalities in speech communication.
The lack of a sizeable AV database hinders the exploration of deep-learning-based audio-visual lip biometrics.
We establish the DeepLip AV lip biometrics system realized with a convolutional neural network (CNN) based video module, a time-delay neural network (TDNN) based audio module, and a multimodal fusion module.
- Score: 54.32039064193566
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual (AV) lip biometrics is a promising authentication technique that
leverages the benefits of both the audio and visual modalities in speech
communication. Previous works have demonstrated the usefulness of AV lip
biometrics. However, the lack of a sizeable AV database hinders the exploration
of deep-learning-based audio-visual lip biometrics. To address this problem, we
compile a moderate-size database using existing public databases. Meanwhile, we
establish the DeepLip AV lip biometrics system realized with a convolutional
neural network (CNN) based video module, a time-delay neural network (TDNN)
based audio module, and a multimodal fusion module. Our experiments show that
DeepLip outperforms traditional speaker recognition models in context modeling
and achieves relative improvements of over 50% compared with our best single-modality
baseline, with equal error rates of 0.75% and 1.11% on the test datasets, respectively.
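The abstract names the three modules but not their configurations; below is a minimal PyTorch sketch of such a two-branch audio-visual system. The layer sizes, pooling choices, and embedding-level fusion are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of a DeepLip-style audio-visual lip biometrics model.
# The paper specifies a CNN video module, a TDNN audio module, and a fusion module;
# all dimensions and layer counts below are illustrative assumptions.
import torch
import torch.nn as nn


class VideoCNN(nn.Module):
    """CNN video module: lip-region frame sequence -> utterance-level embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 1, 1)),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, frames):            # frames: (B, 1, T, H, W)
        x = self.conv(frames).flatten(1)  # (B, 64)
        return self.proj(x)               # (B, embed_dim)


class AudioTDNN(nn.Module):
    """TDNN (x-vector style) audio module built from dilated 1-D convolutions."""
    def __init__(self, n_mels=40, embed_dim=256):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(n_mels, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.proj = nn.Linear(2 * 512, embed_dim)  # statistics pooling: mean + std

    def forward(self, feats):                      # feats: (B, n_mels, T)
        h = self.frame_layers(feats)               # (B, 512, T')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.proj(stats)                    # (B, embed_dim)


class DeepLipFusion(nn.Module):
    """Embedding-level fusion of the two modalities into a single speaker embedding."""
    def __init__(self, embed_dim=256, num_speakers=1000):
        super().__init__()
        self.video = VideoCNN(embed_dim)
        self.audio = AudioTDNN(embed_dim=embed_dim)
        self.fusion = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.ReLU())
        self.classifier = nn.Linear(embed_dim, num_speakers)  # training-time head

    def forward(self, frames, feats):
        emb = self.fusion(torch.cat([self.video(frames), self.audio(feats)], dim=1))
        return emb, self.classifier(emb)
```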
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
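A heavily hedged sketch of the pseudo-labelling step mentioned above: an existing ASR model transcribes unlabelled audio so it can be added to the training data. The `transcribe` interface is a hypothetical stand-in, not a specific library call.

```python
# Hypothetical sketch: generate automatic transcriptions for unlabelled utterances
# so they can extend the audio-visual training set.
def pseudo_label(asr_model, unlabelled_utterances):
    """Pair each unlabelled utterance with an automatic transcription (assumed interface)."""
    return [(utt, asr_model.transcribe(utt)) for utt in unlabelled_utterances]
```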
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions [36.15815562576836]
Time-domain single-channel speech enhancement (SE) remains challenging when the target speaker must be extracted without prior information under multi-talker conditions.
We propose a novel time-domain brain-assisted SE network (BASEN) incorporating electroencephalography (EEG) signals recorded from the listener for extracting the target speaker from monaural speech mixtures.
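BASEN fuses the two streams with convolutional cross attention; as a rough illustration of audio features attending to EEG features, the sketch below uses a generic multi-head cross-attention layer as a stand-in. The dimensions and the residual connection are assumptions, not the paper's design.

```python
# Generic EEG-audio cross-attention sketch (not BASEN's convolutional cross attention).
import torch.nn as nn


class EEGAudioCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_feats, eeg_feats):
        # Audio queries attend to EEG keys/values to bias enhancement toward the attended talker.
        fused, _ = self.attn(audio_feats, eeg_feats, eeg_feats)
        return fused + audio_feats  # residual connection (assumed)
```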
arXiv Detail & Related papers (2023-05-17T06:40:31Z)
- MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification [0.0]
We present a comprehensive corpus designed for training and evaluating text-independent multi-channel speaker verification systems.
It can also be readily used for experiments with dereverberation, denoising, and speech enhancement.
arXiv Detail & Related papers (2021-11-11T20:55:58Z)
- Solving Mixed Integer Programs Using Neural Networks [57.683491412480635]
This paper applies learning to the two key sub-tasks of a MIP solver: generating a high-quality joint variable assignment, and bounding the gap in objective value between that assignment and an optimal one.
Our approach constructs two corresponding neural network-based components, Neural Diving and Neural Branching, to use in a base MIP solver such as SCIP.
We evaluate our approach on six diverse real-world datasets, including two Google production datasets and MIPLIB, by training separate neural networks on each.
arXiv Detail & Related papers (2020-12-23T09:33:11Z)
- Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that the proposed heuristic error assignment training (HEAT) achieves better accuracy than permutation invariant training (PIT).
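For reference, a minimal two-speaker, utterance-level PIT loss can be sketched as below; this is a generic illustration (not the SURT/HEAT training code) that picks whichever output-to-reference assignment gives the lower loss.

```python
# Minimal two-speaker utterance-level PIT sketch (illustrative only).
import torch


def pit_loss_2spk(pred1, pred2, ref1, ref2, criterion):
    """Return the loss under the better of the two output-to-reference assignments."""
    loss_a = criterion(pred1, ref1) + criterion(pred2, ref2)
    loss_b = criterion(pred1, ref2) + criterion(pred2, ref1)
    return torch.minimum(loss_a, loss_b)
```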
arXiv Detail & Related papers (2020-11-26T06:28:04Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for estimating the separation performance of state-of-the-art deep learning approaches.
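A minimal sketch of the oracle ideal-ratio-mask idea, assuming the magnitude spectrograms of the true stems are available; normalizing magnitudes (rather than powers) across sources is an assumption here.

```python
# Oracle ideal-ratio-mask (IRM) sketch: with the true source spectrograms known, the
# masked mixture gives an upper-bound estimate used as a proxy for learned separators.
import numpy as np


def ideal_ratio_masks(source_mags, eps=1e-8):
    """source_mags: (n_sources, freq, time) magnitude spectrograms of the true stems."""
    total = source_mags.sum(axis=0, keepdims=True) + eps
    return source_mags / total                      # one soft mask per source, in [0, 1]


def oracle_estimates(mixture_stft, masks):
    """Apply each oracle mask to the complex mixture STFT to get oracle source estimates."""
    return masks * mixture_stft[np.newaxis, ...]
```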
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- Transfer Learning and SpecAugment applied to SSVEP Based BCI Classification [1.9336815376402716]
We use deep convolutional neural networks (DCNNs) to classify EEG signals in a single-channel brain-computer interface (BCI).
EEG signals were converted to spectrograms and served as input to train DCNNs using the transfer learning technique.
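A hedged sketch of that pipeline: single-channel EEG is turned into a log spectrogram and a pretrained CNN is fine-tuned on the resulting images. The ResNet-18 backbone, sampling rate, and two-class head are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative pipeline: EEG -> log spectrogram -> fine-tune a pretrained CNN.
import numpy as np
from scipy.signal import spectrogram
import torch
import torchvision


def eeg_to_log_spectrogram(signal, fs=250):
    """Single-channel EEG samples -> (freq, time) log-power spectrogram."""
    _, _, sxx = spectrogram(signal, fs=fs, nperseg=fs)
    return np.log(sxx + 1e-10)


# Transfer learning: reuse ImageNet weights, replace only the classification head.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # assumed number of SSVEP classes
# Note: the single-channel spectrogram would be replicated to three channels (or the
# first conv layer adapted) before being fed to the ImageNet-pretrained backbone.
```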
arXiv Detail & Related papers (2020-10-08T00:30:12Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
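The cell-stacking step can be sketched as below; `cell_factory` stands in for whatever cell the search produces, and all sizes are illustrative assumptions.

```python
# Illustrative "stack the searched cell" step: once NAS has fixed the operations inside
# a cell, the final speaker-recognition CNN repeats that cell several times.
import torch.nn as nn


def build_network(cell_factory, num_cells=8, channels=64, num_speakers=1000):
    """cell_factory(channels) -> nn.Module for one searched cell (hypothetical interface)."""
    cells = [cell_factory(channels) for _ in range(num_cells)]
    return nn.Sequential(
        nn.Conv2d(1, channels, kernel_size=3, padding=1),  # stem on spectrogram input
        *cells,
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(channels, num_speakers),
    )
```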
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- A Novel Deep Learning Architecture for Decoding Imagined Speech from EEG [2.4063592468412267]
We present a novel architecture that employs a deep neural network (DNN) for classifying the words "in" and "cooperate".
Nine EEG channels, which best capture the underlying cortical activity, are chosen using common spatial pattern.
We have achieved accuracies comparable to the state-of-the-art results.
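Common spatial pattern filtering reduces to a generalized eigenvalue problem on the two class covariance matrices; the sketch below is a generic CSP implementation, not the paper's exact channel-selection procedure.

```python
# Generic common spatial pattern (CSP) sketch: spatial filters that maximize variance
# for one class versus the other, which can then be used to rank EEG channels.
import numpy as np
from scipy.linalg import eigh


def csp_filters(trials_a, trials_b):
    """trials_*: list of (channels, samples) EEG trials for each class."""
    cov = lambda trials: np.mean([x @ x.T / np.trace(x @ x.T) for x in trials], axis=0)
    ca, cb = cov(trials_a), cov(trials_b)
    eigvals, eigvecs = eigh(ca, ca + cb)   # generalized eigenvalue problem
    order = np.argsort(eigvals)[::-1]      # sort filters by discriminative power
    return eigvecs[:, order].T             # rows = spatial filters
```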
arXiv Detail & Related papers (2020-03-19T00:57:40Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at achieving two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)