Learning Audio-Visual embedding for Wild Person Verification
- URL: http://arxiv.org/abs/2209.04093v1
- Date: Fri, 9 Sep 2022 02:29:47 GMT
- Title: Learning Audio-Visual embedding for Wild Person Verification
- Authors: Peiwen Sun, Shanshan Zhang, Zishan Liu, Yougen Yuan, Taotao Zhang,
Honggang Zhang, Pengfei Hu
- Abstract summary: We propose an audio-visual network that considers the aggregator from a fusion perspective.
We introduce improved attentive statistics pooling to face verification for the first time.
Finally, the modalities are fused with a gated attention mechanism.
- Score: 18.488385598522125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It has been observed that audio-visual embeddings extracted from the
audio and visual modalities gain robustness for person verification. However,
the aggregator used to generate a single utterance-level representation from
the frame-level features has not been well explored. In this article, we propose
an audio-visual network that considers the aggregator from a fusion perspective.
We introduce improved attentive statistics pooling to face verification for the
first time. We then find that a strong correlation exists between the modalities
during pooling, so we propose joint attentive pooling, which uses cycle
consistency to learn the implicit inter-frame weights. Finally, the modalities
are fused with a gated attention mechanism. All the proposed models are trained
on the VoxCeleb2 dev dataset, and the best system obtains 0.18%, 0.27%, and
0.49% EER on the three official trial lists of VoxCeleb1, respectively, which
are, to our knowledge, the best published results for person verification. As
an analysis, visualization maps are generated to explain how this system
interacts between the modalities.
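
To make the aggregator concrete, the sketch below shows standard attentive statistics pooling (an attention-weighted mean and standard deviation over frames) in PyTorch. It is a minimal illustration assuming 512-dimensional frame features and an arbitrary bottleneck size; the paper's "improved" variant is not reproduced here.

```python
# Minimal sketch of standard attentive statistics pooling (not the authors' code).
# Frame-level features (B, T, D) are reduced to an utterance-level (B, 2D) vector
# by concatenating an attention-weighted mean and standard deviation.
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 128):
        super().__init__()
        # Scalar attention score per frame from a small bottleneck MLP.
        self.attn = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.attn(x), dim=1)        # (B, T, 1) frame weights
        mu = (w * x).sum(dim=1)                        # attention-weighted mean
        var = (w * x * x).sum(dim=1) - mu * mu         # attention-weighted variance
        sigma = torch.sqrt(var.clamp(min=1e-6))        # attention-weighted std
        return torch.cat([mu, sigma], dim=-1)          # (B, 2D)


if __name__ == "__main__":
    frames = torch.randn(4, 50, 512)                   # assumed (batch, frames, dim)
    print(AttentiveStatsPooling(512)(frames).shape)    # torch.Size([4, 1024])
```

The joint attentive pooling described in the abstract additionally couples the audio and visual frame weights through a cycle-consistency term, which is omitted from this sketch.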
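
The abstract also states that the pooled modalities are fused with a gated attention mechanism. Below is a minimal sketch of one common gated-fusion formulation, assuming pooled audio and face embeddings of equal dimension; the exact gating used in the paper may differ.

```python
# Hedged sketch of a gated fusion layer (a common formulation, not necessarily the
# paper's exact mechanism): a sigmoid gate mixes the two modality embeddings per
# dimension, so an unreliable modality can be down-weighted.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)            # gate from concatenated inputs

    def forward(self, audio: torch.Tensor, face: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([audio, face], dim=-1)))
        return g * audio + (1.0 - g) * face            # convex per-dimension mixture


if __name__ == "__main__":
    a = torch.randn(4, 1024)                           # e.g. pooled audio embedding
    v = torch.randn(4, 1024)                           # e.g. pooled face embedding
    print(GatedFusion(1024)(a, v).shape)               # torch.Size([4, 1024])
```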
Related papers
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- A Study of Multimodal Person Verification Using Audio-Visual-Thermal Data [4.149096351426994]
We study an approach to multimodal person verification using audio, visual, and thermal modalities.
We implement unimodal, bimodal, and trimodal verification systems using state-of-the-art deep learning architectures.
arXiv Detail & Related papers (2021-10-23T04:41:03Z)
- Summarize and Search: Learning Consensus-aware Dynamic Convolution for Co-Saliency Detection [139.10628924049476]
Humans perform co-saliency detection by first summarizing the consensus knowledge in the whole group and then searching corresponding objects in each image.
Previous methods usually lack robustness, scalability, or stability for the first process and simply fuse consensus features with image features for the second process.
We propose a novel consensus-aware dynamic convolution model to explicitly and effectively perform the "summarize and search" process.
arXiv Detail & Related papers (2021-10-01T12:06:42Z)
- Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification [4.191965713559235]
This paper presents a multi-modal model for automatic scene classification.
It simultaneously exploits auditory and visual information.
It has been shown to provide an excellent trade-off between prediction performance and system complexity.
arXiv Detail & Related papers (2021-07-28T06:10:10Z)
- Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing [72.4445825335561]
We propose a simple method to derive a 2D representation from detection scores produced by an arbitrary set of binary classifiers.
Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores.
While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems.
arXiv Detail & Related papers (2021-06-11T13:03:33Z)
- Positive Sample Propagation along the Audio-Visual Event Line [29.25572713908162]
Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs).
We propose a new positive sample propagation (PSP) module to discover and exploit closely related audio-visual pairs.
We perform extensive experiments on the public AVE dataset and achieve new state-of-the-art accuracy in both fully and weakly supervised settings.
arXiv Detail & Related papers (2021-04-01T03:53:57Z)
- A Multi-View Approach To Audio-Visual Speaker Verification [38.9710777250597]
In this study, we explore audio-visual approaches to speaker verification.
We report the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset.
This new approach achieves 28% EER on VoxCeleb1 in the challenging testing condition of cross-modal verification.
arXiv Detail & Related papers (2021-02-11T22:29:25Z)
- Multimodal Attention Fusion for Target Speaker Extraction [108.73502348754842]
We propose a novel attention mechanism for multi-modal fusion and its training methods.
Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data.
arXiv Detail & Related papers (2021-02-02T05:59:35Z)