Audio-Visual Speaker Verification via Joint Cross-Attention
- URL: http://arxiv.org/abs/2309.16569v1
- Date: Thu, 28 Sep 2023 16:25:29 GMT
- Title: Audio-Visual Speaker Verification via Joint Cross-Attention
- Authors: R. Gnana Praveen, Jahangir Alam
- Abstract summary: We explore cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification.
We have shown that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification.
- Score: 4.229744884478575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker verification has been widely explored using speech signals, which has
shown significant improvement using deep models. Recently, there has been a
surge in exploring faces and voices as they can offer more complementary and
comprehensive information than relying only on a single modality of speech
signals. Though current methods in the literature on the fusion of faces and
voices have shown improvement over the individual face or voice modalities,
the potential of audio-visual fusion has not been fully explored for speaker
verification. Most of the existing methods based on audio-visual fusion either
rely on score-level fusion or simple feature concatenation. In this work, we
have explored cross-modal joint attention to fully leverage the inter-modal
complementary information and the intra-modal information for speaker
verification. Specifically, we estimate the cross-attention weights based on
the correlation between the joint feature representation and that of the
individual feature representations in order to effectively capture both
intra-modal as well as inter-modal relationships among the faces and voices. We
have shown that efficiently leveraging the intra- and inter-modal relationships
significantly improves the performance of audio-visual fusion for speaker
verification. The performance of the proposed approach has been evaluated on
the VoxCeleb1 dataset. Results show that the proposed approach can
significantly outperform the state-of-the-art methods of audio-visual fusion
for speaker verification.
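As a rough illustration of the fusion scheme described in the abstract, the sketch below computes cross-attention weights from the correlation between a joint (concatenated and projected) audio-visual representation and each individual modality, then pools the re-weighted features into an utterance-level embedding. This is a minimal sketch under assumed embedding sizes, pooling, and scoring choices, not the authors' implementation; all module and variable names are hypothetical.

```python
# Minimal, illustrative PyTorch sketch of joint cross-attention fusion.
# Embedding sizes, pooling, and cosine-similarity scoring are assumptions,
# not the authors' exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointCrossAttentionFusion(nn.Module):
    def __init__(self, dim_a=512, dim_v=512, dim_j=512):
        super().__init__()
        # Project the concatenated audio-visual features into a joint representation.
        self.joint_proj = nn.Linear(dim_a + dim_v, dim_j)
        # Learnable maps used to correlate the joint representation with each modality.
        self.corr_a = nn.Linear(dim_j, dim_a, bias=False)
        self.corr_v = nn.Linear(dim_j, dim_v, bias=False)

    def forward(self, audio, visual):
        # audio:  (batch, T, dim_a) frame-level voice embeddings
        # visual: (batch, T, dim_v) frame-level face embeddings
        joint = self.joint_proj(torch.cat([audio, visual], dim=-1))  # (B, T, dim_j)

        # Cross-attention weights from the correlation between the joint
        # representation and each individual modality (softmax over time).
        attn_a = F.softmax(torch.bmm(self.corr_a(joint), audio.transpose(1, 2)), dim=-1)
        attn_v = F.softmax(torch.bmm(self.corr_v(joint), visual.transpose(1, 2)), dim=-1)

        # Re-weight each modality with the joint-conditioned attention, then fuse
        # and pool into an utterance-level embedding.
        att_audio = torch.bmm(attn_a, audio)    # (B, T, dim_a)
        att_visual = torch.bmm(attn_v, visual)  # (B, T, dim_v)
        return torch.cat([att_audio, att_visual], dim=-1).mean(dim=1)


if __name__ == "__main__":
    # Verification stub: compare fused embeddings of two recordings by cosine similarity.
    fusion = JointCrossAttentionFusion()
    a, v = torch.randn(2, 50, 512), torch.randn(2, 50, 512)
    emb = fusion(a, v)
    score = F.cosine_similarity(emb[0:1], emb[1:2])
    print(score.item())
```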
Related papers
- Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization [25.213694510527436]
Most existing speaker diarization systems rely exclusively on unimodal acoustic information.
We propose a novel multimodal approach that jointly utilizes audio, visual, and semantic cues to enhance speaker diarization.
Our approach consistently outperforms state-of-the-art speaker diarization methods.
arXiv Detail & Related papers (2024-08-22T03:34:03Z)
- Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention [3.5803801804085347]
We introduce a joint cross-attentional model, where a joint audio-visual feature representation is employed in the cross-attention framework.
We also explore BLSTMs to improve the temporal modeling of audio-visual feature representations.
Results indicate that the proposed model shows promising improvement in fusion performance by adeptly capturing the intra- and inter-modal relationships (see the sketch after this entry).
arXiv Detail & Related papers (2024-03-07T16:57:45Z)
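The recursive-fusion entry above only outlines its method, so the following is a hedged sketch of how joint cross-attention might be applied recursively and followed by a BLSTM for temporal modeling; the number of iterations, dimensions, and pooling are assumptions rather than the paper's settings.

```python
# Illustrative sketch only: recursive joint cross-attention followed by a BLSTM.
# Dimensions and the number of fusion iterations are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecursiveJointCrossAttention(nn.Module):
    def __init__(self, dim=512, num_iters=2):
        super().__init__()
        self.joint_proj = nn.Linear(2 * dim, dim)
        self.num_iters = num_iters
        # Bidirectional LSTM models temporal dependencies of the fused features.
        self.blstm = nn.LSTM(2 * dim, dim, batch_first=True, bidirectional=True)

    def forward(self, audio, visual):
        # audio, visual: (batch, T, dim) frame-level embeddings
        for _ in range(self.num_iters):
            joint = self.joint_proj(torch.cat([audio, visual], dim=-1))
            attn_a = F.softmax(torch.bmm(joint, audio.transpose(1, 2)), dim=-1)
            attn_v = F.softmax(torch.bmm(joint, visual.transpose(1, 2)), dim=-1)
            # Re-weighted features are fed back as inputs to the next iteration.
            audio = torch.bmm(attn_a, audio)
            visual = torch.bmm(attn_v, visual)
        fused, _ = self.blstm(torch.cat([audio, visual], dim=-1))  # (B, T, 2*dim)
        return fused.mean(dim=1)  # utterance-level embedding
```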
- Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling [24.346868432774453]
Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment.
This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models.
We address training early fusion architectures by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion.
We propose an attention-based fusion module that captures interactions between local audio and visual representations, enhancing the model's ability to capture fine-grained interactions (see the training sketch after this entry).
arXiv Detail & Related papers (2023-12-02T03:38:49Z)
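The early-fusion entry above describes masked reconstruction only at a high level; the sketch below shows one plausible way to train a shared (early-fusion) audio-visual Transformer encoder with masked reconstruction. Patch sizes, masking ratio, depth, and loss details are assumptions, not the paper's configuration.

```python
# Illustrative sketch only (not the paper's architecture): a shared early-fusion
# audio-visual encoder trained with masked reconstruction. Patch sizes, masking
# ratio and encoder depth are assumptions.
import torch
import torch.nn as nn


class EarlyFusionMaskedAutoencoder(nn.Module):
    def __init__(self, dim=256, depth=4, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.audio_embed = nn.Linear(128, dim)   # e.g. mel-spectrogram patches
        self.video_embed = nn.Linear(768, dim)   # e.g. flattened image patches
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        # A single shared encoder sees audio and visual tokens together (early fusion).
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.audio_decoder = nn.Linear(dim, 128)
        self.video_decoder = nn.Linear(dim, 768)

    def forward(self, audio_patches, video_patches):
        # audio_patches: (B, Na, 128), video_patches: (B, Nv, 768)
        a = self.audio_embed(audio_patches)
        v = self.video_embed(video_patches)
        tokens = torch.cat([a, v], dim=1)                              # (B, Na+Nv, dim)
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        encoded = self.encoder(tokens)
        na = audio_patches.shape[1]
        rec_a = self.audio_decoder(encoded[:, :na])
        rec_v = self.video_decoder(encoded[:, na:])
        # Reconstruction loss is computed only on masked positions.
        mse_a = ((rec_a - audio_patches) ** 2).mean(-1)                # (B, Na)
        mse_v = ((rec_v - video_patches) ** 2).mean(-1)                # (B, Nv)
        mask_a, mask_v = mask[:, :na], mask[:, na:]
        loss = (mse_a * mask_a).sum() / mask_a.sum().clamp(min=1) \
             + (mse_v * mask_v).sum() / mask_v.sum().clamp(min=1)
        return loss
```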
- Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing [58.9467115916639]
We propose a messenger-guided mid-fusion transformer to reduce the uncorrelated cross-modal context in the fusion.
The messengers condense the full cross-modal context into a compact representation to only preserve useful cross-modal information (see the sketch after this entry).
We thus propose cross-audio prediction consistency to suppress the impact of irrelevant audio information on visual event prediction.
arXiv Detail & Related papers (2023-11-14T13:27:03Z)
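The messenger mechanism above is summarized in a single sentence, so the following hedged sketch shows one way such messenger tokens could condense cross-modal context before it is injected into the other modality; the number of messenger tokens and the attention configuration are assumptions.

```python
# Illustrative sketch only: "messenger" tokens condense cross-modal context into a
# compact representation before it reaches the other modality.
import torch
import torch.nn as nn


class MessengerCrossFusion(nn.Module):
    def __init__(self, dim=256, num_messengers=4, nhead=4):
        super().__init__()
        self.messengers = nn.Parameter(torch.randn(num_messengers, dim))
        self.collect = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, nhead, batch_first=True)

    def forward(self, target, source):
        # target, source: (B, T, dim) segment-level features of the two modalities.
        B = target.shape[0]
        msg = self.messengers.unsqueeze(0).expand(B, -1, -1)
        # 1) Messengers summarize the source modality into a few tokens.
        msg, _ = self.collect(msg, source, source)
        # 2) The target modality attends only to this compact summary, limiting
        #    how much uncorrelated cross-modal context can leak in.
        fused, _ = self.distribute(target, msg, msg)
        return target + fused  # residual connection
```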
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network that devotes its main training parameters to multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Audio-visual speech separation based on joint feature representation with cross-modal attention [45.210105822471256]
This study is inspired by learning joint feature representations from audio and visual streams with an attention mechanism.
To further improve audio-visual speech separation, the dense optical flow of lip motion is incorporated.
The overall performance improvement demonstrates that the additional motion network effectively enhances the visual representation of the combined lip images and audio signal.
arXiv Detail & Related papers (2022-03-05T04:39:46Z)
- Multimodal Attention Fusion for Target Speaker Extraction [108.73502348754842]
We propose a novel attention mechanism for multi-modal fusion and its training methods.
Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data.
arXiv Detail & Related papers (2021-02-02T05:59:35Z)
- Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement [94.0676772764248]
We propose a visual embedding approach to improving embedding-aware speech enhancement (EASE).
We first extract visual embedding from lip frames using a pre-trained phone or articulation place recognizer for visual-only EASE (VEASE).
Next, we extract audio-visual embedding from noisy speech and lip videos in an information intersection manner, utilizing the complementarity of audio and visual features for multi-modal EASE (MEASE).
arXiv Detail & Related papers (2020-09-21T01:26:19Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention [25.883429290596556]
The major challenge in audio-visual event localization task lies in how to fuse information from multiple modalities effectively.
Recent works have shown that attention mechanisms are beneficial to the fusion process.
We propose a novel joint attention mechanism with multimodal fusion methods for audio-visual event localization.
arXiv Detail & Related papers (2020-08-14T21:50:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.