Multi-Modal Multi-Correlation Learning for Audio-Visual Speech
Separation
- URL: http://arxiv.org/abs/2207.01197v1
- Date: Mon, 4 Jul 2022 04:53:39 GMT
- Title: Multi-Modal Multi-Correlation Learning for Audio-Visual Speech
Separation
- Authors: Xiaoyu Wang, Xiangyu Kong, Xiulian Peng, Yan Lu
- Abstract summary: We propose a multi-modal multi-correlation learning framework targeting the task of audio-visual speech separation.
We define two key correlations: (1) identity correlation (between timbre and facial attributes); (2) phonetic correlation (between phoneme and lip motion).
For implementation, either a contrastive learning or an adversarial training approach is applied to maximize these two correlations.
- Score: 38.75352529988137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we propose a multi-modal multi-correlation learning framework
targeting the task of audio-visual speech separation. Although extensive previous
effort has gone into combining audio and visual modalities, most approaches simply
adopt a straightforward concatenation of audio and visual features. To exploit the
truly useful information behind these two modalities, we define two key correlations:
(1) identity correlation (between timbre and facial attributes); (2) phonetic
correlation (between phoneme and lip motion). Together, these two correlations
comprise the complete information and prove advantageous in separating the target
speaker's voice, especially in hard cases such as same-gender or similar-content
mixtures. For implementation, either a contrastive learning or an adversarial
training approach is applied to maximize these two correlations. Both work well,
while adversarial training shows its advantage by avoiding some limitations of
contrastive learning. Compared with previous research, our solution demonstrates
clear improvement on experimental metrics without additional complexity. Further
analysis confirms the validity of the proposed architecture and its good potential
for future extension.
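
The framework description above is prose only. As a minimal sketch, assuming hypothetical PyTorch encoders that produce timbre, face, phoneme, and lip-motion embeddings and an InfoNCE-style contrastive loss (none of which are taken from the authors' code), the two correlations could be maximized as follows:

```python
# Minimal sketch (not the authors' implementation): maximize the two
# cross-modal correlations with an InfoNCE-style contrastive loss.
# Encoder outputs, dimensions, and loss weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Symmetric InfoNCE over a batch: matched (anchor_i, positive_i) pairs
    are pulled together, mismatched in-batch pairs are pushed apart."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (B, B) similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_correlation_loss(timbre, face, phoneme, lip, w_id=1.0, w_ph=1.0):
    """(1) identity correlation: timbre  <-> facial attributes
       (2) phonetic correlation: phoneme <-> lip motion
       All inputs are (B, D) embeddings from modality-specific encoders."""
    return w_id * info_nce(timbre, face) + w_ph * info_nce(phoneme, lip)

# Toy usage with stand-in embeddings (real ones would come from the audio and
# visual encoders of the separation network):
B, D = 8, 256
emb = lambda: torch.randn(B, D, requires_grad=True)
loss = multi_correlation_loss(emb(), emb(), emb(), emb())
loss.backward()
```

The adversarial alternative mentioned in the abstract would replace the InfoNCE terms with a discriminator trained to distinguish matched from mismatched cross-modal pairs, which is one common way to avoid contrastive learning's reliance on many in-batch negatives.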
Related papers
- Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy? [12.662031101992968]
We investigate the effects of multiple modalities on recognition accuracy on both synthetic and real-world datasets.
Images as a supplementary modality for speech recognition provide the greatest benefit at moderate noise levels.
Performance improves on both synthetic and real-world datasets when the most relevant visual information is filtered as a preprocessing step.
arXiv Detail & Related papers (2024-09-13T22:18:45Z)
- Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization [25.213694510527436]
Most existing speaker diarization systems rely exclusively on unimodal acoustic information.
We propose a novel multimodal approach that jointly utilizes audio, visual, and semantic cues to enhance speaker diarization.
Our approach consistently outperforms state-of-the-art speaker diarization methods.
arXiv Detail & Related papers (2024-08-22T03:34:03Z)
- Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder [22.836016610542387]
This paper introduces a novel framework within an unsupervised setting for learning voice-face associations.
By employing a multimodal encoder after contrastive learning and addressing the problem through binary classification, we can learn the implicit information within the embeddings in a more effective and varied manner.
Empirical evidence demonstrates that our framework achieves state-of-the-art results in voice-face matching, verification, and retrieval tasks.
arXiv Detail & Related papers (2024-04-15T07:05:14Z)
- Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training [102.18680666349806]
We propose a speed co-augmentation method that randomly changes the playback speeds of both audio and video data; a minimal sketch of this idea appears after this list.
Experimental results show that the proposed method significantly improves the learned representations when compared to vanilla audio-visual contrastive learning.
arXiv Detail & Related papers (2023-09-25T08:22:30Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- Analysis of Joint Speech-Text Embeddings for Semantic Matching [3.6423306784901235]
We study a joint speech-text embedding space trained for semantic matching by minimizing the distance between paired utterance and transcription inputs.
We extend our method to incorporate automatic speech recognition through both pretraining and multitask scenarios.
arXiv Detail & Related papers (2022-04-04T04:50:32Z)
- Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement [18.488808141923492]
ADENet is proposed to achieve target speaker detection and speech enhancement through joint audio-visual modeling.
The cross-modal relationship between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning.
arXiv Detail & Related papers (2022-03-04T09:53:19Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal, understand its linguistic content, and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Mutual Information Maximization for Effective Lip Reading [99.11600901751673]
We propose to introduce mutual information constraints at both the local feature level and the global sequence level.
By combining these two advantages, the proposed method is expected to be both discriminative and robust for effective lip reading.
arXiv Detail & Related papers (2020-03-13T18:47:42Z)
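
For the speed co-augmentation entry above, the summary only states that the playback speeds of audio and video are changed randomly and jointly. The sketch below illustrates one plain-NumPy reading of that idea; the speed range, interpolation scheme, and clip shapes are assumptions, not the cited paper's settings.

```python
# Minimal sketch (assumptions, not the cited paper's code) of speed
# co-augmentation: draw one random speed factor and apply it to both the
# audio waveform and the video frame sequence of the same clip.
import numpy as np

def speed_co_augment(audio, video, low=0.5, high=1.5, rng=None):
    """audio: (num_samples,) waveform; video: (num_frames, H, W, C) frames.
    Returns both streams resampled to the same random playback speed."""
    rng = rng or np.random.default_rng()
    speed = rng.uniform(low, high)

    # Audio: linear interpolation onto a time grid stretched by 1/speed.
    n_out = max(1, int(round(len(audio) / speed)))
    positions = np.linspace(0, len(audio) - 1, n_out)
    audio_aug = np.interp(positions, np.arange(len(audio)), audio)

    # Video: pick the nearest source frame for each output frame.
    f_out = max(1, int(round(len(video) / speed)))
    idx = np.round(np.linspace(0, len(video) - 1, f_out)).astype(int)
    video_aug = video[idx]
    return audio_aug, video_aug, speed

# Example: a 1-second clip with 16 kHz audio and 25 fps frames.
audio = np.random.randn(16000)
video = np.random.rand(25, 96, 96, 3)
a_aug, v_aug, s = speed_co_augment(audio, video)
```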