Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement
- URL: http://arxiv.org/abs/2203.02216v1
- Date: Fri, 4 Mar 2022 09:53:19 GMT
- Title: Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement
- Authors: Junwen Xiong, Yu Zhou, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha
- Abstract summary: ADENet is proposed to achieve target speaker detection and speech enhancement through joint learning of audio-visual modeling.
Establishing a cross-modal relationship between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning.
- Score: 18.488808141923492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Because of their differing characteristics, independently designed architectures have been widely used, one per task. This tends to make the learned feature representations task-specific and inevitably limits the generalization ability of features based on multi-modal modeling. More recent studies have shown that establishing a cross-modal relationship between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Motivated by this, and to bridge the two tasks through multi-modal cross-attention, this work proposes a unified framework, ADENet, that achieves target speaker detection and speech enhancement with joint learning of audio-visual modeling.
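The abstract does not spell out the cross-attention mechanism, so the following is only a minimal sketch of a generic audio-visual cross-attention block of the kind such a framework might use. It is not the authors' ADENet: the feature dimensions, the two-way attention layout, and the concatenation-based fusion are illustrative assumptions.

```python
# Hedged sketch of a generic cross-modal attention block (not ADENet itself).
# All dimensions and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Audio features query the visual stream, and vice versa.
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_audio, d_model), visual: (batch, T_video, d_model)
        a_attended, _ = self.a2v(query=audio, key=visual, value=visual)
        v_attended, _ = self.v2a(query=visual, key=audio, value=audio)
        # Pool over time and fuse into one joint representation that a
        # detection head and an enhancement head could share.
        joint = torch.cat([a_attended.mean(1), v_attended.mean(1)], dim=-1)
        return self.fuse(joint)


# Example: 100 audio frames attended against 25 video frames.
fused = CrossModalAttention()(torch.randn(2, 100, 256), torch.randn(2, 25, 256))
print(fused.shape)  # torch.Size([2, 256])
```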
Related papers
- Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization [25.213694510527436]
Most existing speaker diarization systems rely exclusively on unimodal acoustic information.
We propose a novel multimodal approach that jointly utilizes audio, visual, and semantic cues to enhance speaker diarization.
Our approach consistently outperforms state-of-the-art speaker diarization methods.
arXiv Detail & Related papers (2024-08-22T03:34:03Z)
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification with AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
- Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation [38.75352529988137]
We propose a multi-modal multi-correlation learning framework targeting at the task of audio-visual speech separation.
We define two key correlations which are: (1) identity correlation (between timbre and facial attributes); (2) phonetic correlation.
For implementation, a contrastive learning or adversarial training approach is applied to maximize these two correlations.
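As a loose illustration of how such a correlation can be maximized contrastively, here is a minimal InfoNCE-style sketch over paired audio and visual embeddings; the temperature, normalization, and symmetric formulation are generic assumptions, not details taken from the paper.

```python
# Hedged sketch of an InfoNCE-style contrastive objective between paired
# audio and visual embeddings; hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def audio_visual_info_nce(audio_emb: torch.Tensor,
                          visual_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # audio_emb, visual_emb: (batch, dim); row i of each comes from the same clip.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric loss: match audio -> visual and visual -> audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


loss = audio_visual_info_nce(torch.randn(8, 128), torch.randn(8, 128))
```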
arXiv Detail & Related papers (2022-07-04T04:53:39Z)
- CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations [20.239063010740853]
We present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language.
We observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification.
arXiv Detail & Related papers (2021-09-01T04:18:19Z)
- MAAS: Multi-modal Assignation for Active Speaker Detection [59.08836580733918]
We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem.
Our experiments show that a small graph data structure built from a single frame allows us to approximate an instantaneous audio-visual assignment problem.
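To make the single-frame assignment idea concrete, here is a toy sketch of assigning the active speaker on a tiny audio-visual graph; the node features, cosine-similarity edge weights, and greedy argmax assignment are illustrative assumptions, not the MAAS method.

```python
# Toy sketch: one audio node connected to one node per detected face; the
# active speaker is the face with the strongest edge to the audio node.
import torch


def assign_active_speaker(audio_feat: torch.Tensor,
                          face_feats: list) -> int:
    # audio_feat: (dim,), face_feats: list of (dim,) tensors for one frame.
    edge_weights = torch.stack([
        torch.cosine_similarity(audio_feat, f, dim=0) for f in face_feats
    ])
    # Greedy assignment over the small graph built from this single frame.
    return int(edge_weights.argmax())


speaker_idx = assign_active_speaker(torch.randn(128),
                                    [torch.randn(128) for _ in range(3)])
```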
arXiv Detail & Related papers (2021-01-11T02:57:25Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audio-visual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
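A minimal sketch of such a two-stream layout, with a shared low-level encoder feeding separate identity and content heads, is given below; the layer sizes and head designs are assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch: shared low-level features with two heads that give the model
# an explicit place to disentangle "who is speaking" from "what is said".
import torch
import torch.nn as nn


class TwoStreamSpeechEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256, emb: int = 128):
        super().__init__()
        # Low-level features shared by both target representations.
        self.shared = nn.Sequential(nn.Conv1d(n_mels, hidden, 5, padding=2),
                                    nn.ReLU(),
                                    nn.Conv1d(hidden, hidden, 5, padding=2),
                                    nn.ReLU())
        self.identity_head = nn.Linear(hidden, emb)  # speaker identity factor
        self.content_head = nn.Linear(hidden, emb)   # linguistic content factor

    def forward(self, mel: torch.Tensor):
        # mel: (batch, n_mels, time)
        h = self.shared(mel)
        identity = self.identity_head(h.mean(dim=-1))   # pooled over time
        content = self.content_head(h.transpose(1, 2))  # kept per frame
        return identity, content


ident, cont = TwoStreamSpeechEncoder()(torch.randn(2, 80, 200))
```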
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
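The following is only a rough sketch of the self-adaptation idea, pooling a speaker-aware embedding directly from the noisy test utterance and feeding it back as a conditioning signal for mask-based enhancement; the GRU pooling, the gain-style conditioning, and all sizes are assumptions, not the paper's model.

```python
# Hedged sketch: a speaker embedding derived from the test utterance itself
# conditions a simple mask-based enhancement network.
import torch
import torch.nn as nn


class SelfAdaptiveEnhancer(nn.Module):
    def __init__(self, n_feats: int = 257, emb: int = 128):
        super().__init__()
        self.spk_encoder = nn.GRU(n_feats, emb, batch_first=True)
        self.scale = nn.Linear(emb, n_feats)  # speaker-dependent gain
        self.mask_net = nn.Sequential(nn.Linear(n_feats, 512), nn.ReLU(),
                                      nn.Linear(512, n_feats), nn.Sigmoid())

    def forward(self, noisy_spec: torch.Tensor) -> torch.Tensor:
        # noisy_spec: (batch, time, n_feats) magnitude spectrogram
        _, h = self.spk_encoder(noisy_spec)             # (1, batch, emb)
        spk = h[-1]                                     # speaker-aware summary
        conditioned = noisy_spec * torch.sigmoid(self.scale(spk)).unsqueeze(1)
        return self.mask_net(conditioned) * noisy_spec  # masked enhancement


enhanced = SelfAdaptiveEnhancer()(torch.randn(2, 100, 257).abs())
```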
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
- Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.