WASE: Learning When to Attend for Speaker Extraction in Cocktail Party
Environments
- URL: http://arxiv.org/abs/2106.07016v1
- Date: Sun, 13 Jun 2021 14:56:05 GMT
- Title: WASE: Learning When to Attend for Speaker Extraction in Cocktail Party
Environments
- Authors: Yunzhe Hao, Jiaming Xu, Peng Zhang, Bo Xu
- Abstract summary: In the speaker extraction problem, additional information from the target speaker has been found to aid the tracking and extraction of that speaker.
Inspired by the sound onset cue, we explicitly model it and verify its effectiveness in the speaker extraction task.
From the perspective of tasks, our onset/offset-based model completes a composite task: a complementary combination of speaker extraction and speaker-dependent voice activity detection.
- Score: 21.4128321045702
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the speaker extraction problem, additional information about the
target speaker, such as the voiceprint, lip movement, facial expression, and
spatial information, has been found to aid the tracking and extraction of that
speaker. However, the sound onset cue, which has long been emphasized in
auditory scene analysis and psychology, has received little attention. Inspired
by this, we explicitly model the onset cue and verify its effectiveness in the
speaker extraction task. We further extend the model to onset/offset cues and
obtain an additional performance improvement. From the perspective of tasks,
our onset/offset-based model completes a composite task: a complementary
combination of speaker extraction and speaker-dependent voice activity
detection. We also combine the voiceprint with the onset/offset cues. The
voiceprint models the voice characteristics of the target speaker, while the
onset/offset cues model the start/end of the target speech. From the
perspective of auditory scene analysis, combining the two perception cues
promotes the integrity of the auditory object. The experimental results are
close to state-of-the-art performance while using nearly half the parameters.
We hope this work will inspire the speech processing and psychology communities
and contribute to communication between them. Our code will be available at
https://github.com/aispeech-lab/wase/.
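To make the idea concrete, below is a minimal PyTorch sketch of how a voiceprint embedding and a frame-wise onset/offset (speaker-dependent VAD) cue might jointly condition a mask-based extractor, so that the network only attends while the target speaker is active. This is not the authors' WASE architecture: the class name, layer choices, and sizes are illustrative assumptions, and the repository linked above should be treated as the reference implementation.

```python
# Hedged sketch of cue-conditioned speaker extraction (illustrative, not the WASE code).
import torch
import torch.nn as nn

class CueConditionedExtractor(nn.Module):
    def __init__(self, n_feat=256, spk_dim=128):
        super().__init__()
        # Encode the mixture waveform into a frame-level feature map.
        self.mix_enc = nn.Conv1d(1, n_feat, kernel_size=16, stride=8)
        # Project the voiceprint embedding into the same feature space.
        self.spk_proj = nn.Linear(spk_dim, n_feat)
        # Frame-wise target-activity logits; their rising/falling edges play the
        # role of the onset/offset cue (speaker-dependent VAD head).
        self.vad_head = nn.Conv1d(n_feat, 1, kernel_size=1)
        # Estimate a [0, 1] mask over the encoded mixture.
        self.mask_net = nn.Sequential(
            nn.Conv1d(n_feat, n_feat, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(n_feat, n_feat, kernel_size=1),
            nn.Sigmoid(),
        )
        self.dec = nn.ConvTranspose1d(n_feat, 1, kernel_size=16, stride=8)

    def forward(self, mixture, voiceprint):
        # mixture: (B, 1, T) waveform; voiceprint: (B, spk_dim) target-speaker embedding.
        feat = self.mix_enc(mixture)                           # (B, n_feat, T')
        feat = feat + self.spk_proj(voiceprint).unsqueeze(-1)  # inject the voiceprint cue
        vad_logits = self.vad_head(feat)                       # (B, 1, T')
        gate = torch.sigmoid(vad_logits)                       # soft "when to attend" gate
        mask = self.mask_net(feat) * gate                      # attend only while the target speaks
        return self.dec(mask * feat), vad_logits               # extracted waveform + VAD-style output

model = CueConditionedExtractor()
wave, vad = model(torch.randn(2, 1, 16000), torch.randn(2, 128))  # toy 1 s, 16 kHz batch
print(wave.shape, vad.shape)  # torch.Size([2, 1, 16000]) torch.Size([2, 1, 1999])
```

Gating the mask with the VAD output reflects the composite task described in the abstract: the same network both extracts the target speech and predicts when the target speaker is speaking.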
Related papers
- Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments [34.67934887761352]
Previous research has explored extracting the target speaker's characteristics from noisy audio examples.
In this work, we focus on target speaker extraction when multiple speakers are present during the enrollment stage.
Experiments show the effectiveness of our model architecture and the dedicated pretraining method for the proposed task.
arXiv Detail & Related papers (2025-02-23T15:33:44Z)
- Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction [13.5641621193917]
In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance.
Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production.
We introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements.
arXiv Detail & Related papers (2024-04-19T09:08:44Z)
- Audio-video fusion strategies for active speaker detection in meetings [5.61861182374067]
We propose two types of fusion for the detection of the active speaker, combining two visual modalities and an audio modality through neural networks.
For our application context, adding motion information greatly improves performance.
We have shown that attention-based fusion improves performance while reducing the standard deviation.
arXiv Detail & Related papers (2022-06-09T08:20:52Z)
- Speaker Extraction with Co-Speech Gestures Cue [79.91394239104908]
We explore the use of the co-speech gesture sequence, e.g. hand and body movements, as the speaker cue for speaker extraction.
We propose two networks that use the co-speech gesture cue to perform attentive listening to the target speaker.
The experimental results show that the co-speech gesture cue is informative for associating the target speaker, and the quality of the extracted speech shows significant improvements over the unprocessed mixture speech.
arXiv Detail & Related papers (2022-03-31T06:48:52Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds [118.54908665440826]
Humans can robustly recognize and localize objects by using visual and/or auditory cues.
This work develops an approach for scene understanding purely based on sounds.
The co-existence of visual and audio cues is leveraged for supervision transfer.
arXiv Detail & Related papers (2021-09-06T22:24:00Z)
- Multimodal Attention Fusion for Target Speaker Extraction [108.73502348754842]
We propose a novel attention mechanism for multi-modal fusion and its training methods.
Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data.
arXiv Detail & Related papers (2021-02-02T05:59:35Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds [106.87299276189458]
Humans can robustly recognize and localize objects by integrating visual and auditory cues.
This work develops an approach for dense semantic labelling of sound-making objects, purely based on sounds.
arXiv Detail & Related papers (2020-03-09T15:49:01Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)