BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with
Convolutional Cross Attention in Multi-talker Conditions
- URL: http://arxiv.org/abs/2305.09994v1
- Date: Wed, 17 May 2023 06:40:31 GMT
- Title: BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with
Convolutional Cross Attention in Multi-talker Conditions
- Authors: Jie Zhang, Qing-Tian Xu, Qiu-Shi Zhu, Zhen-Hua Ling
- Abstract summary: Time-domain single-channel speech enhancement (SE) remains challenging when extracting the target speaker without prior information under multi-talker conditions.
We propose a novel time-domain brain-assisted SE network (BASEN) incorporating electroencephalography (EEG) signals recorded from the listener for extracting the target speaker from monaural speech mixtures.
- Score: 36.15815562576836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Time-domain single-channel speech enhancement (SE) remains
challenging when extracting the target speaker without any prior information
under multi-talker conditions. It has been shown via auditory attention
decoding that the brain
activity of the listener contains the auditory information of the attended
speaker. In this paper, we thus propose a novel time-domain brain-assisted SE
network (BASEN) incorporating electroencephalography (EEG) signals recorded
from the listener for extracting the target speaker from monaural speech
mixtures. The proposed BASEN is based on the fully-convolutional time-domain
audio separation network (Conv-TasNet). To fully leverage the complementary
information contained in the EEG signals, we further propose a convolutional
multi-layer cross attention module to fuse the dual-branch features.
Experimental results on a public dataset show that the proposed model
outperforms the state-of-the-art method in several evaluation metrics. The
reproducible code is available at https://github.com/jzhangU/Basen.git.
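The abstract describes the fusion module only at a high level. As a rough illustration, a convolutional multi-layer cross-attention block fusing the audio and EEG branches could look like the PyTorch sketch below; the layer sizes, residual wiring, and module names are assumptions for this sketch, not the authors' implementation (the real one is in the linked repository).

```python
import torch
import torch.nn as nn

class ConvCrossAttention(nn.Module):
    """Cross-attention with 1x1 convolutions as the Q/K/V projections.
    Queries come from one branch, keys/values from the other.
    All dimensions here are illustrative, not taken from the paper."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.q_conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.k_conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.v_conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.scale = channels ** -0.5

    def forward(self, x_q: torch.Tensor, x_kv: torch.Tensor) -> torch.Tensor:
        # x_q, x_kv: (batch, channels, time)
        q = self.q_conv(x_q).transpose(1, 2)   # (B, T, C)
        k = self.k_conv(x_kv).transpose(1, 2)  # (B, T', C)
        v = self.v_conv(x_kv).transpose(1, 2)  # (B, T', C)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return (attn @ v).transpose(1, 2)      # back to (B, C, T)

class DualBranchFusion(nn.Module):
    """Stack cross-attention in both directions over several layers, then
    merge the two branches; assumes both share the same time resolution."""

    def __init__(self, channels: int = 64, num_layers: int = 3):
        super().__init__()
        self.a2e = nn.ModuleList([ConvCrossAttention(channels) for _ in range(num_layers)])
        self.e2a = nn.ModuleList([ConvCrossAttention(channels) for _ in range(num_layers)])
        self.out = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, audio: torch.Tensor, eeg: torch.Tensor) -> torch.Tensor:
        for a2e, e2a in zip(self.a2e, self.e2a):
            audio, eeg = audio + a2e(audio, eeg), eeg + e2a(eeg, audio)
        return self.out(torch.cat([audio, eeg], dim=1))

# Toy usage: two utterances, 64-channel features, 800 frames per branch.
fusion = DualBranchFusion()
fused = fusion(torch.randn(2, 64, 800), torch.randn(2, 64, 800))
print(fused.shape)  # torch.Size([2, 64, 800])
```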
Related papers
- Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as audio-visual segmentation (AVS) has emerged, which aims to produce segmentation maps for sounding objects within a scene.
We introduce a novel Progressive Confident Masking Attention Network (PMCANet).
It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z)
- Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework categorizes the representations into a small number of phoneme-like units and is used to train the model to learn semantically rich speech representations.
arXiv Detail & Related papers (2023-07-14T13:02:10Z)
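To make the hidden unit clustering idea above concrete: frame embeddings from a 1-D convolutional front-end are clustered into a small codebook of phoneme-like units, and the cluster indices serve as pseudo-labels for training. The sketch below uses scikit-learn k-means and invented dimensions; the actual HUC training loop is more involved.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class ConvFrontend(nn.Module):
    """Windowed raw audio -> frame embeddings via 1-D convolutions
    (dimensions are illustrative, not the paper's)."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=25, stride=10), nn.ReLU(),
            nn.Conv1d(64, embed_dim, kernel_size=5, stride=2), nn.ReLU(),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> (batch, frames, embed_dim)
        return self.net(wav.unsqueeze(1)).transpose(1, 2)

frontend = ConvFrontend()
wav = torch.randn(4, 16000)  # four 1-second clips at 16 kHz
with torch.no_grad():
    frames = frontend(wav).reshape(-1, 128).numpy()

# Cluster frame embeddings into a small set of phoneme-like units; the
# cluster indices become pseudo-labels for training the encoder.
units = KMeans(n_clusters=50, n_init=10).fit_predict(frames)
print(units[:20])
```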
- Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features [0.0]
We propose a new set of spatial features based on direction-of-arrival estimations in the circular harmonic domain (CH-DOA).
Experiments on the AMI meeting corpus show that CH-DOA can improve the segmentation while being robust in the case of deactivated microphones.
arXiv Detail & Related papers (2023-06-07T09:09:00Z)
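The circular harmonic domain mentioned above amounts to a spatial Fourier series over the microphone angles of a circular array. A rough NumPy sketch of the decomposition follows; the uniform array geometry, order range, and toy plane-wave model are assumptions, and the paper's CH-DOA features involve further DOA-estimation steps on top of such coefficients.

```python
import numpy as np

def circular_harmonics(frame_stft: np.ndarray, max_order: int = 2) -> np.ndarray:
    """Decompose one STFT frequency bin from a uniform circular array into
    circular harmonic coefficients of orders -max_order..max_order.

    frame_stft: (num_mics,) complex spectrum values at one frequency bin.
    The uniform spacing of the microphones is an assumption of this sketch.
    """
    q = len(frame_stft)
    angles = 2 * np.pi * np.arange(q) / q  # microphone azimuths
    orders = np.arange(-max_order, max_order + 1)
    # Spatial Fourier series over the circle: b_m = (1/Q) sum_q P_q e^{-i m phi_q}
    basis = np.exp(-1j * np.outer(orders, angles))  # (orders, mics)
    return basis @ frame_stft / q

# Toy example: a plane wave arriving from 60 degrees on an 8-mic circular array.
doa = np.deg2rad(60)
mics = 2 * np.pi * np.arange(8) / 8
observation = np.exp(1j * 0.5 * np.cos(mics - doa))  # crude far-field phase model
print(np.round(circular_harmonics(observation), 3))
```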
- Talking Head Generation Driven by Speech-Related Facial Action Units and Audio-Based on Multimodal Representation Fusion [30.549120935873407]
Talking head generation synthesizes a lip-synchronized talking head video from an arbitrary face image and the corresponding audio clips.
Existing methods ignore not only the interaction and relationship of cross-modal information, but also the local driving information of the mouth muscles.
We propose a novel generative framework that contains a dilated non-causal temporal convolutional self-attention network.
arXiv Detail & Related papers (2022-04-27T08:05:24Z)
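The dilated non-causal temporal convolutional network named in this entry pads symmetrically, so every output frame sees both past and future context; stacking blocks with exponentially growing dilation widens the receptive field. A minimal PyTorch sketch with invented sizes:

```python
import torch
import torch.nn as nn

class DilatedNonCausalBlock(nn.Module):
    """One residual block of a non-causal TCN: symmetric padding lets
    every frame use both past and future context."""

    def __init__(self, channels: int = 64, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation  # symmetric, hence non-causal
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.conv(x))  # residual connection

# Stack blocks with exponentially growing dilation to widen the receptive field.
tcn = nn.Sequential(*[DilatedNonCausalBlock(dilation=2 ** i) for i in range(5)])
out = tcn(torch.randn(1, 64, 200))
print(out.shape)  # torch.Size([1, 64, 200])
```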
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
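Frequency-channel attention of the squeeze-and-excitation flavor re-weights feature channels using globally pooled statistics. The sketch below is a generic single-scale variant with assumed shapes, not the MFA paper's exact multi-scale dual-path design:

```python
import torch
import torch.nn as nn

class FreqChannelAttention(nn.Module):
    """Squeeze-and-excitation style attention over the channel axis of
    (batch, channels, freq, time) features; a generic sketch only."""

    def __init__(self, channels: int = 64, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: global stats
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)  # excite: per-channel re-weighting

att = FreqChannelAttention()
print(att(torch.randn(2, 64, 80, 100)).shape)  # torch.Size([2, 64, 80, 100])
```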
- Extracting the Locus of Attention at a Cocktail Party from Single-Trial EEG using a Joint CNN-LSTM Model [0.1529342790344802]
The human brain performs remarkably well at segregating a particular speaker from interfering speakers in a multi-speaker scenario.
We present a joint convolutional neural network (CNN) - long short-term memory (LSTM) model to infer the auditory attention.
arXiv Detail & Related papers (2021-02-08T01:06:48Z)
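The joint CNN-LSTM model above maps single-trial EEG to a decision about the attended speaker: a convolution extracts spatio-temporal EEG patterns, an LSTM integrates them over time, and a linear head classifies. The two-class setup and all shapes below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class CnnLstmAAD(nn.Module):
    """Joint CNN-LSTM auditory attention decoder: a convolution extracts
    spatio-temporal EEG patterns, an LSTM integrates them over time, and
    a linear head picks the attended speaker. Shapes are illustrative."""

    def __init__(self, eeg_channels: int = 64, hidden: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(eeg_channels, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # speaker 1 vs speaker 2

    def forward(self, eeg: torch.Tensor) -> torch.Tensor:
        # eeg: (batch, channels, time)
        h = self.cnn(eeg).transpose(1, 2)  # (batch, time, hidden)
        _, (h_n, _) = self.lstm(h)         # last hidden state
        return self.head(h_n[-1])          # (batch, 2) logits

model = CnnLstmAAD()
logits = model(torch.randn(8, 64, 256))  # 8 EEG trials
print(logits.argmax(dim=1))  # predicted attended speaker per trial
```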
- Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)
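An online speech distortion module of the kind PASE+ describes corrupts each training waveform on the fly with randomly chosen disturbances, while the learning targets are computed from the clean signal. A toy version with only additive noise and random gain (PASE+ draws from a richer set, including reverberation and overlap):

```python
import torch

def distort(wav: torch.Tensor) -> torch.Tensor:
    """Toy online distortion: randomly apply additive noise and/or a random
    gain to a batch of waveforms. PASE+ uses a much richer disturbance set
    (reverb, overlap, band-drop, clipping, ...)."""
    out = wav.clone()
    if torch.rand(1).item() < 0.5:
        out = out + 0.05 * torch.randn_like(out)  # additive noise
    if torch.rand(1).item() < 0.5:
        out = out * (0.5 + torch.rand(1).item())  # random gain in [0.5, 1.5)
    return out

clean = torch.randn(4, 16000)  # batch of 1-second waveforms at 16 kHz
noisy = distort(clean)         # contaminated copies; targets stay clean
print((noisy - clean).abs().mean())
```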
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.