Extracting the Locus of Attention at a Cocktail Party from Single-Trial
EEG using a Joint CNN-LSTM Model
- URL: http://arxiv.org/abs/2102.03957v1
- Date: Mon, 8 Feb 2021 01:06:48 GMT
- Title: Extracting the Locus of Attention at a Cocktail Party from Single-Trial
EEG using a Joint CNN-LSTM Model
- Authors: Ivine Kuruvila, Jan Muncke, Eghart Fischer, Ulrich Hoppe
- Abstract summary: The human brain performs remarkably well at segregating a particular speaker from interfering speakers in a multi-speaker scenario.
We present a joint convolutional neural network (CNN) - long short-term memory (LSTM) model to infer the auditory attention.
- Score: 0.1529342790344802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The human brain performs remarkably well at segregating a particular speaker from
interfering speakers in a multi-speaker scenario. It has been recently shown
that we can quantitatively evaluate the segregation capability by modelling the
relationship between the speech signals present in an auditory scene and the
cortical signals of the listener measured using electroencephalography (EEG).
This has opened up avenues to integrate neuro-feedback into hearing aids
whereby the device can infer the user's attention and enhance the attended speaker.
Commonly used algorithms to infer auditory attention are based on linear
systems theory, where speech cues such as envelopes are mapped onto the EEG
signals. Here, we present a joint convolutional neural network (CNN) - long
short-term memory (LSTM) model to infer the auditory attention. Our joint
CNN-LSTM model takes the EEG signals and the spectrogram of the multiple
speakers as inputs and classifies the attention to one of the speakers. We
evaluated the reliability of our neural network using three different datasets
comprising 61 subjects, where each subject undertook a dual-speaker
experiment. The three datasets analysed corresponded to speech stimuli
presented in three different languages, namely German, Danish, and Dutch. Using
the proposed joint CNN-LSTM model, we obtained a median decoding accuracy of
77.2% at a trial duration of three seconds. Furthermore, we evaluated the
amount of sparsity that our model can tolerate by means of magnitude pruning
and found that the model can tolerate up to 50% sparsity without substantial
loss of decoding accuracy.
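For context, the linear-systems baseline the abstract contrasts against is typically a "backward" stimulus-reconstruction decoder: ridge regression maps time-lagged EEG onto the attended speech envelope, and attention is assigned to the speaker whose envelope correlates best with the reconstruction. Below is a minimal NumPy sketch of that standard baseline; the lag count, regularization strength, and all shapes are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the classical linear "backward model" baseline for
# auditory attention decoding. Shapes: eeg is (T, C), envelopes are (T,).
import numpy as np

def lag_matrix(eeg, n_lags):
    """Stack time-lagged copies of the EEG (T x C) into a (T x C*n_lags) design matrix."""
    T, C = eeg.shape
    X = np.zeros((T, C * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * C:(lag + 1) * C] = eeg[:T - lag]
    return X

def train_decoder(eeg, attended_env, n_lags=32, ridge=1e3):
    """Fit ridge-regression weights so lag_matrix(eeg) @ w approximates the attended envelope."""
    X = lag_matrix(eeg, n_lags)
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ attended_env)

def classify_attention(eeg, env1, env2, w, n_lags=32):
    """Return 0 if speaker 1's envelope better matches the reconstruction, else 1."""
    recon = lag_matrix(eeg, n_lags) @ w
    corrs = [np.corrcoef(recon, env)[0, 1] for env in (env1, env2)]
    return int(np.argmax(corrs))
```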
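The joint CNN-LSTM classifier itself is only described at interface level in the abstract: EEG plus the competing speakers' spectrograms in, a two-way attention decision out. The following PyTorch sketch shows one plausible realization of that interface; the layer sizes, kernel widths, shared spectrogram encoder, and last-timestep readout are assumptions on my part, not the authors' exact architecture.

```python
# Hedged sketch of a joint CNN-LSTM auditory attention classifier:
# per-stream 1-D CNN encoders, temporal fusion, LSTM summary, 2-way head.
import torch
import torch.nn as nn

class JointCNNLSTM(nn.Module):
    def __init__(self, eeg_channels=64, spec_bins=128, hidden=64):
        super().__init__()
        self.eeg_cnn = nn.Sequential(
            nn.Conv1d(eeg_channels, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
        )
        # One spectrogram encoder shared by both speakers (an assumption).
        self.spec_cnn = nn.Sequential(
            nn.Conv1d(spec_bins, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(3 * hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # attended speaker: 0 or 1

    def forward(self, eeg, spec1, spec2):
        # eeg: (B, eeg_channels, T); spec1/spec2: (B, spec_bins, T)
        feats = torch.cat(
            [self.eeg_cnn(eeg), self.spec_cnn(spec1), self.spec_cnn(spec2)],
            dim=1,
        )                                          # (B, 3*hidden, T)
        out, _ = self.lstm(feats.transpose(1, 2))  # (B, T, hidden)
        return self.head(out[:, -1])               # logits from last step

if __name__ == "__main__":
    model = JointCNNLSTM()
    logits = model(torch.randn(8, 64, 192),    # EEG: batch, channels, time
                   torch.randn(8, 128, 192),   # speaker 1 spectrogram
                   torch.randn(8, 128, 192))   # speaker 2 spectrogram
    print(logits.shape)                        # torch.Size([8, 2])
```

A three-second trial would fix the time axis T at whatever 3 s corresponds to after resampling; the 192 frames above are purely illustrative.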
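Finally, the sparsity experiment uses magnitude pruning: the smallest-magnitude weights are zeroed and decoding accuracy is re-measured, with the paper reporting that up to 50% sparsity is tolerated. Below is a sketch using PyTorch's built-in pruning utilities, reusing the JointCNNLSTM sketch above; restricting pruning to the convolutional and linear layers (leaving the LSTM untouched) is a simplification of mine, not a detail from the paper.

```python
# Global magnitude pruning to 50% sparsity with torch.nn.utils.prune.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = JointCNNLSTM()  # the sketch above; in practice, a trained model

# Collect (module, parameter-name) pairs to prune; LSTM weights skipped
# here for simplicity.
to_prune = [(m, "weight") for m in model.modules()
            if isinstance(m, (nn.Conv1d, nn.Linear))]

# Zero the globally smallest 50% of weights by absolute magnitude.
prune.global_unstructured(to_prune,
                          pruning_method=prune.L1Unstructured,
                          amount=0.5)

# Bake the pruning masks into the weight tensors, then re-evaluate
# decoding accuracy on held-out trials.
for module, name in to_prune:
    prune.remove(module, name)
```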
Related papers
- NeuroSpex: Neuro-Guided Speaker Extraction with Cross-Modal Attention [47.8479647938849]
We present a neuro-guided speaker extraction model, i.e. NeuroSpex, using the EEG response of the listener as the sole auxiliary reference cue.
We propose a novel EEG signal encoder that captures attentional information. Additionally, we propose a cross-attention (CA) mechanism to enhance the speech feature representations.
arXiv Detail & Related papers (2024-09-04T07:33:01Z)
- Corticomorphic Hybrid CNN-SNN Architecture for EEG-based Low-footprint Low-latency Auditory Attention Detection [8.549433398954738]
In a multi-speaker "cocktail party" scenario, a listener can selectively attend to a speaker of interest.
Current trends in EEG-based auditory attention detection using artificial neural networks (ANN) are not practical for edge-computing platforms.
We propose a hybrid convolutional neural network-spiking neural network (CNN-SNN) architecture, inspired by the auditory cortex.
arXiv Detail & Related papers (2023-07-13T20:33:39Z)
- BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions [36.15815562576836]
Time-domain single-channel speech enhancement (SE) remains challenging when extracting a target speaker without prior information under multi-talker conditions.
We propose a novel time-domain brain-assisted SE network (BASEN) incorporating electroencephalography (EEG) signals recorded from the listener for extracting the target speaker from monaural speech mixtures.
arXiv Detail & Related papers (2023-05-17T06:40:31Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggested that incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals [5.743287315640403]
We train a feed-forward deep neural network to estimate articulatory trajectories of six tract variables.
The model achieved a correlation of 0.675 with ground-truth tract variables.
arXiv Detail & Related papers (2022-03-11T07:27:42Z)
- Correlation based Multi-phasal models for improved imagined speech EEG recognition [22.196642357767338]
This work aims to profit from the parallel information contained in multi-phasal EEG data recorded while speaking, imagining and performing articulatory movements corresponding to specific speech units.
A bi-phase common representation learning module using neural networks is designed to model the correlation between an analysis phase and a support phase.
The proposed approach further handles the non-availability of multi-phasal data during decoding.
arXiv Detail & Related papers (2020-11-04T09:39:53Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.