A contrastive-learning approach for auditory attention detection
- URL: http://arxiv.org/abs/2410.18395v1
- Date: Thu, 24 Oct 2024 03:13:53 GMT
- Title: A contrastive-learning approach for auditory attention detection
- Authors: Seyed Ali Alavi Bajestan, Mark Pitt, Donald S. Williamson
- Abstract summary: We propose a method based on self-supervised learning to minimize the difference between the latent representations of an attended speech signal and the corresponding EEG signal.
We compare our results with previously published methods and achieve state-of-the-art performance on the validation set.
- Score: 11.28441753596964
- Abstract: Carrying on conversations in multi-sound environments is challenging, since the sounds overlap across time and frequency, making it difficult to understand a single sound source. One proposed approach to help isolate an attended speech source is to decode the electroencephalogram (EEG) and identify the attended audio source using statistical or machine learning techniques. However, the limited amount of data in comparison to other machine learning problems and the distributional shift between different EEG recordings emphasize the need for a self-supervised approach that works with limited data to achieve a more robust solution. In this paper, we propose a method based on self-supervised learning to minimize the difference between the latent representations of an attended speech signal and the corresponding EEG signal. This network is further fine-tuned for the auditory attention classification task. We compare our results with previously published methods and achieve state-of-the-art performance on the validation set.
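The abstract describes a two-stage recipe: contrastive pretraining that aligns speech and EEG latents, followed by fine-tuning for attention classification. Below is a minimal PyTorch sketch of that idea, assuming an InfoNCE-style loss over in-batch negatives; the encoder architecture, latent dimension, temperature, and loss formulation are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    """Maps a (batch, channels, time) signal to a unit-norm latent vector."""
    def __init__(self, in_channels: int, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=9, stride=2, padding=4),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=9, stride=2, padding=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over the time axis
        )
        self.proj = nn.Linear(128, latent_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.net(x).squeeze(-1)               # (batch, 128)
        return F.normalize(self.proj(z), dim=-1)  # unit-norm latents

def contrastive_loss(z_audio, z_eeg, temperature: float = 0.07):
    """InfoNCE over a batch: the i-th EEG segment should match the i-th
    attended-speech segment; all other pairings serve as negatives."""
    logits = z_audio @ z_eeg.t() / temperature    # (batch, batch) similarities
    targets = torch.arange(z_audio.size(0), device=z_audio.device)
    # symmetric: audio-to-EEG and EEG-to-audio retrieval
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Stage 1: contrastive pretraining (shapes illustrative: mono audio, 64-ch EEG)
audio_enc, eeg_enc = SegmentEncoder(1), SegmentEncoder(64)
speech = torch.randn(8, 1, 16000)   # batch of attended-speech segments
eeg = torch.randn(8, 64, 1024)      # time-aligned EEG segments
loss = contrastive_loss(audio_enc(speech), eeg_enc(eeg))
loss.backward()

# Stage 2: fine-tune the pretrained EEG encoder with a small head for the
# auditory-attention classification task (e.g., which of two speakers).
attention_classifier = nn.Sequential(eeg_enc, nn.Linear(128, 2))
attn_logits = attention_classifier(eeg)           # (batch, 2)
```

With in-batch negatives, every mismatched speech/EEG pairing in the batch acts as a negative example, so no explicit negative mining is needed; larger batches simply pose a harder retrieval problem.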
Related papers
- Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as Audio-Visual Segmentation (AVS) has emerged, intending to produce segmentation maps for sounding objects within a scene.
We introduce a novel Progressive Confident Masking Attention Network (PMCANet).
It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z)
- Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning [2.3076690318595676]
This paper presents a computationally efficient and distributed speaker diarization framework for networked IoT-style audio devices.
A Federated Learning model can identify the participants in a conversation without requiring a large audio database for training.
An unsupervised online update mechanism is proposed for the Federated Learning model which depends on cosine similarity of speaker embeddings.
arXiv Detail & Related papers (2024-04-16T18:40:28Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection [53.063161380423715]
Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types.
We propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection.
arXiv Detail & Related papers (2023-12-15T09:52:17Z)
- Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing [107.031903351176]
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances and to identify the corresponding event categories using only video-level category labels for training.
arXiv Detail & Related papers (2023-07-05T05:55:10Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Noise-Tolerant Learning for Audio-Visual Action Recognition [31.641972732424463]
Video datasets are usually coarsely annotated or collected from the Internet, so their labels are often noisy.
We propose a noise-tolerant learning framework to find anti-interference model parameters against both noisy labels and noisy correspondence.
Our method significantly improves the robustness of the action recognition model and surpasses the baselines by a clear margin.
arXiv Detail & Related papers (2022-05-16T12:14:03Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking [18.225204270240734]
We propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities.
MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively.
arXiv Detail & Related papers (2021-12-14T14:14:17Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Cross-Referencing Self-Training Network for Sound Event Detection in Audio Mixtures [23.568610919253352]
This paper proposes a semi-supervised method for generating pseudo-labels from unsupervised data using a student-teacher scheme that balances self-training and cross-training (a generic sketch of the student-teacher idea appears after this list).
The results of these methods on both the "validation" and "public evaluation" sets of the DESED database show significant improvement over state-of-the-art systems in semi-supervised learning.
arXiv Detail & Related papers (2021-05-27T18:46:59Z)
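The Cross-Referencing Self-Training entry above mentions pseudo-labels produced by a student-teacher scheme. As a point of reference, here is a generic mean-teacher-style sketch in PyTorch: an EMA copy of the student labels unlabeled frames, and only confident predictions are kept. The EMA decay, confidence threshold, and stand-in model are illustrative assumptions, not details taken from that paper.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay: float = 0.999):
    """Exponential moving average of student weights into the teacher."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

def pseudo_label_loss(student, teacher, unlabeled, threshold: float = 0.9):
    """Cross-entropy on the frames where the teacher is confident."""
    with torch.no_grad():
        probs = teacher(unlabeled).softmax(dim=-1)
        conf, labels = probs.max(dim=-1)
        mask = conf >= threshold                 # keep confident frames only
    logits = student(unlabeled)
    if mask.any():
        return F.cross_entropy(logits[mask], labels[mask])
    return logits.sum() * 0.0                    # no confident frames this batch

# Usage: the teacher starts as a frozen copy of the student and tracks it by EMA.
student = torch.nn.Linear(64, 10)                # stand-in for a SED network
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
features = torch.randn(32, 64)                   # unlabeled feature frames
loss = pseudo_label_loss(student, teacher, features)
loss.backward()
ema_update(teacher, student)
```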
This list is automatically generated from the titles and abstracts of the papers on this site.