An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits
- URL: http://arxiv.org/abs/2212.10744v2
- Date: Fri, 22 Mar 2024 14:33:35 GMT
- Title: An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits
- Authors: Kai Li, Fenghua Xie, Hang Chen, Kexin Yuan, Xiaolin Hu
- Abstract summary: We propose a novel cortico-thalamo-cortical neural network (CTCNet) for audio-visual speech separation (AVSS)
CTCNet learns hierarchical auditory and visual representations in a bottom-up manner in separate auditory and visual subnetworks.
Experiments on three speech separation benchmark datasets show that CTCNet remarkably outperforms existing AVSS methods with considerably fewer parameters.
- Score: 22.558134249701794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual approaches involving visual inputs have laid the foundation for recent progress in speech separation. However, the optimization of the concurrent usage of auditory and visual inputs is still an active research area. Inspired by the cortico-thalamo-cortical circuit, in which the sensory processing mechanisms of different modalities modulate one another via the non-lemniscal sensory thalamus, we propose a novel cortico-thalamo-cortical neural network (CTCNet) for audio-visual speech separation (AVSS). First, the CTCNet learns hierarchical auditory and visual representations in a bottom-up manner in separate auditory and visual subnetworks, mimicking the functions of the auditory and visual cortical areas. Then, inspired by the large number of connections between cortical regions and the thalamus, the model fuses the auditory and visual information in a thalamic subnetwork through top-down connections. Finally, the model transmits this fused information back to the auditory and visual subnetworks, and the above process is repeated several times. The results of experiments on three speech separation benchmark datasets show that CTCNet remarkably outperforms existing AVSS methods with considerably fewer parameters. These results suggest that mimicking the anatomical connectome of the mammalian brain has great potential for advancing the development of deep neural networks. Project repo is https://github.com/JusperLee/CTCNet.
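The repeated bottom-up/fusion/feedback cycle the abstract describes can be sketched at a high level. This is a minimal illustration only: the feature dimensions, the two-layer linear+ReLU "cortical" encoders, the simple averaging fusion in the "thalamic" stage, and the additive feedback are all assumptions for the sketch, not the paper's actual layer design.

```python
import numpy as np

def bottom_up(x, weights):
    """Hierarchical encoder: one linear + ReLU stage per level (illustrative)."""
    for w in weights:
        x = np.maximum(0.0, x @ w)  # ReLU nonlinearity
    return x

def ctc_cycle(audio, video, wa, wv, cycles=3):
    """Sketch of the cortico-thalamo-cortical loop: separate auditory and
    visual subnetworks, fusion in a 'thalamic' stage, top-down feedback."""
    for _ in range(cycles):
        a_top = bottom_up(audio, wa)      # auditory "cortex"
        v_top = bottom_up(video, wv)      # visual "cortex"
        thalamus = 0.5 * (a_top + v_top)  # assumed fusion: simple average
        # top-down feedback: fused signal modulates both input streams
        audio = audio + thalamus
        video = video + thalamus
    return thalamus

rng = np.random.default_rng(0)
dim = 8
wa = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(2)]
wv = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(2)]
fused = ctc_cycle(rng.standard_normal(dim), rng.standard_normal(dim), wa, wv)
print(fused.shape)  # (8,)
```

In the actual model the fused representation would drive a separation head that estimates a mask for the target speaker; here the loop only demonstrates the iterative multimodal exchange.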
Related papers
- Finding Shared Decodable Concepts and their Negations in the Brain [4.111712524255376]
We train a highly accurate contrastive model that maps brain responses during naturalistic image viewing to CLIP embeddings.
We then use a novel adaptation of the DBSCAN clustering algorithm to cluster the parameters of participant-specific contrastive models.
Examining the images most and least associated with each SDC cluster gives us additional insight into the semantic properties of each SDC.
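The clustering step described above can be illustrated with off-the-shelf DBSCAN. The synthetic "parameter" matrix, the `eps`/`min_samples` values, and the 16-dimensional toy vectors below are assumptions for the sketch; the paper uses a novel adaptation of DBSCAN on real participant-specific contrastive model parameters, not this vanilla version.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy stand-in for per-participant contrastive model parameters:
# each row is one learned projection vector (values are synthetic).
rng = np.random.default_rng(1)
params = np.vstack([
    rng.normal(0.0, 0.05, size=(20, 16)),  # one shared concept direction
    rng.normal(3.0, 0.05, size=(20, 16)),  # another shared direction
    rng.uniform(-5, 5, size=(5, 16)),      # participant-specific noise
])

# DBSCAN groups parameter vectors that recur across participants into
# shared decodable concepts; outliers are labelled -1.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(params)
print(sorted(set(labels)))  # [-1, 0, 1]
```

The density-based formulation fits this use case because the number of shared concepts is not known in advance and participant-specific directions should be left unclustered rather than forced into a group.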
arXiv Detail & Related papers (2024-05-27T21:28:26Z) - Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z) - HAVE-Net: Hallucinated Audio-Visual Embeddings for Few-Shot Classification with Unimodal Cues [19.800985243540797]
Challenges such as occlusion, intra-class variance, and lighting conditions might arise while training neural networks using unimodal remote sensing (RS) visual input.
We propose a novel few-shot generative framework, Hallucinated Audio-Visual Embeddings-Network (HAVE-Net), to meta-train cross-modal features from limited unimodal data.
arXiv Detail & Related papers (2023-09-23T20:05:00Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
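A cross-modal fusion step of the kind this summary names can be illustrated with plain scaled dot-product cross-attention, where audio frames query video frames. This is a hedged sketch only: the lack of learned projections, the frame counts, and the feature dimension are assumptions, not the CMFE architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio, visual):
    """Audio-guided cross-attention (illustrative): each audio frame queries
    all visual frames, yielding visual features aligned to the audio rate."""
    scores = audio @ visual.T / np.sqrt(audio.shape[1])
    return softmax(scores) @ visual

a = np.random.default_rng(2).standard_normal((10, 16))  # 10 audio frames
v = np.random.default_rng(3).standard_normal((25, 16))  # 25 video frames
fused = cross_modal_attention(a, v)
print(fused.shape)  # (10, 16)
```

Attending from audio to video sidesteps the frame-rate mismatch between the two streams, since the output always has one fused vector per audio frame.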
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective [30.995357472421404]
Video Saliency Prediction (VSP) imitates the selective attention mechanism of human brain.
Most VSP methods exploit semantic correlation between vision and audio modalities but ignore the negative effects due to the temporal inconsistency of audio-visual intrinsics.
Inspired by the biological inconsistency-correction within multi-sensory information, a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed.
arXiv Detail & Related papers (2023-03-11T09:29:57Z) - Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
We show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - Prune and distill: similar reformatting of image information along rat visual cortex and deep neural networks [61.60177890353585]
Deep convolutional neural networks (CNNs) have been shown to provide excellent models of their functional analogue in the brain, the ventral stream of visual cortex.
Here we consider some prominent statistical patterns that are known to exist in the internal representations of either CNNs or the visual cortex.
We show that CNNs and visual cortex share a similarly tight relationship between dimensionality expansion/reduction of object representations and reformatting of image information.
arXiv Detail & Related papers (2022-05-27T08:06:40Z) - Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audios.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
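The HRIR-based rendering this summary mentions amounts to convolving a mono source with a left-ear and a right-ear impulse response chosen for its spatial location. The toy delay-and-attenuate HRIRs, the sample rate, and the signal below are illustrative assumptions; measured HRIRs and the spherical-harmonic machinery of the paper are far richer.

```python
import numpy as np

def binauralize(mono, hrir_left, hrir_right):
    """Render a mono source to two-channel binaural audio by convolving it
    with the head-related impulse responses for its assumed location."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])

# Toy HRIRs: a pure delay/gain difference standing in for measured responses.
hrir_l = np.zeros(8); hrir_l[1] = 1.0  # source slightly closer to left ear
hrir_r = np.zeros(8); hrir_r[4] = 0.8  # later arrival, attenuated, right ear

mono = np.sin(2 * np.pi * 440 * np.arange(1024) / 16000)  # 440 Hz tone
binaural = binauralize(mono, hrir_l, hrir_r)
print(binaural.shape)  # (2, 1031): valid "full" convolution, 1024 + 8 - 1
```

The interaural time and level differences introduced by the two impulse responses are exactly the cues a listener uses to localize the source, which is why such renderings can substitute for real binaural recordings as training targets.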
arXiv Detail & Related papers (2021-04-13T13:07:33Z) - Deep Convolutional and Recurrent Networks for Polyphonic Instrument Classification from Monophonic Raw Audio Waveforms [30.3491261167433]
Sound Event Detection and Audio Classification tasks are traditionally addressed through time-frequency representations of audio signals such as spectrograms.
The use of deep neural networks as efficient feature extractors has enabled the direct use of audio signals for classification purposes.
We attempt to recognize musical instruments in polyphonic audio by only feeding their raw waveforms into deep learning models.
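Feeding raw waveforms directly into a model, as described above, typically starts with strided 1-D convolutions that learn filterbank-like front ends. The sketch below is a deliberately minimal, untrained stand-in with random weights: the kernel sizes, stride, pooling, and 4-class head are assumptions for illustration, not the paper's CNN/RNN architecture.

```python
import numpy as np

def conv1d(x, kernel, stride=4):
    """Valid strided 1-D convolution (illustrative, loop-based)."""
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(out_len)])

def classify_waveform(wave, kernels, class_w):
    """Sketch: strided conv bank + ReLU + global average pooling + linear head."""
    feats = np.array([np.maximum(0.0, conv1d(wave, k)).mean() for k in kernels])
    logits = feats @ class_w
    return int(np.argmax(logits))

rng = np.random.default_rng(0)
wave = rng.standard_normal(4096)                         # raw audio samples
kernels = [rng.standard_normal(64) * 0.1 for _ in range(8)]
class_w = rng.standard_normal((8, 4))                    # 8 feats -> 4 classes
label = classify_waveform(wave, kernels, class_w)
print(label)
```

The point of the front end is that the convolution kernels, once trained, take over the role a fixed spectrogram transform would otherwise play.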
arXiv Detail & Related papers (2021-02-13T13:44:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.