Bio-Inspired Modality Fusion for Active Speaker Detection
- URL: http://arxiv.org/abs/2003.00063v2
- Date: Tue, 13 Apr 2021 11:05:06 GMT
- Title: Bio-Inspired Modality Fusion for Active Speaker Detection
- Authors: Gustavo Assun\c{c}\~ao, Nuno Gon\c{c}alves, Paulo Menezes
- Abstract summary: This paper presents a methodology for fusing correlated auditory and visual information for active speaker detection.
The ability can have a wide range of applications, from teleconferencing systems to social robotics.
- Score: 1.0644456464343592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human beings have developed fantastic abilities to integrate information from
various sensory sources exploring their inherent complementarity. Perceptual
capabilities are therefore heightened, enabling, for instance, the well-known
"cocktail party" and McGurk effects, i.e., speech disambiguation from a panoply
of sound signals. This fusion ability is also key in refining the perception of
sound source location, as in distinguishing whose voice is being heard in a
group conversation. Furthermore, neuroscience has successfully identified the
superior colliculus region in the brain as the one responsible for this
modality fusion, with a handful of biological models having been proposed to
approach its underlying neurophysiological process. Deriving inspiration from
one of these models, this paper presents a methodology for effectively fusing
correlated auditory and visual information for active speaker detection. Such
an ability can have a wide range of applications, from teleconferencing systems
to social robotics. The detection approach initially routes auditory and visual
information through two specialized neural network structures. The resulting
embeddings are fused via a novel layer based on the superior colliculus, whose
topological structure emulates spatial neuron cross-mapping of unimodal
perceptual fields. The validation process employed two publicly available
datasets, with achieved results confirming and greatly surpassing initial
expectations.
Related papers
- Exploring neural oscillations during speech perception via surrogate gradient spiking neural networks [59.38765771221084]
We present a physiologically inspired speech recognition architecture compatible and scalable with deep learning frameworks.
We show end-to-end gradient descent training leads to the emergence of neural oscillations in the central spiking neural network.
Our findings highlight the crucial inhibitory role of feedback mechanisms, such as spike frequency adaptation and recurrent connections, in regulating and synchronising neural activity to improve recognition performance.
arXiv Detail & Related papers (2024-04-22T09:40:07Z) - Towards Decoding Brain Activity During Passive Listening of Speech [0.0]
We attempt to decode heard speech from intracranial electroencephalographic (iEEG) data using deep learning methods.
This approach diverges from the conventional focus on speech production and instead chooses to investigate neural representations of perceived speech.
Despite the approach not having achieved a breakthrough yet, the research sheds light on the potential of decoding neural activity during speech perception.
arXiv Detail & Related papers (2024-02-26T20:04:01Z) - Brain-Inspired Machine Intelligence: A Survey of
Neurobiologically-Plausible Credit Assignment [65.268245109828]
We examine algorithms for conducting credit assignment in artificial neural networks that are inspired or motivated by neurobiology.
We organize the ever-growing set of brain-inspired learning schemes into six general families and consider these in the context of backpropagation of errors.
The results of this review are meant to encourage future developments in neuro-mimetic systems and their constituent learning processes.
arXiv Detail & Related papers (2023-12-01T05:20:57Z) - Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z) - Canonical Cortical Graph Neural Networks and its Application for Speech
Enhancement in Future Audio-Visual Hearing Aids [0.726437825413781]
This paper proposes a more biologically plausible self-supervised machine learning approach that combines multimodal information using intra-layer modulations together with canonical correlation analysis (CCA)
The approach outperformed recent state-of-the-art results considering both better clean audio reconstruction and energy efficiency, described by a reduced and smother neuron firing rate distribution.
arXiv Detail & Related papers (2022-06-06T15:20:07Z) - Toward a realistic model of speech processing in the brain with
self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z) - Deep Representations for Time-varying Brain Datasets [4.129225533930966]
This paper builds an efficient graph neural network model that incorporates both region-mapped fMRI sequences and structural connectivities as inputs.
We find good representations of the latent brain dynamics through learning sample-level adaptive adjacency matrices.
These modules can be easily adapted to and are potentially useful for other applications outside the neuroscience domain.
arXiv Detail & Related papers (2022-05-23T21:57:31Z) - Successes and critical failures of neural networks in capturing
human-like speech recognition [1.1602089225841632]
Speech recognition is inherently robust in humans to a number transformations at various spectrotemporal granularities.
We evaluate state-of-the-art neural networks as stimulus-computable, optimized observers.
arXiv Detail & Related papers (2022-04-06T06:35:10Z) - Model-based analysis of brain activity reveals the hierarchy of language
in 305 subjects [82.81964713263483]
A popular approach to decompose the neural bases of language consists in correlating, across individuals, the brain responses to different stimuli.
Here, we show that a model-based approach can reach equivalent results within subjects exposed to natural stimuli.
arXiv Detail & Related papers (2021-10-12T15:30:21Z) - CogAlign: Learning to Align Textual Neural Representations to Cognitive
Language Processing Signals [60.921888445317705]
We propose a CogAlign approach to integrate cognitive language processing signals into natural language processing models.
We show that CogAlign achieves significant improvements with multiple cognitive features over state-of-the-art models on public datasets.
arXiv Detail & Related papers (2021-06-10T07:10:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.