Related papers: Bio-Inspired Modality Fusion for Active Speaker Detection

Bio-Inspired Modality Fusion for Active Speaker Detection

URL: http://arxiv.org/abs/2003.00063v2
Date: Tue, 13 Apr 2021 11:05:06 GMT
Title: Bio-Inspired Modality Fusion for Active Speaker Detection
Authors: Gustavo Assun\c{c}\~ao, Nuno Gon\c{c}alves, Paulo Menezes
Abstract summary: This paper presents a methodology for fusing correlated auditory and visual information for active speaker detection. The ability can have a wide range of applications, from teleconferencing systems to social robotics.
Score: 1.0644456464343592
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human beings have developed fantastic abilities to integrate information from various sensory sources exploring their inherent complementarity. Perceptual capabilities are therefore heightened, enabling, for instance, the well-known "cocktail party" and McGurk effects, i.e., speech disambiguation from a panoply of sound signals. This fusion ability is also key in refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, neuroscience has successfully identified the superior colliculus region in the brain as the one responsible for this modality fusion, with a handful of biological models having been proposed to approach its underlying neurophysiological process. Deriving inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability can have a wide range of applications, from teleconferencing systems to social robotics. The detection approach initially routes auditory and visual information through two specialized neural network structures. The resulting embeddings are fused via a novel layer based on the superior colliculus, whose topological structure emulates spatial neuron cross-mapping of unimodal perceptual fields. The validation process employed two publicly available datasets, with achieved results confirming and greatly surpassing initial expectations.

Related papers

SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning [54.390403684665834]
Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience.<n>We propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner.<n> Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance.
arXiv Detail & Related papers (2025-08-14T03:01:05Z)
Towards Unified Neural Decoding with Brain Functional Network Modeling [34.13766828046489]
We present Multi-individual Brain Region-Aggregated Network (MIBRAIN), a neural decoding framework.<n>MIBRAIN constructs a whole functional brain network model by integrating intracranial neurophysiological recordings across multiple individuals.<n>Our framework paves the way for robust neural decoding across individuals and offers insights for practical clinical applications.
arXiv Detail & Related papers (2025-05-30T12:10:37Z)
BrainStratify: Coarse-to-Fine Disentanglement of Intracranial Neural Dynamics [8.36470471250669]
Decoding speech directly from neural activity is a central goal in brain-computer interface (BCI) research.<n>In recent years, exciting advances have been made through the growing use of intracranial field potential recordings, such as stereo-ElectroEncephaloGraphy (sEEG) and ElectroCorticoGraphy (ECoG)<n>These neural signals capture rich population-level activity but present key challenges: (i) task-relevant neural signals are sparsely distributed across sEEG electrodes, and (ii) they are often entangled with task-irrelevant neural signals in both sEEG and ECo
arXiv Detail & Related papers (2025-05-26T19:36:39Z)
Fundamental Survey on Neuromorphic Based Audio Classification [0.5530212768657544]
This survey provides an exhaustive examination of the current state-of-the-art in neuromorphic-based audio classification. It delves into the crucial components of neuromorphic systems, such as Spiking Neural Networks (SNNs), memristors, and neuromorphic hardware platforms. It examines how these approaches address the limitations of traditional audio classification methods, particularly in terms of energy efficiency, real-time processing, and robustness to environmental noise.
arXiv Detail & Related papers (2025-02-20T21:34:32Z)
Exploring neural oscillations during speech perception via surrogate gradient spiking neural networks [59.38765771221084]
We present a physiologically inspired speech recognition architecture compatible and scalable with deep learning frameworks. We show end-to-end gradient descent training leads to the emergence of neural oscillations in the central spiking neural network. Our findings highlight the crucial inhibitory role of feedback mechanisms, such as spike frequency adaptation and recurrent connections, in regulating and synchronising neural activity to improve recognition performance.
arXiv Detail & Related papers (2024-04-22T09:40:07Z)
Towards Decoding Brain Activity During Passive Listening of Speech [0.0]
We attempt to decode heard speech from intracranial electroencephalographic (iEEG) data using deep learning methods. This approach diverges from the conventional focus on speech production and instead chooses to investigate neural representations of perceived speech. Despite the approach not having achieved a breakthrough yet, the research sheds light on the potential of decoding neural activity during speech perception.
arXiv Detail & Related papers (2024-02-26T20:04:01Z)
Brain-Inspired Machine Intelligence: A Survey of Neurobiologically-Plausible Credit Assignment [65.268245109828]
We examine algorithms for conducting credit assignment in artificial neural networks that are inspired or motivated by neurobiology. We organize the ever-growing set of brain-inspired learning schemes into six general families and consider these in the context of backpropagation of errors. The results of this review are meant to encourage future developments in neuro-mimetic systems and their constituent learning processes.
arXiv Detail & Related papers (2023-12-01T05:20:57Z)
Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation. A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose. The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z)
Canonical Cortical Graph Neural Networks and its Application for Speech Enhancement in Future Audio-Visual Hearing Aids [0.726437825413781]
This paper proposes a more biologically plausible self-supervised machine learning approach that combines multimodal information using intra-layer modulations together with canonical correlation analysis (CCA) The approach outperformed recent state-of-the-art results considering both better clean audio reconstruction and energy efficiency, described by a reduced and smother neuron firing rate distribution.
arXiv Detail & Related papers (2022-06-06T15:20:07Z)
Toward a realistic model of speech processing in the brain with self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate. We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z)
Deep Representations for Time-varying Brain Datasets [4.129225533930966]
This paper builds an efficient graph neural network model that incorporates both region-mapped fMRI sequences and structural connectivities as inputs. We find good representations of the latent brain dynamics through learning sample-level adaptive adjacency matrices. These modules can be easily adapted to and are potentially useful for other applications outside the neuroscience domain.
arXiv Detail & Related papers (2022-05-23T21:57:31Z)
Successes and critical failures of neural networks in capturing human-like speech recognition [1.1602089225841632]
Speech recognition is inherently robust in humans to a number transformations at various spectrotemporal granularities. We evaluate state-of-the-art neural networks as stimulus-computable, optimized observers.
arXiv Detail & Related papers (2022-04-06T06:35:10Z)
Model-based analysis of brain activity reveals the hierarchy of language in 305 subjects [82.81964713263483]
A popular approach to decompose the neural bases of language consists in correlating, across individuals, the brain responses to different stimuli. Here, we show that a model-based approach can reach equivalent results within subjects exposed to natural stimuli.
arXiv Detail & Related papers (2021-10-12T15:30:21Z)
CogAlign: Learning to Align Textual Neural Representations to Cognitive Language Processing Signals [60.921888445317705]
We propose a CogAlign approach to integrate cognitive language processing signals into natural language processing models. We show that CogAlign achieves significant improvements with multiple cognitive features over state-of-the-art models on public datasets.
arXiv Detail & Related papers (2021-06-10T07:10:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.