AAD-LLM: Neural Attention-Driven Auditory Scene Understanding
- URL: http://arxiv.org/abs/2502.16794v2
- Date: Fri, 14 Mar 2025 20:46:33 GMT
- Title: AAD-LLM: Neural Attention-Driven Auditory Scene Understanding
- Authors: Xilin Jiang, Sukru Samet Dindar, Vishal Choudhari, Stephan Bickel, Ashesh Mehta, Guy M McKhann, Daniel Friedman, Adeen Flinker, Nima Mesgarani
- Abstract summary: We present Auditory Attention-Driven LLM (AAD-LLM), a prototype system that integrates brain signals to infer listener attention. AAD-LLM predicts the attended speaker from neural activity, then conditions response generation on this inferred attentional state. We evaluate AAD-LLM on speaker description, speech transcription and extraction, and question answering in multitalker scenarios.
- Score: 9.596626274863832
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Auditory foundation models, including auditory large language models (LLMs), process all sound inputs equally, independent of listener perception. However, human auditory perception is inherently selective: listeners focus on specific speakers while ignoring others in complex auditory scenes. Existing models do not incorporate this selectivity, limiting their ability to generate perception-aligned responses. To address this, we introduce Intention-Informed Auditory Scene Understanding (II-ASU) and present Auditory Attention-Driven LLM (AAD-LLM), a prototype system that integrates brain signals to infer listener attention. AAD-LLM extends an auditory LLM by incorporating intracranial electroencephalography (iEEG) recordings to decode which speaker a listener is attending to and refine responses accordingly. The model first predicts the attended speaker from neural activity, then conditions response generation on this inferred attentional state. We evaluate AAD-LLM on speaker description, speech transcription and extraction, and question answering in multitalker scenarios, with both objective and subjective ratings showing improved alignment with listener intention. By taking a first step toward intention-aware auditory AI, this work explores a new paradigm where listener perception informs machine listening, paving the way for future listener-centered auditory systems. Demo and code available: https://aad-llm.github.io.
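The abstract describes a two-stage pipeline: first decode which speaker the listener is attending to from neural activity, then condition the LLM's response generation on that decoded attentional state. The sketch below illustrates one way such a pipeline could be wired together; it is a minimal illustration under stated assumptions, not the authors' implementation. The correlation-based envelope decoder, the `decode_attended_speaker` and `build_intention_informed_prompt` helpers, and the prompt format are all assumptions.

```python
# Minimal sketch of an attention-informed auditory pipeline (assumptions, not the paper's code).
import numpy as np

def decode_attended_speaker(ieeg_features: np.ndarray,
                            speaker_envelopes: list[np.ndarray],
                            decoder_weights: np.ndarray) -> int:
    """Pick the speaker whose speech envelope best matches the envelope
    reconstructed from neural activity (a classic correlation-based AAD scheme)."""
    reconstructed = ieeg_features @ decoder_weights          # (time,) reconstructed envelope
    scores = [np.corrcoef(reconstructed, env)[0, 1] for env in speaker_envelopes]
    return int(np.argmax(scores))                            # index of the attended speaker

def build_intention_informed_prompt(question: str,
                                    transcripts: list[str],
                                    attended_idx: int) -> str:
    """Condition the auditory LLM on the decoded attentional state by marking
    which speaker the listener is attending to."""
    lines = [f"Speaker {i + 1}{' (attended)' if i == attended_idx else ''}: {t}"
             for i, t in enumerate(transcripts)]
    return "\n".join(lines) + f"\nAnswer with respect to the attended speaker: {question}"

# Usage: attended = decode_attended_speaker(feats, [env_a, env_b], W)
#        prompt = build_intention_informed_prompt("What did they say?", [txt_a, txt_b], attended)
```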
Related papers
- Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics [54.03209351287654]
We propose a novel evaluation protocol that can assess a spoken dialog system's turn-taking capabilities.
We present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events.
We will open source our evaluation platform to promote the development of advanced conversational AI systems.
arXiv Detail & Related papers (2025-03-03T04:46:04Z)
- Single-word Auditory Attention Decoding Using Deep Learning Model [9.698931956476692]
Identifying auditory attention by comparing auditory stimuli with the corresponding brain responses is known as auditory attention decoding (AAD).
This paper presents a deep learning approach, based on EEGNet, to address this challenge.
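The summary names EEGNet as the backbone for single-word attention decoding. Below is a simplified, hypothetical PyTorch sketch of an EEGNet-style classifier (temporal convolution, depthwise spatial convolution over channels, then a linear read-out); the layer sizes, input dimensions, and two-class output are assumptions, not the paper's exact architecture.

```python
# Simplified EEGNet-style classifier for auditory attention decoding (a sketch, not the paper's model).
import torch
import torch.nn as nn

class TinyEEGNet(nn.Module):
    def __init__(self, n_channels: int = 64, n_samples: int = 128, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=(1, 32), padding=(0, 16), bias=False),    # temporal filters
            nn.BatchNorm2d(8),
            nn.Conv2d(8, 16, kernel_size=(n_channels, 1), groups=8, bias=False),  # depthwise spatial filters
            nn.BatchNorm2d(16),
            nn.ELU(),
            nn.AvgPool2d((1, 4)),
            nn.Dropout(0.5),
        )
        with torch.no_grad():  # infer the flattened feature size from a dummy pass
            flat = self.features(torch.zeros(1, 1, n_channels, n_samples)).numel()
        self.classifier = nn.Linear(flat, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, channels, time) -> class logits
        return self.classifier(self.features(x).flatten(1))

# Usage: logits = TinyEEGNet()(torch.randn(4, 1, 64, 128))  # -> (4, 2)
```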
arXiv Detail & Related papers (2024-10-15T21:57:19Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- Egocentric Auditory Attention Localization in Conversations [25.736198724595486]
We propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict the heatmap of the camera wearer's auditory attention.
Our approach leverages features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset.
arXiv Detail & Related papers (2023-03-28T14:52:03Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Deep Neural Networks on EEG Signals to Predict Auditory Attention Score Using Gramian Angular Difference Field [1.9899603776429056]
The auditory attention score of an individual reflects how well that person can focus during auditory tasks.
Recent advances in deep learning and in non-invasive technologies for recording neural activity raise the question: can deep learning, combined with technologies such as electroencephalography (EEG), be used to predict an individual's auditory attention score?
In this paper, we focus on estimating a person's auditory attention level from their brain's electrical activity, captured using 14-channel EEG signals.
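The entry's title points to the Gramian Angular Difference Field (GADF), a standard transform that turns a 1-D signal into a 2-D image a CNN can consume. A minimal sketch of that transform is shown below; the min-max normalization range and applying it one EEG channel at a time are assumptions about how it would be used here.

```python
# Standard Gramian Angular Difference Field (GADF) of a single EEG channel (illustrative sketch).
import numpy as np

def gadf(signal: np.ndarray) -> np.ndarray:
    """Map a 1-D time series to a 2-D GADF image: rescale to [-1, 1],
    take phi = arccos(x), and return sin(phi_i - phi_j)."""
    x = np.asarray(signal, dtype=float)
    x = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0    # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))                  # polar-coordinate angle
    return np.sin(phi[:, None] - phi[None, :])              # pairwise angular differences

# Usage: image = gadf(eeg_epoch[channel_idx])  # (T,) -> (T, T), one image per channel
```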
arXiv Detail & Related papers (2021-10-24T17:58:14Z)
- WASE: Learning When to Attend for Speaker Extraction in Cocktail Party Environments [21.4128321045702]
In the speaker extraction problem, additional information about the target speaker has been found to aid the tracking and extraction of that speaker.
Inspired by the cue of sound onset, we explicitly modeled the onset cue and verified the effectiveness in the speaker extraction task.
From the task perspective, our onset/offset-based model performs a composite task: a complementary combination of speaker extraction and speaker-dependent voice activity detection.
arXiv Detail & Related papers (2021-06-13T14:56:05Z)
- Noisy Agents: Self-supervised Exploration by Predicting Auditory Events [127.82594819117753]
We propose a novel type of intrinsic motivation for Reinforcement Learning (RL) that encourages the agent to understand the causal effect of its actions.
We train a neural network to predict the auditory events and use the prediction errors as intrinsic rewards to guide RL exploration.
Experimental results on Atari games show that our new intrinsic motivation significantly outperforms several state-of-the-art baselines.
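The summary describes using the prediction error of an auditory-event model as an intrinsic reward. A hedged sketch of that reward shaping is below; the classifier interface, the scaling factor, and how the intrinsic reward is mixed with the extrinsic reward are assumptions, not the paper's exact formulation.

```python
# Prediction-error-as-intrinsic-reward sketch (illustrative, not the paper's exact formulation).
import torch
import torch.nn.functional as F

def intrinsic_reward(event_logits: torch.Tensor,
                     observed_event: torch.Tensor,
                     scale: float = 0.1) -> torch.Tensor:
    """Reward the agent in proportion to how badly its model predicted the
    auditory event caused by its action (per-sample cross-entropy error)."""
    return scale * F.cross_entropy(event_logits, observed_event, reduction="none")

# Usage: r_total = r_extrinsic + intrinsic_reward(predictor(obs, action), event_label)
#        (predictor, obs, action, and event_label are hypothetical placeholders)
```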
arXiv Detail & Related papers (2020-07-27T17:59:08Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)