AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?
- URL: http://arxiv.org/abs/2509.17641v1
- Date: Mon, 22 Sep 2025 11:45:22 GMT
- Title: AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?
- Authors: Hyunjong Ok, Suho Yoo, Hyeonjun Kim, Jaeho Lee
- Abstract summary: We present AuditoryBench++, a benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning. We also introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference.
- Score: 13.180643834705114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Even without directly hearing sounds, humans can effortlessly reason about auditory properties, such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, limiting their effectiveness in multimodal interactions. As an initial step to address this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both the off-the-shelf models and those augmented with auditory knowledge. The project page is available at https://auditorybenchpp.github.io.
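As a rough illustration of the span-detection-and-injection idea described in the abstract, the following minimal Python sketch shows one way such an inference loop could look. The special tokens (`<aud>`, `</aud>`), the `imagine_audio` helper, and the `model_step` callable are hypothetical stand-ins for illustration, not the paper's actual implementation.

```python
# Minimal sketch (hypothetical names throughout) of an AIR-CoT-style
# inference loop: the language model emits special tokens around
# audio-related spans, and auditory knowledge is generated and injected
# back into the context before decoding resumes.

AUD_OPEN, AUD_CLOSE = "<aud>", "</aud>"

def imagine_audio(span: str) -> str:
    """Stand-in for an auditory-knowledge generator (e.g., an audio
    captioner or encoder); here it simply returns a canned description."""
    return f"[imagined audio for '{span}': approximate pitch, loudness, timbre]"

def air_cot_decode(model_step, prompt: str, max_steps: int = 64) -> str:
    """Greedy decoding with span detection and knowledge injection.

    model_step: callable taking the current context string and returning
    the next token as a string (assumed interface for this sketch).
    """
    context = prompt
    for _ in range(max_steps):
        token = model_step(context)
        context += token
        if context.endswith(AUD_CLOSE) and AUD_OPEN in context:
            # A complete audio-related span was just closed; inject knowledge.
            start = context.rindex(AUD_OPEN) + len(AUD_OPEN)
            span = context[start:-len(AUD_CLOSE)]
            context += " " + imagine_audio(span) + " "
        if token == "<eos>":
            break
    return context
```

With a toy `model_step` that scripts a fixed token sequence, the loop shows where the imagined auditory knowledge would enter the context before generation continues.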
Related papers
- SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models [96.81401797908835]
We introduce SAKE, the first benchmark specifically designed for editing auditory attribute knowledge in Large Audio-Language Models. We benchmark seven editing methods on two LALMs along four dimensions: reliability, generality, audio/text locality, and portability. Results highlight challenges such as preserving intra-attribute knowledge unrelated to the edit, generalizing edits to multimodal reasoning, and maintaining edits under sequential updates.
arXiv Detail & Related papers (2025-10-19T16:22:09Z)
- WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalizations [67.6147632074449]
We introduce the World-of-Whale benchmark (WoW-Bench) to evaluate low-level auditory perception and cognition using marine mammal vocalizations. WoW-Bench is composed of a Perception benchmark for categorizing novel sounds and a Cognition benchmark, inspired by Bloom's taxonomy, to assess the abilities to remember, understand, apply, and analyze sound events. Experiments with state-of-the-art LALMs show performance far below human levels, indicating a need for stronger auditory grounding in LALMs.
arXiv Detail & Related papers (2025-08-28T16:29:46Z)
- Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. These models often hallucinate non-existent sound events, reducing their reliability in real-world applications. We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
arXiv Detail & Related papers (2025-05-20T15:44:01Z)
- Scaling Auditory Cognition via Test-Time Compute in Audio Language Models [9.927800622905265]
Large language models (LLMs) have shown exceptional versatility in natural language processing. Audio LLMs excel in tasks such as speech recognition and synthesis. It remains unclear how they perform when faced with the auditory cognitive challenges posed by real-world environments.
arXiv Detail & Related papers (2025-03-30T11:04:18Z)
- Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models [11.136112399898481]
We propose Imagine to Hear, a novel approach that dynamically generates auditory knowledge using generative models. Our framework detects multiple audio-related textual spans from the given prompt and generates corresponding auditory knowledge. Our experiments show that our method achieves state-of-the-art performance on AuditoryBench without relying on external databases.
arXiv Detail & Related papers (2025-03-21T04:56:22Z)
- AAD-LLM: Neural Attention-Driven Auditory Scene Understanding [9.596626274863832]
We present Auditory Attention-Driven LLM (AAD-LLM), a prototype system that integrates brain signals to infer listener attention. AAD-LLM predicts the attended speaker from neural activity, then conditions response generation on this inferred attentional state. We evaluate AAD-LLM on speaker description, speech transcription and extraction, and question answering in multitalker scenarios.
arXiv Detail & Related papers (2025-02-24T03:06:45Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- AudioBERT: Audio Knowledge Augmented Language Model [11.136112399898481]
Recent studies have identified that language models, pretrained on text-only datasets, often lack elementary visual knowledge. We construct a new dataset called AuditoryBench, which consists of two novel tasks for evaluating auditory knowledge. Based on our analysis using the benchmark, we find that language models also suffer from a severe lack of auditory knowledge. We propose AudioBERT, a novel method to augment the auditory knowledge of BERT through a retrieval-based approach.
arXiv Detail & Related papers (2024-09-12T16:36:39Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
The design is asymmetric with respect to the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)