Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models
- URL: http://arxiv.org/abs/2503.16853v2
- Date: Sun, 08 Jun 2025 11:20:45 GMT
- Title: Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models
- Authors: Suho Yoo, Hyunjong Ok, Jaeho Lee
- Abstract summary: We propose Imagine to Hear, a novel approach that dynamically generates auditory knowledge using generative models. Our framework detects multiple audio-related textual spans from the given prompt and generates corresponding auditory knowledge. Our experiments show that our method achieves state-of-the-art performance on AuditoryBench without relying on external databases.
- Score: 11.136112399898481
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models pretrained on text-only corpora often struggle with tasks that require auditory commonsense knowledge. Previous work addresses this problem by augmenting the language model to retrieve knowledge from external audio databases. This approach has several limitations, such as the potential lack of relevant audio in databases and the high costs associated with constructing the databases. To address these issues, we propose Imagine to Hear, a novel approach that dynamically generates auditory knowledge using generative models. Our framework detects multiple audio-related textual spans from the given prompt and generates corresponding auditory knowledge. We develop several mechanisms to efficiently process multiple auditory knowledge, including a CLAP-based rejection sampler and a language-audio fusion module. Our experiments show that our method achieves state-of-the-art performance on AuditoryBench without relying on external databases, highlighting the effectiveness of our generation-based approach.
Related papers
- An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM [15.340075567628466]
This work examined the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the prompt.
Our findings show that while even zero-shot interleaved prompting improves performance on our reasoning tasks, a small amount of fine-tuning improves the results further.
arXiv Detail & Related papers (2025-11-04T03:54:55Z)
- SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models [96.81401797908835]
We introduce SAKE, the first benchmark specifically designed for editing auditory attribute knowledge in Large Audio-Language Models.
We benchmark seven editing methods on two LALMs along four dimensions: reliability, generality, audio/text locality, and portability.
Results highlight challenges such as preserving intra-attribute knowledge unrelated to the edit, generalizing edits to multimodal reasoning, and maintaining edits under sequential updates.
arXiv Detail & Related papers (2025-10-19T16:22:09Z)
- UALM: Unified Audio Language Model for Understanding, Generation and Reasoning [124.19449187588832]
Unified Audio Language Model (UALM) aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model.
We first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models.
We present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks.
arXiv Detail & Related papers (2025-10-13T22:55:01Z)
- AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing? [13.180643834705114]
We present AuditoryBench++, a benchmark for evaluating auditory knowledge and reasoning in text-only settings.
The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning.
We also introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference.
arXiv Detail & Related papers (2025-09-22T11:45:22Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs.
These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks.
We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- Do Audio-Language Models Understand Linguistic Variations? [42.17718387132912]
Open-vocabulary audio-language models (ALMs) represent a promising new paradigm for audio-text retrieval using natural language queries.
We propose RobustCLAP, a novel and compute-efficient technique to learn audio-language representations that are robust to linguistic variations.
arXiv Detail & Related papers (2024-10-21T20:55:33Z)
- Audio Captioning RAG via Generative Pair-to-Pair Retrieval with Refined Knowledge Base [0.0]
Retrieval-Augmented Generation (RAG) retrieves audio-text pairs from a knowledge base and augments them with query audio to generate accurate textual responses.
We propose generative pair-to-pair retrieval, which uses the generated caption as a text query to accurately find relevant audio-text pairs.
Our approach achieves state-of-the-art results on benchmarks including AudioCaps, Clotho, and Auto-ACD.
arXiv Detail & Related papers (2024-10-14T04:57:32Z)
- Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition [110.8431434620642]
We introduce the generative speech transcription error correction (GenSEC) challenge.
This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition.
We discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
arXiv Detail & Related papers (2024-09-15T16:32:49Z)
- AudioBERT: Audio Knowledge Augmented Language Model [11.136112399898481]
Recent studies have identified that language models, pretrained on text-only datasets, often lack elementary visual knowledge.
We construct a new dataset called AuditoryBench, which consists of two novel tasks for evaluating auditory knowledge.
Based on our analysis using the benchmark, we find that language models also suffer from a severe lack of auditory knowledge.
We propose AudioBERT, a novel method to augment the auditory knowledge of BERT through a retrieval-based approach.
arXiv Detail & Related papers (2024-09-12T16:36:39Z)
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z)
- Multi-Modal Retrieval For Large Language Model Based Speech Recognition [15.494654232953678]
We propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques.
We show that speech-based multi-modal retrieval outperforms text-based retrieval.
We achieve state-of-the-art recognition results on the Spoken-Squad question answering dataset.
arXiv Detail & Related papers (2024-06-13T22:55:22Z)
- Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings [8.660203441911554]
We propose a methodology for training language models leveraging spoken language audio data.
This leads to an improved language model for analyzing spoken transcripts while avoiding audio-processing overhead at test time.
In our experiments, the student model achieves consistent improvement over traditional language models on tasks analyzing spoken transcripts.
arXiv Detail & Related papers (2023-11-13T01:53:12Z)
- Search-Engine-augmented Dialogue Response Generation with Cheaply Supervised Query Production [98.98161995555485]
We propose a dialogue model that can access the vast and dynamic information from any search engine for response generation.
As the core module, a query producer is used to generate queries from a dialogue context to interact with a search engine.
Experiments show that our query producer can achieve R@1 and R@5 rates of 62.4% and 74.8% for retrieving gold knowledge.
arXiv Detail & Related papers (2023-02-16T01:58:10Z)
- Recitation-Augmented Language Models [85.30591349383849]
We show that RECITE is a powerful paradigm for knowledge-intensive NLP tasks.
Specifically, we show that by utilizing recitation as the intermediate step, a recite-and-answer scheme can achieve new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-04T00:49:20Z)
- Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters [52.725200145600624]
We propose KnowExpert to bypass the retrieval process by injecting prior knowledge into the pre-trained language models with lightweight adapters.
Experimental results show that KnowExpert performs comparably with the retrieval-based baselines.
arXiv Detail & Related papers (2021-05-13T12:33:23Z)
- How Context Affects Language Models' Factual Predictions [134.29166998377187]
We integrate information from a retrieval system with a pre-trained language model in a purely unsupervised way.
We report that augmenting pre-trained language models in this way dramatically improves performance and that the resulting system, despite being unsupervised, is competitive with a supervised machine reading baseline.
arXiv Detail & Related papers (2020-05-10T09:28:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.