Scaling Auditory Cognition via Test-Time Compute in Audio Language Models
- URL: http://arxiv.org/abs/2503.23395v1
- Date: Sun, 30 Mar 2025 11:04:18 GMT
- Title: Scaling Auditory Cognition via Test-Time Compute in Audio Language Models
- Authors: Ting Dang, Yan Gao, Hong Jia,
- Abstract summary: Large language models (LLMs) have shown exceptional versatility in natural language processing. Audio LLMs excel in tasks such as speech recognition and synthesis. It remains unclear how they perform when faced with the auditory cognitive challenges posed by real-world environments.
- Score: 9.927800622905265
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) have shown exceptional versatility in natural language processing, prompting recent efforts to extend their multimodal capabilities to speech processing through the development of audio large language models (Audio LLMs). While Audio LLMs excel in tasks such as speech recognition and synthesis, it remains unclear how they perform when faced with the auditory cognitive challenges posed by real-world environments, such as audio comprehension and listening recall, particularly in the presence of background noise or overlapping speech. Unlike text-based LLMs, which have access to vast amounts of text data for pre-training, retraining Audio LLMs with diverse auditory cognitive scenes is difficult due to the limited datasets that simulate real-world auditory cognitive scenarios and the challenge of acquiring auditory cognitive labels for training. While test-time compute (TTC) methods have been shown to enhance the capabilities of text-based LLMs during inference, a key challenge lies in designing these TTC methods to improve the auditory capabilities of Audio LLMs. This study aims to address these two research gaps by: i) exploring the auditory cognitive capabilities of Audio LLMs, and ii) enhancing their capabilities using TTC approaches. We have investigated five different Audio LLMs for auditory cognition using a self-collected database and have proposed five TTC approaches to enhance auditory cognitive capabilities during inference. Our findings reveal that Audio LLMs' performance decreases in more challenging auditory cognitive tasks. The proposed TTC approaches significantly enhance auditory cognitive capabilities, advancing the development of more adaptable and resilient Audio LLMs for practical applications such as assistive listening devices, voice-based AI assistants, and communication technologies.
Related papers
- Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models [11.136112399898481]
We propose Imagine to Hear, a novel approach that dynamically generates auditory knowledge using generative models. Our framework detects multiple audio-related textual spans from the given prompt and generates corresponding auditory knowledge. Our experiments show that our method achieves state-of-the-art performance on AuditoryBench without relying on external databases.
arXiv Detail & Related papers (2025-03-21T04:56:22Z)
- Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model [26.20569269005708]
Large Audio-Language Models (LALMs) have demonstrated remarkable performance in tasks involving audio perception and understanding. However, their reasoning capabilities - critical for solving complex real-world problems - remain underexplored. We conduct the first exploration into integrating Chain-of-Thought (CoT) reasoning into LALMs to enhance their reasoning ability across auditory modalities.
arXiv Detail & Related papers (2025-01-13T11:54:40Z)
- VoiceBench: Benchmarking LLM-Based Voice Assistants [58.84144494938931]
We introduce VoiceBench, the first benchmark to evaluate voice assistants based on large language models (LLMs). VoiceBench includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.
arXiv Detail & Related papers (2024-10-22T17:15:20Z)
- Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models [49.87432626548563]
We introduce methods to assess the extent of object hallucination of publicly available LALMs.
Our findings reveal that LALMs are comparable to specialized audio captioning models in their understanding of audio content.
We explore the potential of prompt engineering to enhance LALMs' performance on discriminative questions.
arXiv Detail & Related papers (2024-06-12T16:51:54Z)
- PhonologyBench: Evaluating Phonological Skills of Large Language Models [57.80997670335227]
Phonology, the study of speech's structure and pronunciation rules, is a critical yet often overlooked component in Large Language Model (LLM) research.
We present PhonologyBench, a novel benchmark consisting of three diagnostic tasks designed to explicitly test the phonological skills of LLMs.
We observe a significant gap of 17% and 45% on Rhyme Word Generation and Syllable counting, respectively, when compared to humans.
arXiv Detail & Related papers (2024-04-03T04:53:14Z)
- LaERC-S: Improving LLM-based Emotion Recognition in Conversation with Speaker Characteristics [25.284238441231853]
Emotion recognition in conversation (ERC) is the task of discerning human emotions for each utterance within a conversation.
Recent research in ERC has sought to exploit pre-trained large language models (LLMs) with speaker modelling to comprehend emotional states.
We present LaERC-S, a novel framework that stimulates LLMs to explore speaker characteristics involving the mental state and behavior of interlocutors.
arXiv Detail & Related papers (2024-03-12T02:37:11Z)
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
- Affect Recognition in Conversations Using Large Language Models [9.689990547610664]
Affect recognition plays a pivotal role in human communication.
This study investigates the capacity of large language models (LLMs) to recognise human affect in conversations.
arXiv Detail & Related papers (2023-09-22T14:11:23Z)
- Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study [0.0]
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems.
Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems.
arXiv Detail & Related papers (2023-07-13T02:31:55Z)
- Audio Self-supervised Learning: A Survey [60.41768569891083]
Self-Supervised Learning (SSL) aims to discover general representations from large-scale data without requiring human annotations.
Its success in the fields of computer vision and natural language processing has prompted its recent adoption in the field of audio and speech processing.
arXiv Detail & Related papers (2022-03-02T15:58:29Z)