SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
- URL: http://arxiv.org/abs/2505.13237v2
- Date: Fri, 23 May 2025 07:41:15 GMT
- Title: SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
- Authors: Chih-Kai Yang, Neo Ho, Yen-Ting Piao, Hung-yi Lee
- Abstract summary: Large audio-language models (LALMs) extend large language models with multimodal understanding in speech, audio, etc. While their performance on speech and audio processing tasks has been extensively studied, their reasoning abilities remain underexplored. We introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly.
- Score: 44.99833362998488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large audio-language models (LALMs) extend large language models with multimodal understanding in speech, audio, etc. While their performance on speech and audio processing tasks has been extensively studied, their reasoning abilities remain underexplored. In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio processing tasks, conversational abilities, and fairness, but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.
Related papers
- Multimodal Conversation Structure Understanding [12.29827265137757]
Large language models' ability to understand fine-grained conversational structure remains underexplored. We present a human-annotated dataset with 4,398 annotations for speakers and reply-to relationships, 5,755 addressees, and 3,142 side-participants. We evaluate popular audio-visual LLMs and vision-language models on our dataset, and our experimental results suggest that multimodal conversational structure understanding remains challenging.
arXiv Detail & Related papers (2025-05-23T06:41:54Z) - Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models [58.43486430996411]
Large Audio-Language Models (LALMs) have unlocked audio dialogue capabilities, where audio dialogues are a direct exchange of spoken language between LALMs and humans. Recent advances, such as GPT-4o, have enabled LALMs to hold back-and-forth audio dialogues with humans. We propose an Audio Dialogue Understanding Benchmark (ADU-Bench) to evaluate the performance of LALMs in open-ended audio dialogue understanding.
arXiv Detail & Related papers (2024-12-06T16:34:15Z) - Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z) - Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We introduce ReDial, a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We evaluate widely used models, including the GPT, Claude, Llama, Mistral, and Phi model families. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries.
arXiv Detail & Related papers (2024-10-14T18:44:23Z) - Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models [0.9285295512807729]
The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning. LALMs excel in general audio understanding but are limited in temporal reasoning. This paper addresses these challenges and limitations in audio temporal reasoning.
arXiv Detail & Related papers (2024-09-10T05:26:53Z) - Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models [38.64792118903994]
We evaluate gender bias in SILLMs across four semantics-related tasks.
Our analysis reveals that bias levels are language-dependent and vary with different evaluation methods.
arXiv Detail & Related papers (2024-07-09T15:35:43Z) - Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models [49.87432626548563]
We introduce methods to assess the extent of object hallucination of publicly available LALMs.
Our findings reveal that LALMs are comparable to specialized audio captioning models in their understanding of audio content.
We explore the potential of prompt engineering to enhance LALMs' performance on discriminative questions.
arXiv Detail & Related papers (2024-06-12T16:51:54Z) - Spoken Language Intelligence of Large Language Models for Language Learning [3.1964044595140217]
We focus on evaluating the efficacy of large language models (LLMs) in the realm of education. We introduce a new multiple-choice question dataset to evaluate the effectiveness of LLMs in the aforementioned scenarios. We also investigate the influence of various prompting techniques, such as zero- and few-shot methods. We find that models of different sizes have a good understanding of concepts in phonetics, phonology, and second language acquisition, but show limitations in reasoning about real-world problems.
arXiv Detail & Related papers (2023-08-28T12:47:41Z) - SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z) - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head [82.69233563811487]
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition.
We propose a multi-modal AI system named AudioGPT, which complements LLMs with foundation models to process complex audio information.
arXiv Detail & Related papers (2023-04-25T17:05:38Z)