WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation
- URL: http://arxiv.org/abs/2506.21875v2
- Date: Thu, 31 Jul 2025 09:23:52 GMT
- Title: WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation
- Authors: Jian Zhang, Linhao Zhang, Bokai Lei, Chuhan Wu, Wei Jia, Xiao Zhou,
- Abstract summary: We present a novel approach to thoroughly evaluate Audio Large Language Models (LLMs) in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We conduct comprehensive testing and detailed analysis of various mainstream speech models, revealing significant differences in model performance across different speech scenarios.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities in direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech's unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we present a novel approach to thoroughly evaluate LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation method that uses customized evaluation checklists and prompts to enhance the accuracy of automatic evaluation. We conduct comprehensive testing and detailed analysis of various mainstream speech models, revealing significant differences in model performance across different speech scenarios. The use of query-aware evaluation further enables a finer-grained assessment under various speech-specific scenarios. Our benchmark can provide valuable insights for speech model development and evaluation.
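To make the query-aware evaluation idea concrete, below is a minimal sketch of how a per-query checklist could be folded into an automatic LLM-judge prompt. This is an assumption-based illustration, not the paper's actual pipeline: names such as `SpeechEvalItem`, `build_judge_prompt`, and the `judge_llm` callable are hypothetical placeholders for whatever judge model and data schema are actually used.

```python
# Hypothetical sketch of query-aware automatic evaluation:
# each benchmark item carries its own checklist, and the judge prompt
# is composed from that checklist rather than from a generic rubric.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SpeechEvalItem:
    query_text: str        # transcript of the spoken user query
    checklist: List[str]   # query-specific criteria (e.g. handles a homophone)
    reference_notes: str = ""  # optional notes on expected behavior


def build_judge_prompt(item: SpeechEvalItem, model_response: str) -> str:
    """Compose a judge prompt from the per-query checklist."""
    criteria = "\n".join(f"- {c}" for c in item.checklist)
    return (
        "You are evaluating a spoken-assistant response.\n"
        f"User query (transcribed): {item.query_text}\n"
        f"Model response: {model_response}\n"
        "Rate the response against each criterion below, then give an overall score.\n"
        f"Criteria:\n{criteria}\n"
        f"Notes: {item.reference_notes}"
    )


def evaluate(item: SpeechEvalItem, model_response: str,
             judge_llm: Callable[[str], str]) -> str:
    """Run a user-supplied judge model on the composed prompt."""
    return evaluate_prompt(item, model_response, judge_llm)


def evaluate_prompt(item: SpeechEvalItem, model_response: str,
                    judge_llm: Callable[[str], str]) -> str:
    return judge_llm(build_judge_prompt(item, model_response))


# Example usage with a trivial stand-in judge.
item = SpeechEvalItem(
    query_text="Could you repeat that address, I mean the street one?",
    checklist=[
        "Resolves the self-correction ('I mean ...') correctly",
        "Response is concise enough to be spoken aloud",
    ],
)
print(evaluate(item, "Sure, the street address is 42 Elm Road.",
               lambda prompt: "Overall: 8/10"))
```

The design point illustrated here is that the checklist travels with the query, so speech-specific pitfalls (homophones, disfluencies, stuttering, prosody-dependent intent) can be scored explicitly instead of relying on a one-size-fits-all judging prompt.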
Related papers
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models [60.72029578488467]
SpeechR is a unified benchmark for evaluating reasoning over speech in large audio-language models. It evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities.
arXiv Detail & Related papers (2025-08-04T03:28:04Z) - TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios [47.08170350061827]
Spoken language models (SLMs) have seen rapid progress in recent years, along with the development of numerous benchmarks for evaluating their performance. Most existing benchmarks primarily focus on evaluating whether SLMs can perform complex tasks comparable to those tackled by large language models (LLMs). We propose a benchmark specifically designed to evaluate SLMs' effectiveness as conversational agents in realistic Chinese interactive settings.
arXiv Detail & Related papers (2025-07-24T03:23:55Z) - Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models [49.1574468325115]
We introduce Speech-IFeval, an evaluation framework designed to assess instruction-following capabilities. Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs.
arXiv Detail & Related papers (2025-05-25T08:37:55Z) - A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations [112.81207927088117]
PersonaConvBench is a benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements.
arXiv Detail & Related papers (2025-05-20T09:13:22Z) - QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions [45.34333059156364]
We introduce QualiSpeech, a comprehensive low-level speech quality assessment dataset. We also propose the QualiSpeech Benchmark to evaluate the low-level speech understanding capabilities of auditory large language models.
arXiv Detail & Related papers (2025-03-26T07:32:20Z) - VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models [32.086847480051084]
We present VoxEval, a novel SpeechQA benchmark that assesses knowledge understanding through pure speech interactions. Our benchmark 1) maintains speech format for both inputs and outputs, 2) evaluates model robustness across diverse input audio conditions, and 3) pioneers the assessment of complex tasks like mathematical reasoning in spoken format.
arXiv Detail & Related papers (2025-01-09T04:30:12Z) - Classification of Spontaneous and Scripted Speech for Multilingual Audio [9.925703861731506]
Distinguishing scripted from spontaneous speech is an essential tool for better understanding how speech styles influence speech processing research. This paper addresses the challenge of building a classifier that generalises well across different formats and languages. We systematically evaluate models ranging from traditional, handcrafted acoustic and prosodic features to advanced audio transformers.
arXiv Detail & Related papers (2024-12-16T15:45:10Z) - Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments. We use WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. Experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.