VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
- URL: http://arxiv.org/abs/2510.11098v1
- Date: Mon, 13 Oct 2025 07:45:52 GMT
- Title: VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
- Authors: Jiliang Hu, Wenfu Wang, Zuchao Li, Chenxing Li, Yiyang Zhao, Hanzhao Li, Liqiang Zhang, Meng Yu, Dong Yu
- Abstract summary: Large audio language models (LALMs) have greatly enhanced multimodal conversational systems. Existing benchmarks are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation. We present Voice Chat Bot Bench (VCB Bench) -- a high-quality Chinese benchmark built entirely on real human speech.
- Score: 53.33704332801441
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited -- they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multiple dimensions. To address these gaps, we present Voice Chat Bot Bench (VCB Bench) -- a high-quality Chinese benchmark built entirely on real human speech. VCB Bench evaluates LALMs from three complementary perspectives: instruction following (including speech-level control beyond text commands), knowledge understanding (general knowledge, reasoning, and daily dialogue), and robustness (stability under perturbations in content, environment, and speaker traits). Experiments on representative LALMs reveal notable performance gaps and highlight future directions for improvement. VCB Bench provides a reproducible and fine-grained evaluation framework, offering standardized methodology and practical insights for advancing Chinese voice conversational models.
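To make the three-perspective evaluation concrete, the sketch below shows how per-dimension scores for a benchmark organized this way could be aggregated in Python. It is a minimal illustration under assumed names: the JSONL field names (`dimension`, `audio_path`, `reference`), the `model.respond` interface, and the `judge_response` scorer are hypothetical, not VCB Bench's published format or API.

```python
# Minimal sketch: aggregate scores per evaluation dimension (instruction
# following, knowledge understanding, robustness). Field names and the
# model/judge interfaces are hypothetical, not VCB Bench's release format.
import json
from collections import defaultdict
from statistics import mean

def judge_response(prediction: str, reference: str) -> float:
    """Placeholder scorer; a real harness would apply the benchmark's own
    judging protocol (e.g., an LLM judge or compliance checks)."""
    return 1.0 if reference.lower() in prediction.lower() else 0.0

def evaluate(model, examples: list[dict]) -> dict[str, float]:
    """Group examples by dimension and report the mean score of each."""
    scores = defaultdict(list)
    for ex in examples:
        prediction = model.respond(ex["audio_path"])  # assumed model API
        scores[ex["dimension"]].append(judge_response(prediction, ex["reference"]))
    return {dim: mean(vals) for dim, vals in scores.items()}

if __name__ == "__main__":
    # Hypothetical JSONL export, one example per line.
    with open("vcb_bench.jsonl", encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]
    # report = evaluate(my_lalm, examples)  # plug in a LALM wrapper here
```

For the robustness dimension, a real harness would additionally break scores out by perturbation condition (content, environment, speaker traits) rather than reporting a single mean.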
Related papers
- Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B end-to-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Covo-Audio-Chat, a dialogue-oriented variant, demonstrates strong semantic spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z)
- VoiceAgentBench: Are Voice Assistants ready for agentic tasks? [5.639970295197759]
We introduce VoiceAgentBench, a benchmark to evaluate SpeechLMs in realistic spoken agentic settings. It comprises over 5,500 synthetic spoken queries grounded in the Indian context. It measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness; a minimal tool-call scoring sketch along these axes appears after this list.
arXiv Detail & Related papers (2025-10-09T09:11:38Z)
- VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing [45.15289852736435]
VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio. Among the key findings, proprietary models do not universally outperform open-source models.
arXiv Detail & Related papers (2025-09-26T17:59:59Z)
- AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including two new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
- MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions [70.93364531054273]
We introduce MultiVox, the first benchmark to evaluate the ability of voice assistants to integrate spoken and visual cues. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features. Our evaluation on 10 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
arXiv Detail & Related papers (2025-07-14T23:20:42Z)
- URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models [8.882948576463244]
We propose URO-Bench, an extensive benchmark for speech-to-speech (S2S) scenarios. URO-Bench is the first S2S benchmark that covers evaluation of multilingualism, multi-round dialogues, and paralinguistics. Our benchmark is divided into two difficulty levels, a basic track and a pro track, each comprising 20 test sets.
arXiv Detail & Related papers (2025-02-25T03:31:48Z)
- Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models [58.43486430996411]
Large Audio-Language Models (LALMs) have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. We propose an Audio Dialogue Understanding Benchmark (ADU-Bench) to evaluate the performance of LALMs in open-ended audio dialogue understanding. ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs.
arXiv Detail & Related papers (2024-12-06T16:34:15Z)
- VoiceBench: Benchmarking LLM-Based Voice Assistants [58.84144494938931]
We introduce VoiceBench, the first benchmark to evaluate voice assistants based on large language models (LLMs). VoiceBench includes both real and synthetic spoken instructions that incorporate three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.
arXiv Detail & Related papers (2024-10-22T17:15:20Z)
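As flagged in the VoiceAgentBench entry above, the sketch below illustrates one way to score agentic tool calls along the axes that entry names: structural consistency, tool selection, and argument correctness. The JSON call format and field names are assumptions for illustration, not VoiceAgentBench's actual schema.

```python
# Minimal sketch of tool-call scoring. Each axis is gated on the previous
# one: a malformed call earns no tool or argument credit, and a wrong tool
# earns no argument credit. The {"tool", "arguments"} schema is hypothetical.
import json

def score_tool_call(predicted: str, gold: dict) -> dict:
    """Score a predicted tool call (a JSON string) against a gold call."""
    try:
        call = json.loads(predicted)
    except json.JSONDecodeError:
        return {"structure": 0.0, "tool": 0.0, "arguments": 0.0}
    structure = float(isinstance(call, dict) and {"tool", "arguments"} <= call.keys())
    tool = float(structure and call["tool"] == gold["tool"])
    arguments = float(tool and call["arguments"] == gold["arguments"])
    return {"structure": structure, "tool": tool, "arguments": arguments}

# Example: a correct call scores 1.0 on all three axes.
gold = {"tool": "set_alarm", "arguments": {"time": "07:00"}}
print(score_tool_call('{"tool": "set_alarm", "arguments": {"time": "07:00"}}', gold))
```

Averaging each axis over a query set yields tool selection accuracy and invocation correctness figures of the kind such a benchmark reports; adversarial robustness would then compare these averages on perturbed versus clean queries.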