UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models
- URL: http://arxiv.org/abs/2601.01373v1
- Date: Sun, 04 Jan 2026 04:54:12 GMT
- Title: UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models
- Authors: Qundong Shi, Jie Zhou, Biyuan Lin, Junbo Cui, Guoyang Zeng, Yixuan Zhou, Ziyang Wang, Xin Liu, Zhen Luo, Yudong Wang, Zhiyuan Liu
- Abstract summary: UltraEval-Audio is a unified evaluation framework for audio foundation models. It supports 10 languages and 14 core task categories, while seamlessly integrating 24 mainstream models and 36 authoritative benchmarks. It also adopts a novel comprehensive evaluation scheme for audio codecs, evaluating performance across three key dimensions.
- Score: 36.71750531005594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of audio foundation models has accelerated rapidly since the emergence of GPT-4o. However, the lack of comprehensive evaluation has become a critical bottleneck for further progress in the field, particularly in audio generation. Current audio evaluation faces three major challenges: (1) audio evaluation lacks a unified framework, with datasets and code scattered across various sources, hindering fair and efficient cross-model comparison; (2) audio codecs, as a key component of audio foundation models, lack a widely accepted and holistic evaluation methodology; (3) existing speech benchmarks are heavily reliant on English, making it challenging to objectively assess models' performance on Chinese. To address the first issue, we introduce UltraEval-Audio, a unified evaluation framework for audio foundation models, specifically designed for both audio understanding and generation tasks. UltraEval-Audio features a modular architecture, supporting 10 languages and 14 core task categories, while seamlessly integrating 24 mainstream models and 36 authoritative benchmarks. To enhance research efficiency, the framework provides a one-command evaluation feature, accompanied by real-time public leaderboards. For the second challenge, UltraEval-Audio adopts a novel comprehensive evaluation scheme for audio codecs, evaluating performance across three key dimensions: semantic accuracy, timbre fidelity, and acoustic quality. To address the third issue, we propose two new Chinese benchmarks, SpeechCMMLU and SpeechHSK, designed to assess Chinese knowledge proficiency and language fluency. We hope that UltraEval-Audio will provide both academia and industry with a transparent, efficient, and fair platform for comparing audio models. Our code, benchmarks, and leaderboards are available at https://github.com/OpenBMB/UltraEval-Audio.
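The abstract's codec scheme scores reconstructed audio along three axes: semantic accuracy, timbre fidelity, and acoustic quality. As an illustration only (the abstract does not specify the exact metrics, and none of the function names below come from the UltraEval-Audio codebase), a minimal sketch might use word error rate on re-transcribed speech, cosine similarity between speaker embeddings, and signal-to-noise ratio as stand-ins for the three dimensions:

```python
import math

def wer(ref_words, hyp_words):
    """Word error rate via edit distance -- a common proxy for semantic accuracy,
    computed on transcripts of the original vs. reconstructed audio."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(ref_words), 1)

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings -- a proxy for timbre fidelity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def snr_db(reference, reconstructed):
    """Signal-to-noise ratio in dB -- a crude stand-in for acoustic quality."""
    signal = sum(s * s for s in reference)
    noise = sum((s - r) ** 2 for s, r in zip(reference, reconstructed))
    return 10 * math.log10(signal / noise)
```

A real pipeline would feed the transcripts from an ASR model, the embeddings from a speaker encoder, and would likely replace raw SNR with a perceptual measure such as PESQ; the sketch only makes the three-dimension decomposition concrete.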
Related papers
- AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech [56.08149157180447]
We introduce AudioCapBench, a benchmark for evaluating the audio captioning capabilities of large multimodal models. We evaluate 13 models across two providers (OpenAI, Google Gemini) using both reference-based metrics (METEOR, BLEU, ROUGE-L) and an LLM-as-Judge framework.
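As context for the reference-based metrics named above, ROUGE-L scores a candidate caption by the longest common subsequence (LCS) it shares with a reference, combining LCS precision and recall into an F1 score. A minimal self-contained sketch of the idea (not AudioCapBench's implementation):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1: harmonic mean of LCS precision and recall over word tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because LCS tolerates gaps but preserves word order, ROUGE-L rewards captions that cover the reference's content in sequence without requiring exact n-gram matches, which is why it is commonly paired with BLEU and METEOR.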
arXiv Detail & Related papers (2026-02-27T03:33:37Z) - Eureka-Audio: Triggering Audio Intelligence in Compact Language Models [28.38037427018435]
We present Eureka-Audio, a compact yet high-performance audio language model that achieves competitive performance against larger models. Despite containing only 1.7B parameters, Eureka-Audio demonstrates strong performance on automatic speech recognition (ASR), audio understanding, and dense audio captioning. To further enhance paralinguistic reasoning, we introduce DataFlux, a closed-loop audio instruction data synthesis and verification pipeline.
arXiv Detail & Related papers (2026-02-15T02:01:08Z) - VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing [45.15289852736435]
VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio. Results reveal three key findings, among them that proprietary models do not universally outperform open-source models.
arXiv Detail & Related papers (2025-09-26T17:59:59Z) - AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z) - Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-audio detection framework and benchmark. It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation-model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z) - AudioBench: A Universal Benchmark for Audio Large Language Models [41.46064884020139]
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, among which 7 are newly proposed. The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistics).
arXiv Detail & Related papers (2024-06-23T05:40:26Z) - AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z) - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.