AudioBench: A Universal Benchmark for Audio Large Language Models
- URL: http://arxiv.org/abs/2406.16020v4
- Date: Wed, 06 Nov 2024 01:49:36 GMT
- Title: AudioBench: A Universal Benchmark for Audio Large Language Models
- Authors: Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen
- Abstract summary: We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs).
It encompasses 8 distinct tasks and 26 datasets, 7 of which are newly proposed.
The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistics).
- Score: 41.46064884020139
- License:
- Abstract: We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, 7 of which are newly proposed. The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistics). Despite recent advances, there is no comprehensive benchmark for the instruction-following capabilities of AudioLLMs conditioned on audio signals. AudioBench addresses this gap by providing the datasets along with the desired evaluation metrics. In addition, we evaluated five popular models and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-sourced evaluation toolkit, data, and leaderboard will offer a robust testbed for future model development.
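The abstract describes the benchmark as a collection of tasks, datasets, and metrics for scoring instruction-following AudioLLMs. As a rough illustration only, a minimal evaluation loop over one such dataset might look like the sketch below; `model.generate`, `metric`, and the example fields are hypothetical placeholders and are not the actual AudioBench toolkit API.

```python
# Hypothetical sketch of scoring an AudioLLM on one instruction-following dataset.
# None of these names come from the AudioBench toolkit; they are placeholders.

def evaluate(model, dataset, metric):
    """Average a per-example metric over (audio, instruction, reference) triples."""
    scores = []
    for example in dataset:
        # The model is conditioned on both the raw audio and a text instruction.
        response = model.generate(audio=example["audio"],
                                  instruction=example["instruction"])
        # The metric compares the free-form response to a reference answer,
        # e.g. word error rate, accuracy, or a judge-model score.
        scores.append(metric(response, example["reference"]))
    return sum(scores) / len(scores)
```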
Related papers
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio LLM (MALLM) to capture audio context among multiple similar audio clips.
arXiv Detail & Related papers (2024-09-27T12:06:53Z) - A Suite for Acoustic Language Model Evaluation [20.802090523583196]
We introduce SALMon, a novel evaluation suite encompassing background noise, emotion, speaker identity and room impulse response.
We evaluate several speech language models on SALMon, thus highlighting the strengths and weaknesses of each evaluated method.
arXiv Detail & Related papers (2024-09-11T17:34:52Z) - PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores [18.26082503192707]
We develop a PEAVS (Perceptual Evaluation of Audio-Visual Synchrony) score, a novel automatic metric with a 5-point scale that evaluates the quality of audio-visual synchronization.
In our experiments, we observe a relative gain of 50% over a natural extension of Fréchet-based metrics for audio-visual synchrony.
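The Fréchet-based baseline referenced above builds on the Fréchet distance between Gaussians fitted to two sets of embeddings (as in Fréchet Audio/Inception Distance). A minimal sketch of that underlying computation follows; it is a generic illustration, not the PEAVS metric itself.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two embedding sets (n_samples, dim)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # sqrtm can return tiny imaginary parts from numerical error; keep the real part.
    covmean = sqrtm(cov_x @ cov_y).real
    return float(np.sum((mu_x - mu_y) ** 2) + np.trace(cov_x + cov_y - 2 * covmean))
```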
arXiv Detail & Related papers (2024-04-10T20:32:24Z) - AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in textual form.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
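Consistency between a judge model and human raters is typically quantified with a rank correlation over per-example scores. A small, generic sketch (the scores are made up for illustration and are not from the AIR-Bench paper):

```python
from scipy.stats import spearmanr

# Hypothetical per-example scores from a GPT-4 judge and from human raters.
judge_scores = [4.0, 3.5, 2.0, 5.0, 1.5, 4.5]
human_scores = [4.5, 3.0, 2.5, 5.0, 1.0, 4.0]

rho, p_value = spearmanr(judge_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```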
arXiv Detail & Related papers (2024-02-12T15:41:22Z) - Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
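The pipeline above feeds extracted multi-modality clues to an LLM that rewrites them into a fluent caption. A hypothetical sketch of assembling such a prompt (the clue names and wording are illustrative, not taken from the Auto-ACD paper):

```python
def build_caption_prompt(clues):
    """Turn extracted audio/visual clues into an LLM prompt for a single caption."""
    # 'clues' might hold tags produced by audio taggers, visual detectors, etc.
    clue_lines = "\n".join(f"- {name}: {value}" for name, value in clues.items())
    return (
        "Write one fluent, factual caption describing the audio, "
        "using only the clues below.\n" + clue_lines
    )

prompt = build_caption_prompt({
    "sound events": "dog barking, birds chirping",
    "visual scene": "suburban backyard",
})
```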
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification with AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z) - HEAR 2021: Holistic Evaluation of Audio Representations [55.324557862041985]
The HEAR 2021 NeurIPS challenge is to develop a general-purpose audio representation that provides a strong basis for learning.
HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music.
Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets.
arXiv Detail & Related papers (2022-03-06T18:13:09Z) - Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data [24.608764078208953]
Subgraphs are constructed by sampling the entire pool of available training data to exploit the relationship between labeled and unlabeled audio samples.
We evaluate our model on three benchmark audio databases, and two tasks: acoustic event detection and speech emotion recognition.
Our model is compact (240k parameters), and can produce generalized audio representations that are robust to different types of signal noise.
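A rough sketch of the subgraph idea, under the assumption that nodes are audio embeddings connected by cosine similarity (the construction details are illustrative, not taken from the paper):

```python
import numpy as np

def build_subgraph(embeddings, k=5, n_nodes=64, rng=None):
    """Sample a subgraph from the pooled labeled + unlabeled embeddings.

    Returns the sampled node indices and a k-nearest-neighbour adjacency
    matrix based on cosine similarity (illustrative construction only).
    'embeddings' is a (n_samples, dim) NumPy array.
    """
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(embeddings), size=n_nodes, replace=False)
    x = embeddings[idx]
    x = x / np.linalg.norm(x, axis=1, keepdims=True)   # unit-normalize rows
    sim = x @ x.T                                      # cosine similarity matrix
    adj = np.zeros_like(sim)
    for i in range(n_nodes):
        neighbours = np.argsort(sim[i])[-(k + 1):-1]   # top-k, excluding self
        adj[i, neighbours] = 1.0
    return idx, adj
```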
arXiv Detail & Related papers (2022-01-31T21:32:22Z)