AIR-Bench: Benchmarking Large Audio-Language Models via Generative
Comprehension
- URL: http://arxiv.org/abs/2402.07729v1
- Date: Mon, 12 Feb 2024 15:41:22 GMT
- Title: AIR-Bench: Benchmarking Large Audio-Language Models via Generative
Comprehension
- Authors: Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou,
Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou
- Abstract summary: We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
- Score: 98.69691822391069
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, instruction-following audio-language models have received broad
attention for human-audio interaction. However, the absence of benchmarks
capable of evaluating audio-centric interaction capabilities has impeded
advancements in this field. Previous benchmarks primarily focus on assessing
different fundamental tasks, such as Automatic Speech Recognition (ASR), and
lack an assessment of the open-ended generative capabilities centered around
audio. Thus, it is challenging to track the progression in the Large
Audio-Language Models (LALMs) domain and to provide guidance for future
improvement. In this paper, we introduce AIR-Bench (Audio InstRuction
Benchmark), the first benchmark designed
to evaluate the ability of LALMs to understand various types of audio signals
(including human speech, natural sounds, and music), and furthermore, to
interact with humans in textual format. AIR-Bench encompasses two
dimensions: foundation and chat benchmarks. The former consists of 19 tasks
with approximately 19k single-choice questions, intended to inspect the basic
single-task ability of LALMs. The latter contains 2k
instances of open-ended question-and-answer data, directly assessing the
comprehension of the model on complex audio and its capacity to follow
instructions. Both benchmarks require the model to generate hypotheses
directly. We design a unified framework that leverages advanced language
models, such as GPT-4, to score the generated hypotheses given the
meta-information of the audio. Experimental results demonstrate a high level of
consistency between GPT-4-based evaluation and human evaluation. By revealing
the limitations of existing LALMs through evaluation results, AIR-Bench can
provide insights into the direction of future research.
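As a rough illustration of the unified evaluation framework described above, the sketch below shows one way such GPT-4-based judging could be wired up. It is a minimal sketch, not the authors' implementation: the prompt wording, the 1-10 scale, and the example meta-information are all assumptions. Because the judge sees only the audio's meta-information, a text-only model like GPT-4 can score answers about audio it cannot hear.
```python
# Minimal sketch of a GPT-4-as-judge scoring step (not the AIR-Bench code).
# The prompt wording, 1-10 scale, and field names below are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_hypothesis(meta_info: str, question: str, hypothesis: str) -> str:
    """Ask GPT-4 to rate a model's answer given only the audio's meta-information."""
    prompt = (
        "You are evaluating an audio-language model. You cannot hear the "
        "audio, but you are given its meta-information.\n"
        f"Audio meta-information: {meta_info}\n"
        f"Question: {question}\n"
        f"Model answer: {hypothesis}\n"
        "Rate the answer's correctness and helpfulness on a 1-10 scale. "
        "Reply with the score only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring
    )
    return response.choices[0].message.content


# Hypothetical chat-benchmark instance:
print(judge_hypothesis(
    meta_info="Genre: jazz; instruments: piano, upright bass; tempo ~120 BPM",
    question="What instruments are playing, and what is the mood?",
    hypothesis="A piano and a double bass play a relaxed, swinging jazz tune.",
))
```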
Related papers
- AudioBench: A Universal Benchmark for Audio Large Language Models [41.46064884020139]
We introduce AudioBench, a new benchmark designed to evaluate audio large language models (AudioLLMs).
AudioBench encompasses 8 distinct tasks and 26 carefully selected or newly curated datasets, focusing on speech understanding, voice interpretation, and audio scene understanding.
arXiv Detail & Related papers (2024-06-23T05:40:26Z)
- UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- FALL-E: A Foley Sound Synthesis Model and Strategies [0.5599792629509229]
The FALL-E model employs a cascaded approach comprising low-resolution spectrogram generation, spectrogram super-resolution, and a vocoder.
We conditioned the model with dataset-specific texts, enabling it to learn sound quality and recording environment based on text input.
arXiv Detail & Related papers (2023-06-16T12:44:10Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)
- SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech [44.68649535280397]
We propose a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE).
SLUE consists of limited-size labeled training sets and corresponding evaluation sets.
We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets.
We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.
arXiv Detail & Related papers (2021-11-19T18:59:23Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native automatic speech scoring (ASS), called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)