HEAR 2021: Holistic Evaluation of Audio Representations
- URL: http://arxiv.org/abs/2203.03022v1
- Date: Sun, 6 Mar 2022 18:13:09 GMT
- Title: HEAR 2021: Holistic Evaluation of Audio Representations
- Authors: Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W.
Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel
Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian
Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon,
Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin and Yonatan Bisk
- Abstract summary: The HEAR 2021 NeurIPS challenge is to develop a general-purpose audio representation that provides a strong basis for learning.
HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music.
Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets.
- Score: 55.324557862041985
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: What audio embedding approach generalizes best to a wide range of downstream
tasks across a variety of everyday domains without fine-tuning? The aim of the
HEAR 2021 NeurIPS challenge is to develop a general-purpose audio
representation that provides a strong basis for learning in a wide variety of
tasks and scenarios. HEAR 2021 evaluates audio representations using a
benchmark suite across a variety of domains, including speech, environmental
sound, and music. In the spirit of shared exchange, each participant submitted
an audio embedding model following a common API that is general-purpose,
open-source, and freely available to use. Twenty-nine models by thirteen
external teams were evaluated on nineteen diverse downstream tasks derived from
sixteen datasets. Open evaluation code, submitted models and datasets are key
contributions, enabling comprehensive and reproducible evaluation, as well as
previously impossible longitudinal studies. It remains an open question
whether a single general-purpose audio representation can perform as
holistically as the human ear.
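
As context for the "common API" mentioned in the abstract, the sketch below illustrates what a submission-style embedding module could look like. The function names (load_model, get_scene_embeddings, get_timestamp_embeddings), the model attributes, and the toy MelBaseline encoder are illustrative assumptions based on the abstract's description, not the official HEAR specification; consult the challenge documentation for the authoritative interface.

```python
# Hedged sketch of a HEAR-style embedding module built around a trivial
# log-mel "model". Names follow the common-API pattern described in the
# abstract; exact signatures are assumptions, not the official spec.
from typing import Tuple

import torch
import torchaudio


class MelBaseline(torch.nn.Module):
    """Toy embedding model: log-mel frames, mean-pooled for scene embeddings."""

    sample_rate = 16000            # input rate the model expects
    timestamp_embedding_size = 64  # per-frame embedding width
    scene_embedding_size = 64      # per-clip embedding width

    def __init__(self) -> None:
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=self.sample_rate, n_fft=400, hop_length=160, n_mels=64
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (n_sounds, n_samples) -> (n_sounds, n_frames, n_mels)
        return torch.log(self.mel(audio) + 1e-6).transpose(1, 2)


def load_model(model_file_path: str = "") -> MelBaseline:
    # A real submission would restore trained weights from model_file_path.
    return MelBaseline()


def get_timestamp_embeddings(
    audio: torch.Tensor, model: MelBaseline
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Per-frame embeddings plus their frame times in milliseconds."""
    embeddings = model(audio)
    n_frames = embeddings.shape[1]
    hop_ms = 1000.0 * 160 / model.sample_rate  # 160 = hop_length above
    timestamps = (torch.arange(n_frames) * hop_ms).expand(audio.shape[0], -1)
    return embeddings, timestamps


def get_scene_embeddings(audio: torch.Tensor, model: MelBaseline) -> torch.Tensor:
    """One fixed-size embedding per clip, via mean pooling over time."""
    embeddings, _ = get_timestamp_embeddings(audio, model)
    return embeddings.mean(dim=1)
```

In this setting, an evaluator can call get_scene_embeddings on a frozen model and train only a shallow downstream classifier per task, which matches the "without fine-tuning" framing in the abstract.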
Related papers
- Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks [112.7791602217381]
We present Dynamic-SUPERB Phase-2, an open benchmark for the comprehensive evaluation of instruction-based universal speech models.
Building upon the first generation, this second version incorporates 125 new tasks, expanding the benchmark to a total of 180 tasks.
Evaluation results indicate that none of the models performed well universally.
arXiv Detail & Related papers (2024-11-08T06:33:22Z)
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
arXiv Detail & Related papers (2024-09-27T12:06:53Z)
- AudioBench: A Universal Benchmark for Audio Large Language Models [41.46064884020139]
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs).
It encompasses 8 distinct tasks and 26 datasets, of which 7 are newly proposed.
The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice (paralinguistic) understanding.
arXiv Detail & Related papers (2024-06-23T05:40:26Z)
- The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data [28.23517306589778]
The NeurIPS 2023 Machine Learning for Audio Workshop brings together machine learning (ML) experts from various audio domains.
There are several valuable audio-driven ML tasks, from speech emotion recognition to audio event detection, but the community is sparse compared to other ML areas.
High-quality data collection is time-consuming and costly, making it challenging for academic groups to apply their often state-of-the-art strategies to larger, more generalizable datasets.
arXiv Detail & Related papers (2024-03-21T00:13:59Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification with AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
- BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping [19.071463356974387]
This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets.
We present a novel training framework that produces a hybrid audio representation, combining handcrafted and data-driven learned audio features.
All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks.
arXiv Detail & Related papers (2022-06-24T02:26:40Z)
- Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification [4.191965713559235]
This paper presents a multi-modal model for automatic scene classification.
It exploits auditory and visual information simultaneously.
It has been shown to provide an excellent trade-off between prediction performance and system complexity.
arXiv Detail & Related papers (2021-07-28T06:10:10Z)