The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data
- URL: http://arxiv.org/abs/2403.14048v1
- Date: Thu, 21 Mar 2024 00:13:59 GMT
- Title: The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data
- Authors: Alice Baird, Rachel Manzelli, Panagiotis Tzirakis, Chris Gagne, Haoqi Li, Sadie Allen, Sander Dieleman, Brian Kulis, Shrikanth S. Narayanan, Alan Cowen
- Abstract summary: The NeurIPS 2023 Machine Learning for Audio Workshop brings together machine learning (ML) experts from various audio domains.
There are several valuable audio-driven ML tasks, from speech emotion recognition to audio event detection, but the community is sparse compared to other ML areas.
High-quality data collection is time-consuming and costly, making it challenging for academic groups to apply their often state-of-the-art strategies to a larger, more generalizable dataset.
- Score: 28.23517306589778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The NeurIPS 2023 Machine Learning for Audio Workshop brings together machine learning (ML) experts from various audio domains. There are several valuable audio-driven ML tasks, from speech emotion recognition to audio event detection, but the community is sparse compared to other ML areas, e.g., computer vision or natural language processing. A major limitation with audio is the available data; with audio being a time-dependent modality, high-quality data collection is time-consuming and costly, making it challenging for academic groups to apply their often state-of-the-art strategies to a larger, more generalizable dataset. In this short white paper, to encourage researchers with limited access to large datasets, the organizers first outline several open-source datasets that are available to the community and, for the duration of the workshop, make several proprietary datasets available: three vocal datasets, Hume-Prosody, Hume-VocalBurst, and the acted emotional speech dataset Modulate-Sonata, as well as an in-game streamer dataset, Modulate-Stream. We outline the current baselines on these datasets but encourage researchers from across audio to utilize them beyond the initial baseline tasks.
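As a concrete starting point for the emotional speech datasets mentioned above, a minimal baseline sketch is given below. The manifest format, label scheme, and feature choice are illustrative assumptions, not the workshop's official benchmark protocol.

```python
# Hypothetical speech-emotion-recognition baseline: mean-pooled log-mel
# features plus a linear classifier. Dataset access and file layout are
# assumptions; substitute the actual manifest of whichever dataset is used.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def clip_embedding(path, sr=16000, n_mels=64):
    """Deliberately simple clip-level representation: time-averaged log-mels."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-6).mean(axis=1)

def run_baseline(files, labels):
    """files/labels stand in for a manifest of clips and categorical emotions."""
    X = np.stack([clip_embedding(f) for f in files])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```

Anything stronger (pretrained speech encoders, sequence models) should beat this; the point is only to fix an end-to-end evaluation loop before iterating on features.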
Related papers
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
arXiv Detail & Related papers (2024-09-27T12:06:53Z)
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking [19.754211231250544]
We develop cascading and end-to-end models, train them with our synthetic audio dataset, and test them on actual human speech data.
Experimental results showed that models trained solely on synthetic datasets can generalize to human voice data.
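To make the synthetic-data step concrete, here is a minimal sketch under stated assumptions: transcripts are rendered to speech with an off-the-shelf TTS engine (pyttsx3 purely for illustration; the paper does not prescribe a backend) and paired with their dialogue-state labels.

```python
# Hypothetical synthetic-data builder for dialogue state tracking: render each
# transcript line to a WAV file and keep its DST label alongside it.
from pathlib import Path
import pyttsx3

def build_synthetic_set(transcripts, out_dir="synthetic_dst"):
    """transcripts: iterable of (utterance_text, dialogue_state_label) pairs."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    engine = pyttsx3.init()
    manifest = []
    for i, (utterance, state) in enumerate(transcripts):
        wav = out / f"utt_{i:05d}.wav"
        engine.save_to_file(utterance, str(wav))  # queue synthesis of this line
        manifest.append((wav, state))             # audio path paired with its label
    engine.runAndWait()                           # flush the synthesis queue
    return manifest  # train on these clips; evaluate on real human speech
```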
arXiv Detail & Related papers (2023-12-04T12:25:46Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
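A rough sketch of that clue-to-caption step follows; the clue fields and prompt wording are illustrative, and call_llm stands in for whichever language model is actually used.

```python
# Hypothetical clue-to-caption step: pack multi-modal tags for one clip into a
# prompt and let an LLM write a single fluent caption.
def caption_from_clues(clues: dict, call_llm) -> str:
    prompt = (
        "Write one natural-sounding caption for an audio clip, using these clues.\n"
        f"Detected sounds: {', '.join(clues.get('audio_tags', []))}\n"
        f"Visual scene: {clues.get('scene', 'unknown')}\n"
        f"Original noisy description: {clues.get('raw_text', '')}\n"
        "Caption:"
    )
    return call_llm(prompt).strip()

# Example with hypothetical clues for one clip:
# caption_from_clues({"audio_tags": ["dog barking", "wind"], "scene": "park"},
#                    call_llm=my_model)
```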
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information [21.864200803678003]
This work establishes MAVD, a new large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by 64 native Chinese speakers.
To ensure the dataset covers diverse real-world scenarios, a pipeline for cleaning and filtering the raw text material has been developed.
In particular, Microsoft's latest data acquisition device, the Azure Kinect, is used to capture depth information in addition to traditional audio signals and RGB images.
arXiv Detail & Related papers (2023-06-04T05:00:12Z)
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
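A toy version of such a filter-then-rewrite pipeline might look as follows; the heuristics and prompt are illustrative guesses rather than the paper's actual rules, and call_llm stands in for the ChatGPT call.

```python
# Sketch of stages 2-3 of a caption-cleaning pipeline: rule-based filtering of
# harvested descriptions, then LLM rewriting of the survivors.
import re

def rule_filter(desc: str) -> bool:
    """Stage 2 (assumed heuristics): drop too-short, too-long, or link-laden text."""
    words = desc.split()
    if not (3 <= len(words) <= 60):
        return False
    if re.search(r"https?://|www\.", desc):  # URLs usually signal scraped noise
        return False
    return True

def rewrite(desc: str, call_llm) -> str:
    """Stage 3: ask an LLM to turn a noisy description into one clean caption."""
    prompt = ("Rewrite the following noisy audio description as one short, "
              f"factual caption describing only the sound:\n{desc}\nCaption:")
    return call_llm(prompt).strip()

def process(raw_descriptions, call_llm):
    """Stage 1 (harvesting) is assumed done; filter then rewrite what remains."""
    return [rewrite(d, call_llm) for d in raw_descriptions if rule_filter(d)]
```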
arXiv Detail & Related papers (2023-03-30T14:07:47Z)
- BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping [19.071463356974387]
This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets.
We present a novel training framework that produces a hybrid audio representation combining handcrafted and data-driven learned audio features (a minimal sketch of the combination follows below).
All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks.
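A minimal sketch of the combination step, assuming MFCC statistics as the handcrafted half and any pretrained encoder as the learned half (the actual BYOL-S features differ):

```python
# Hybrid representation sketch: concatenate a handcrafted spectral summary
# with an embedding from a pretrained encoder (placeholder callable).
import numpy as np
import librosa

def handcrafted(y, sr=16000):
    """Handcrafted half: per-clip MFCC mean/std summary (40-dim here)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def hybrid_embedding(y, encoder, sr=16000):
    """Learned half comes from `encoder`, standing in for a bootstrapped model."""
    return np.concatenate([handcrafted(y, sr), np.asarray(encoder(y))])
```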
arXiv Detail & Related papers (2022-06-24T02:26:40Z)
- HEAR 2021: Holistic Evaluation of Audio Representations [55.324557862041985]
The HEAR 2021 NeurIPS challenge asks participants to develop a general-purpose audio representation that provides a strong basis for learning.
HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music.
Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets.
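Submissions were scored through a common embedding API; the toy module below sketches roughly that shape (signatures paraphrased from the public HEAR API, with a trivial framing-based placeholder instead of a real model).

```python
# Minimal HEAR-style submission module: expose load_model plus scene- and
# timestamp-level embedding functions. The "model" here is a placeholder.
import torch

class TrivialModel(torch.nn.Module):
    sample_rate = 16000
    scene_embedding_size = 64
    timestamp_embedding_size = 64

def load_model(model_file_path: str = "") -> TrivialModel:
    return TrivialModel()

def get_timestamp_embeddings(audio: torch.Tensor, model):
    # audio: (n_sounds, n_samples); frame into non-overlapping 1024-sample windows.
    frames = audio.unfold(1, 1024, 1024)                # (n_sounds, n_frames, 1024)
    emb = frames[:, :, : model.timestamp_embedding_size]
    hop_ms = 1024.0 / model.sample_rate * 1000.0
    times = torch.arange(emb.shape[1]) * hop_ms         # frame offsets in ms
    return emb, times.expand(emb.shape[0], -1)

def get_scene_embeddings(audio: torch.Tensor, model) -> torch.Tensor:
    emb, _ = get_timestamp_embeddings(audio, model)
    return emb.mean(dim=1)                              # pool over time per clip
```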
arXiv Detail & Related papers (2022-03-06T18:13:09Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that, despite the data being automatically constructed, self-supervised models trained on it achieve downstream performance similar to models trained on existing video datasets of similar scale.
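The core selection step can be approximated as below: score each clip's audio-visual agreement with any pair of encoders and keep the top fraction. The cosine-similarity scoring and the 10% keep-rate are illustrative assumptions, not the paper's exact optimization.

```python
# Correspondence-based curation sketch: rank clips by agreement between their
# (precomputed) audio and video embeddings, then keep the best-matching subset.
import numpy as np

def av_scores(audio_emb: np.ndarray, video_emb: np.ndarray) -> np.ndarray:
    """Per-clip cosine similarity between L2-normalized embedding rows."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    return (a * v).sum(axis=1)

def curate(clip_ids, audio_emb, video_emb, keep_frac=0.10):
    scores = av_scores(audio_emb, video_emb)
    k = max(1, int(len(clip_ids) * keep_frac))
    keep = np.argsort(scores)[::-1][:k]   # highest-correspondence clips first
    return [clip_ids[i] for i in keep]
```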
arXiv Detail & Related papers (2021-01-26T14:27:47Z)