Sound Check: Auditing Audio Datasets
- URL: http://arxiv.org/abs/2410.13114v1
- Date: Thu, 17 Oct 2024 00:51:27 GMT
- Title: Sound Check: Auditing Audio Datasets
- Authors: William Agnew, Julia Barnett, Annie Chu, Rachel Hong, Michael Feffer, Robin Netzorg, Harry H. Jiang, Ezra Awumey, Sauvik Das
- Abstract summary: Generative audio models are rapidly advancing in both capabilities and public utilization.
We conducted a literature review of hundreds of audio datasets and selected seven of the most prominent to audit.
We found that these datasets are biased against women, contain toxic stereotypes about marginalized communities, and contain significant amounts of copyrighted work.
- Score: 4.955141080136429
- Abstract: Generative audio models are rapidly advancing in both capabilities and public utilization -- several powerful generative audio models have readily available open weights, and some tech companies have released high quality generative audio products. Yet, while prior work has enumerated many ethical issues stemming from the data on which generative visual and textual models have been trained, we have little understanding of similar issues with generative audio datasets, including those related to bias, toxicity, and intellectual property. To bridge this gap, we conducted a literature review of hundreds of audio datasets and selected seven of the most prominent to audit in more detail. We found that these datasets are biased against women, contain toxic stereotypes about marginalized communities, and contain significant amounts of copyrighted work. To enable artists to see if they are in popular audio datasets and to facilitate exploration of the contents of these datasets, we developed an audio dataset exploration web tool, available at https://audio-audit.vercel.app.
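To make the idea of an audit concrete, below is a minimal, hypothetical sketch of one metadata-level signal such an audit might compute: counting gendered terms in caption text to estimate representation skew. The caption list and term sets are illustrative assumptions, not the paper's actual method or data.

```python
# Hypothetical sketch of one simple audit signal: counting gendered terms
# in caption/metadata text to estimate representation skew. The paper's
# actual audit covers bias, toxicity, and copyright in far more depth;
# this only illustrates the shape of a metadata-level check.
from collections import Counter

# Assumed toy metadata; in practice this would be loaded from a dataset's
# caption or transcript files.
captions = [
    "a man speaking over traffic noise",
    "a man shouting in a stadium",
    "a woman singing softly",
    "children playing in a park",
]

GENDER_TERMS = {
    "female": {"woman", "women", "girl", "she", "her", "female"},
    "male": {"man", "men", "boy", "he", "his", "male"},
}

def gender_mentions(texts):
    """Count how many captions mention each gender term set."""
    counts = Counter()
    for text in texts:
        tokens = set(text.lower().split())
        for label, terms in GENDER_TERMS.items():
            if tokens & terms:
                counts[label] += 1
    return counts

print(gender_mentions(captions))  # e.g. Counter({'male': 2, 'female': 1})
```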
Related papers
- AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations [1.2101820447447276]
Multi-modal learning in the audio-language domain has seen significant advancements in recent years.
However, audio-language learning faces challenges due to limited and lower-quality data compared to image-language tasks.
Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations.
This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio related models.
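As a rough illustration of the label-aware augmentation this summary describes, here is a minimal sketch that applies one signal-processing operation and rewrites the caption to match; the function and caption template are assumptions for illustration, not the paper's pipeline.

```python
# A minimal sketch of label-aware augmentation: apply a signal-processing
# operation and describe the change in the caption. Illustrative only.
import numpy as np

def augment_with_caption(audio: np.ndarray, caption: str):
    """Attenuate the clip and rewrite the caption to reflect the change."""
    quieter = audio * 0.3  # simple gain reduction
    new_caption = f"{caption}, heard faintly in the distance"
    return quieter, new_caption

sr = 16000
clip = np.random.randn(sr)  # stand-in for a 1-second audio clip
audio_aug, caption_aug = augment_with_caption(clip, "a dog barking")
print(caption_aug)  # a dog barking, heard faintly in the distance
```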
arXiv Detail & Related papers (2024-05-17T21:08:58Z)
- The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data [28.23517306589778]
The NeurIPS 2023 Machine Learning for Audio Workshop brings together machine learning (ML) experts from various audio domains.
There are several valuable audio-driven ML tasks, from speech emotion recognition to audio event detection, but the community is sparse compared to other ML areas.
High-quality data collection is time-consuming and costly, making it challenging for academic groups to apply their often state-of-the-art strategies to a larger, more generalizable dataset.
arXiv Detail & Related papers (2024-03-21T00:13:59Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
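A hypothetical sketch of the clue-to-prompt step this summary describes: multi-modal tags are assembled into a prompt asking an LLM to write one coherent caption. The clue fields and prompt wording are assumptions for illustration.

```python
# Hypothetical assembly of multi-modal clues into an LLM captioning prompt.
# Field names and wording are assumptions, not Auto-ACD's actual prompts.
def build_caption_prompt(clues: dict) -> str:
    clue_lines = "\n".join(f"- {k}: {v}" for k, v in clues.items())
    return (
        "Write one fluent sentence describing the audio, "
        "consistent with these clues:\n" + clue_lines
    )

clues = {
    "audio tags": "engine idling, horn",
    "scene (from paired video)": "city street",
    "detected objects": "bus, pedestrians",
}
print(build_caption_prompt(clues))
```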
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Retrieval-Augmented Text-to-Audio Generation [36.328134891428085]
We show that state-of-the-art models such as AudioLDM are biased in their generation performance.
We propose a simple retrieval-augmented approach for text-to-audio (TTA) models.
We show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types.
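A minimal sketch of the retrieval step in a retrieval-augmented TTA setup: given an embedded text prompt, fetch the most similar stored audio-text pairs to condition generation on. The embeddings below are random stand-ins; Re-AudioLDM's actual retriever and conditioning are more involved.

```python
# Embedding-similarity retrieval over a database of audio-text pairs.
# Random embeddings stand in for a real encoder's output.
import numpy as np

rng = np.random.default_rng(0)
db_embeddings = rng.standard_normal((1000, 512))  # stored pair embeddings
db_embeddings /= np.linalg.norm(db_embeddings, axis=1, keepdims=True)

def retrieve(query: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k most similar database entries (cosine)."""
    q = query / np.linalg.norm(query)
    scores = db_embeddings @ q
    return np.argsort(scores)[::-1][:k]

query_emb = rng.standard_normal(512)
print(retrieve(query_emb))  # indices of pairs used as extra conditioning
```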
arXiv Detail & Related papers (2023-09-14T22:35:39Z)
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
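As a rough illustration, here is the kind of rule-based first pass a caption-cleaning pipeline might apply before handing text to an LLM; these heuristics are assumptions, not WavCaps' actual three-stage pipeline.

```python
# Heuristic first-stage filter for noisy, web-harvested audio descriptions:
# drop entries that are too short, are link-only, or are mostly
# non-alphabetic. Thresholds are illustrative assumptions.
import re

def keep_description(text: str) -> bool:
    text = text.strip()
    if len(text.split()) < 3:          # too short to describe a sound
        return False
    if re.search(r"https?://", text):  # link-only metadata
        return False
    letters = sum(ch.isalpha() for ch in text)
    return letters / max(len(text), 1) > 0.5

raw = ["dog", "Recording of rain on a tin roof", "http://example.com/x.wav"]
print([d for d in raw if keep_description(d)])
# ['Recording of rain on a tin roof']
```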
arXiv Detail & Related papers (2023-03-30T14:07:47Z) - A dataset for Audio-Visual Sound Event Detection in Movies [33.59510253345295]
We present a dataset of audio events called Subtitle-Aligned Movie Sounds (SAM-S).
We use publicly available closed-caption transcripts to automatically mine over 110K audio events from 430 movies.
We identify three dimensions for categorizing audio events (sound, source, and quality) and present the steps involved in producing a final taxonomy of 245 sounds.
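A minimal sketch of mining sound events from closed captions: descriptive cues in subtitles often appear in brackets, e.g. "[door slams]". The regex and examples below are illustrative assumptions; SAM-S's actual mining and taxonomy construction are more elaborate.

```python
# Extract bracketed or parenthesized sound descriptions from subtitle lines.
import re

SOUND_TAG = re.compile(r"[\[\(]([^\]\)]+)[\]\)]")

subtitles = [
    "[door slams]",
    "(thunder rumbling)",
    "I told you not to come back here.",
    "[dog barking in distance]",
]

events = [m.group(1) for line in subtitles for m in SOUND_TAG.finditer(line)]
print(events)  # ['door slams', 'thunder rumbling', 'dog barking in distance']
```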
arXiv Detail & Related papers (2023-02-14T19:55:39Z) - Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
Epic-Sounds is a large-scale dataset of audio annotations capturing temporal extents and class labels.
We identify actions that can be discriminated purely from audio by grouping free-form descriptions of audio into classes.
Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments.
arXiv Detail & Related papers (2023-02-01T18:19:37Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - ARCA23K: An audio dataset for investigating open-set label noise [48.683197172795865]
This paper introduces ARCA23K, an Automatically Retrieved and Curated Audio dataset comprising over 23,000 labelled Freesound clips.
We show that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and we refer to this type of label noise as open-set label noise.
arXiv Detail & Related papers (2021-09-19T21:10:25Z) - APES: Audiovisual Person Search in Untrimmed Video [87.4124877066541]
We present the Audiovisual Person Search dataset (APES).
APES contains over 1.9K identities labeled along 36 hours of video.
A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity.
arXiv Detail & Related papers (2021-06-03T08:16:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.