FSD50K: An Open Dataset of Human-Labeled Sound Events
- URL: http://arxiv.org/abs/2010.00475v2
- Date: Sat, 23 Apr 2022 20:12:00 GMT
- Title: FSD50K: An Open Dataset of Human-Labeled Sound Events
- Authors: Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra
- Abstract summary: We introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology.
The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms).
- Score: 30.42735806815691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on over 2M tracks from YouTube videos and encompassing over 500 sound classes. However, AudioSet is not an open dataset, as its official release consists of pre-computed audio features. Downloading the original audio tracks can be problematic due to YouTube videos gradually disappearing and usage rights issues. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.
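As a practical starting point, below is a minimal sketch of loading the FSD50K ground truth and building multi-hot targets for the 200 classes. It assumes the Zenodo release layout (audio under FSD50K.dev_audio/, labels in FSD50K.ground_truth/dev.csv with columns fname, labels, mids, split); treat these file and column names as assumptions to verify against your copy of the dataset.

```python
# Minimal sketch: multi-hot targets from the FSD50K ground truth.
# ASSUMPTION: Zenodo release layout, where FSD50K.ground_truth/dev.csv has
# columns fname, labels, mids, split ("train"/"val"), and labels is a
# comma-separated list of class names; audio lives at
# FSD50K.dev_audio/{fname}.wav.
import numpy as np
import pandas as pd

dev = pd.read_csv("FSD50K.ground_truth/dev.csv")

# Vocabulary of the 200 AudioSet-Ontology classes used by FSD50K.
classes = sorted({c for row in dev["labels"] for c in row.split(",")})
class_index = {c: i for i, c in enumerate(classes)}

def multi_hot(label_str: str) -> np.ndarray:
    """Encode a comma-separated label string as a multi-hot vector."""
    y = np.zeros(len(classes), dtype=np.float32)
    for c in label_str.split(","):
        y[class_index[c]] = 1.0
    return y

# Respect the official train/val split shipped with the dataset rather than
# re-splitting clips at random.
train = dev[dev["split"] == "train"]
val = dev[dev["split"] == "val"]
Y_train = np.stack([multi_hot(s) for s in train["labels"]])
print(Y_train.shape)  # (num_train_clips, 200)
```

Reusing the shipped split column matters because Freesound clips from the same uploader or original recording can leak across random splits, one of the splitting factors the paper analyzes. The eval set (FSD50K.eval_audio/, eval.csv) can be encoded the same way, and this multi-label setup is commonly evaluated with metrics such as mean average precision over the 200 classes.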
Related papers
- Sound Check: Auditing Audio Datasets [4.955141080136429]
Generative audio models are rapidly advancing in both capabilities and public utilization.
We conducted a literature review of hundreds of audio datasets and selected seven of the most prominent to audit.
We found that these datasets are biased against women, contain toxic stereotypes about marginalized communities, and contain significant amounts of copyrighted work.
arXiv Detail & Related papers (2024-10-17T00:51:27Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
arXiv Detail & Related papers (2023-03-30T14:07:47Z)
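The LLM-assisted cleaning stage described above can be pictured with a short, purely hypothetical sketch; the prompt, model name, and the use of the OpenAI Python client are illustrative assumptions, not details from the WavCaps paper:

```python
# Hypothetical sketch of an LLM-assisted caption-cleaning stage in the
# spirit of WavCaps; NOT the paper's pipeline. Model name and prompt are
# placeholders, and the openai package must be installed and configured.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Rewrite the following raw audio description as one short, grammatical "
    "caption describing only the sound. If it does not describe a sound, "
    "answer exactly: UNUSABLE.\n\nDescription: {raw}"
)

def clean_description(raw: str) -> str | None:
    """Filter/transform one raw description into a caption, or drop it."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(raw=raw)}],
    )
    caption = resp.choices[0].message.content.strip()
    return None if caption == "UNUSABLE" else caption
```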
- A dataset for Audio-Visual Sound Event Detection in Movies [33.59510253345295]
We present a dataset of audio events called Subtitle-Aligned Movie Sounds (SAM-S).
We use publicly-available closed-caption transcripts to automatically mine over 110K audio events from 430 movies.
We identify three dimensions for categorizing audio events, namely sound, source, and quality, and present the steps involved in producing a final taxonomy of 245 sounds.
arXiv Detail & Related papers (2023-02-14T19:55:39Z)
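The caption-mining idea above can be illustrated with a small, hypothetical sketch; the bracket convention and the regex are assumptions about typical closed-caption formatting, not the SAM-S procedure itself:

```python
# Illustrative sketch of mining sound descriptions from subtitle files,
# in the spirit of SAM-S (not the paper's actual procedure). Closed
# captions often mark non-speech sounds in brackets, e.g. "[door slams]".
import re
from pathlib import Path

SOUND_TAG = re.compile(r"[\[\(]([^\]\)]+)[\]\)]")  # [ ... ] or ( ... )

def mine_sound_events(srt_path: str) -> list[str]:
    """Return bracketed caption texts, a rough proxy for sound events."""
    text = Path(srt_path).read_text(encoding="utf-8", errors="ignore")
    return [m.group(1).strip().lower() for m in SOUND_TAG.finditer(text)]

# Example: mine_sound_events("movie.srt") might yield
# ["door slams", "tense music", "dog barking"].
```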
- ARCA23K: An audio dataset for investigating open-set label noise [48.683197172795865]
This paper introduces ARCA23K, an automatically retrieved and curated audio dataset comprising over 23,000 labelled Freesound clips.
We show that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and we refer to this type of label noise as open-set label noise.
arXiv Detail & Related papers (2021-09-19T21:10:25Z)
- Half-Truth: A Partially Fake Audio Detection Dataset [60.08010668752466]
This paper develops a dataset for half-truth audio detection (HAD).
Partially fake audio in the HAD dataset involves only changing a few words in an utterance.
Using this dataset, one can not only detect fake utterances but also localize the manipulated regions within an utterance.
arXiv Detail & Related papers (2021-04-08T08:57:13Z)
- VGGSound: A Large-scale Audio-Visual Dataset [160.1604237188594]
We propose a scalable pipeline to create an audio dataset from open-source media.
We use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes.
The resulting dataset can be used for training and evaluating audio recognition models.
arXiv Detail & Related papers (2020-04-29T17:46:54Z)
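A pipeline of this kind needs a media-download step; the sketch below is a minimal illustration using the yt-dlp CLI (which must be installed, along with ffmpeg), not the actual VGGSound tooling, and the URL is a placeholder:

```python
# Minimal sketch of the audio-download step in a VGGSound-style dataset
# pipeline (not the paper's actual tooling). Requires yt-dlp and ffmpeg.
import subprocess

def download_audio(video_url: str, out_dir: str = "clips") -> None:
    """Extract the audio track of one video as a WAV file."""
    subprocess.run(
        [
            "yt-dlp",
            "--extract-audio",
            "--audio-format", "wav",
            "--output", f"{out_dir}/%(id)s.%(ext)s",
            video_url,
        ],
        check=True,
    )

download_audio("https://example.com/some-video")  # placeholder URL
```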