BirdSet: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics
- URL: http://arxiv.org/abs/2403.10380v6
- Date: Sun, 18 May 2025 10:29:36 GMT
- Title: BirdSet: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics
- Authors: Lukas Rauch, Raphael Schwinger, Moritz Wirth, René Heinrich, Denis Huseljic, Marek Herde, Jonas Lange, Stefan Kahl, Bernhard Sick, Sven Tomforde, Christoph Scholz
- Abstract summary: BirdSet is a large-scale benchmark dataset for audio classification focusing on avian bioacoustics. We benchmark six well-known DL models in multi-label classification across three distinct training scenarios. We host our dataset on Hugging Face for easy accessibility and offer an extensive codebase to reproduce our results.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning (DL) has greatly advanced audio classification, yet the field is limited by the scarcity of large-scale benchmark datasets that have propelled progress in other domains. While AudioSet is a pivotal step to bridge this gap as a universal-domain dataset, its restricted accessibility and limited range of evaluation use cases challenge its role as the sole resource. Therefore, we introduce BirdSet, a large-scale benchmark dataset for audio classification focusing on avian bioacoustics. BirdSet surpasses AudioSet with over 6,800 recording hours ($\uparrow\!17\%$) from nearly 10,000 classes ($\uparrow\!18\times$) for training and more than 400 hours ($\uparrow\!7\times$) across eight strongly labeled evaluation datasets. It serves as a versatile resource for use cases such as multi-label classification, covariate shift or self-supervised learning. We benchmark six well-known DL models in multi-label classification across three distinct training scenarios and outline further evaluation use cases in audio classification. We host our dataset on Hugging Face for easy accessibility and offer an extensive codebase to reproduce our results.
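The multi-label classification benchmarking described above is commonly scored with class-wise mean average precision (cmAP, i.e. macro-averaged AP over classes). As a minimal pure-Python sketch of that metric (the exact metric suite used in the BirdSet codebase is an assumption here, not taken from the abstract):

```python
# Hedged sketch: class-wise mean average precision (cmAP), a standard
# multi-label evaluation metric. The BirdSet codebase's exact protocol
# (thresholds, handling of classes with no positives) may differ.

def average_precision(y_true, y_score):
    """AP for one class: mean of precision@k taken at the ranks of positives."""
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0  # convention; real implementations may skip such classes
    # Rank samples by descending score.
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos

def cmap(y_true_by_class, y_score_by_class):
    """Macro-average AP across classes (class-wise mAP)."""
    aps = [average_precision(t, s)
           for t, s in zip(y_true_by_class, y_score_by_class)]
    return sum(aps) / len(aps)
```

For example, a class with labels `[1, 0, 1]` and scores `[0.9, 0.8, 0.7]` gets AP = (1/1 + 2/3) / 2 = 5/6; cmAP then averages such per-class scores.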
Related papers
- Whilter: A Whisper-based Data Filter for "In-the-Wild" Speech Corpora Using Utterance-level Multi-Task Classification [3.650448386461648]
In-the-wild speech datasets often contain undesirable features, such as multiple speakers, non-target languages, and music. The Whilter model is proposed as a solution to identify these undesirable samples. Whilter achieves multitask F1 scores above 85% and equal error rates of 6.5% to 7.8% for three of five subtasks.
arXiv Detail & Related papers (2025-07-29T09:58:45Z) - The iNaturalist Sounds Dataset [60.157076990024606]
iNatSounds is a collection of 230,000 audio files capturing sounds from over 5,500 species, contributed by more than 27,000 recordists worldwide. The dataset encompasses sounds from birds, mammals, insects, reptiles, and amphibians, with audio and species labels derived from observations submitted to iNaturalist. We envision models trained on this data powering next-generation public engagement applications, and assisting biologists, ecologists, and land use managers in processing large audio collections.
arXiv Detail & Related papers (2025-05-31T02:07:37Z) - ECOSoundSet: a finely annotated dataset for the automated acoustic identification of Orthoptera and Cicadidae in North, Central and temperate Western Europe [51.82780272068934]
We present ECOSoundSet (European Cicadidae and Orthoptera Sound dataSet), a dataset containing 10,653 recordings of 200 orthopteran and 24 cicada species (217 and 26 respective taxa when including subspecies) present in North, Central, and temperate Western Europe. This dataset could serve as a meaningful complement to recordings already available online for the training of deep learning algorithms for the acoustic classification of orthopterans and cicadas in North, Central, and temperate Western Europe.
arXiv Detail & Related papers (2025-04-29T13:53:33Z) - Can Masked Autoencoders Also Listen to Birds? [2.430300340530418]
Masked Autoencoders (MAEs) have shown competitive results in audio classification by learning rich semantic representations. General-purpose models fail to generalize well when applied directly to fine-grained audio domains. This work demonstrates that bridging this domain gap requires more than domain-specific pretraining data.
arXiv Detail & Related papers (2025-04-17T12:13:25Z) - Self-supervised Learning for Acoustic Few-Shot Classification [10.180992026994739]
We introduce and evaluate a new architecture that combines CNN-based preprocessing with feature extraction based on state space models (SSMs).
We pre-train this architecture with contrastive learning on the actual task data, followed by fine-tuning with an extremely small amount of labelled data.
Our evaluation shows that it outperforms state-of-the-art architectures on the few-shot classification problem.
arXiv Detail & Related papers (2024-09-15T07:45:11Z) - AlleNoise: large-scale text classification benchmark dataset with real-world label noise [40.11095094521714]
We present AlleNoise, a new curated text classification benchmark dataset with real-world instance-dependent label noise.
The noise distribution comes from actual users of a major e-commerce marketplace, so it realistically reflects the semantics of human mistakes.
We demonstrate that a representative selection of established methods for learning with noisy labels is inadequate to handle such real-world noise.
arXiv Detail & Related papers (2024-06-24T09:29:14Z) - AudioProtoPNet: An interpretable deep learning model for bird sound classification [1.49199020343864]
This study introduces AudioProtoPNet, an adaptation of the Prototypical Part Network (ProtoPNet) for multi-label bird sound classification.
It is an inherently interpretable model that uses a ConvNeXt backbone to extract embeddings.
The model was trained on the BirdSet training dataset, which consists of 9,734 bird species and over 6,800 hours of recordings.
arXiv Detail & Related papers (2024-04-16T09:37:41Z) - Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - Exploring Meta Information for Audio-based Zero-shot Bird Classification [113.17261694996051]
This study investigates how meta-information can improve zero-shot audio classification.
We use bird species as an example case study due to the availability of rich and diverse meta-data.
arXiv Detail & Related papers (2023-09-15T13:50:16Z) - Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - Transferable Models for Bioacoustics with Human Language Supervision [0.0]
BioLingual is a new model for bioacoustics based on contrastive language-audio pretraining.
It can identify over a thousand species' calls across taxa, complete bioacoustic tasks zero-shot, and retrieve animal vocalization recordings from natural text queries.
arXiv Detail & Related papers (2023-08-09T14:22:18Z) - Co-Learning Meets Stitch-Up for Noisy Multi-label Visual Recognition [70.00984078351927]
This paper focuses on reducing noise based on some inherent properties of multi-label classification and long-tailed learning under noisy cases.
We propose a Stitch-Up augmentation to synthesize a cleaner sample, which directly reduces multi-label noise.
A Heterogeneous Co-Learning framework is further designed to leverage the inconsistency between long-tailed and balanced distributions.
arXiv Detail & Related papers (2023-07-03T09:20:28Z) - SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z) - MetaAudio: A Few-Shot Audio Classification Benchmark [2.294014185517203]
This work aims to alleviate this reliance on image-based benchmarks by offering the first comprehensive, public and fully reproducible audio-based alternative.
We compare the few-shot classification performance of a variety of techniques on seven audio datasets.
Our experimentation shows gradient-based meta-learning methods such as MAML and Meta-Curvature consistently outperform both metric and baseline methods.
arXiv Detail & Related papers (2022-04-05T11:33:44Z) - Training Classifiers that are Universally Robust to All Label Noise Levels [91.13870793906968]
Deep neural networks are prone to overfitting in the presence of label noise.
We propose a distillation-based framework that incorporates a new subcategory of Positive-Unlabeled learning.
Our framework generally outperforms at medium to high noise levels.
arXiv Detail & Related papers (2021-05-27T13:49:31Z) - Noisy Label Learning for Large-scale Medical Image Classification [37.79118840129632]
We adapt a state-of-the-art noisy-label multi-class training approach to learn a multi-label classifier for the dataset Chest X-ray14.
We show that the majority of label noise on Chest X-ray14 is present in the class 'No Finding', which is plausible because images labelled 'No Finding' are the most likely to actually contain one or more of the 14 diseases due to labelling mistakes.
arXiv Detail & Related papers (2021-03-06T07:42:36Z) - Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence. Self-supervised models trained on our data, despite being automatically constructed, achieve downstream performance similar to existing video datasets of comparable scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z) - Attention-Aware Noisy Label Learning for Image Classification [97.26664962498887]
Deep convolutional neural networks (CNNs) learned on large-scale labeled samples have achieved remarkable progress in computer vision.
The cheapest way to obtain a large body of labeled visual data is to crawl from websites with user-supplied labels, such as Flickr.
This paper proposes the attention-aware noisy label learning approach to improve the discriminative capability of the network trained on datasets with potential label noise.
arXiv Detail & Related papers (2020-09-30T15:45:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.