Foundation Models for Bioacoustics -- a Comparative Review
- URL: http://arxiv.org/abs/2508.01277v1
- Date: Sat, 02 Aug 2025 09:15:16 GMT
- Title: Foundation Models for Bioacoustics -- a Comparative Review
- Authors: Raphael Schwinger, Paria Vali Zadeh, Lukas Rauch, Mats Kurz, Tom Hauschild, Sam Lapp, Sven Tomforde
- Abstract summary: We review bioacoustic foundation models by thoroughly analysing design decisions such as model architecture, pretraining scheme, and training paradigm. We evaluate selected foundation models on classification tasks from the BEANS and BirdSet benchmarks. Our comprehensive experimental analysis reveals that BirdMAE, trained on large-scale bird song data with a self-supervised objective, achieves the best performance on the BirdSet benchmark.
- Score: 0.9109149174920012
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automated bioacoustic analysis is essential for biodiversity monitoring and conservation, requiring advanced deep learning models that can adapt to diverse bioacoustic tasks. This article presents a comprehensive review of large-scale pretrained bioacoustic foundation models and systematically investigates their transferability across multiple bioacoustic classification tasks. We overview bioacoustic representation learning, including major pretraining data sources and benchmarks. On this basis, we review bioacoustic foundation models by thoroughly analysing design decisions such as model architecture, pretraining scheme, and training paradigm. Additionally, we evaluate selected foundation models on classification tasks from the BEANS and BirdSet benchmarks, comparing the generalisability of learned representations under both linear and attentive probing strategies. Our comprehensive experimental analysis reveals that BirdMAE, trained on large-scale bird song data with a self-supervised objective, achieves the best performance on the BirdSet benchmark. On BEANS, BEATs$_{NLM}$, the encoder extracted from the NatureLM-audio large audio model, is slightly better. Both transformer-based models require attentive probing to extract the full performance of their representations. ConvNext$_{BS}$ and Perch, models trained with supervision on large-scale bird song data, remain competitive on BirdSet's passive acoustic monitoring classification tasks in linear probing settings. Training a new linear classifier has clear advantages over evaluating these models without further training. On BEANS, however, the baseline model BEATs, trained with self-supervision on AudioSet, outperforms the bird-specific models when evaluated with attentive probing. These findings provide valuable guidance for practitioners selecting appropriate models and adapting them to new bioacoustic classification tasks via probing.
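The distinction between the two probing strategies matters for the headline results. A minimal PyTorch sketch of both probes over a frozen encoder's frame embeddings (dimensions and head count are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Mean-pool frozen frame embeddings, then apply a single linear layer."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, dim) from a frozen encoder
        return self.head(feats.mean(dim=1))

class AttentiveProbe(nn.Module):
    """A learnable query attends over frame embeddings before classification."""
    def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        q = self.query.expand(feats.size(0), -1, -1)
        pooled, _ = self.attn(q, feats, feats)  # (batch, 1, dim)
        return self.head(pooled.squeeze(1))
```

In both cases the encoder receives no gradient and only the probe is trained; attentive probing can recover more from transformer representations, plausibly because class-relevant information is spread across tokens rather than concentrated in a single pooled summary.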
Related papers
- Can Masked Autoencoders Also Listen to Birds? [2.430300340530418]
Masked Autoencoders (MAEs) have shown competitive results in audio classification by learning rich semantic representations. General-purpose models fail to generalize well when applied directly to fine-grained audio domains. This work demonstrates that bridging this domain gap requires more than domain-specific pretraining data.
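For reference, the core MAE operation the summary alludes to is random patch masking; a hedged sketch (mask ratio and shapes are illustrative):

```python
import torch

def random_mask_patches(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, dim). Returns visible patches and kept indices."""
    b, n, _ = patches.shape
    num_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                   # random score per patch
    keep = noise.argsort(dim=1)[:, :num_keep]  # patches with lowest scores are kept
    visible = torch.gather(
        patches, 1, keep.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
    return visible, keep  # the encoder sees only `visible`; a decoder reconstructs the rest
```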
arXiv Detail & Related papers (2025-04-17T12:13:25Z)
- Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing [19.205671029694074]
Self-supervised learning (SSL) foundation models have emerged as powerful, domain-agnostic, general-purpose feature extractors. This paper investigates whether SSL models pre-trained directly on animal vocalizations offer a significant advantage over those pre-trained on speech.
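Using a speech-pretrained SSL model as a frozen feature extractor, the setup the paper compares against bioacoustic pretraining, looks roughly like this with torchaudio's pipeline API (checkpoint choice and file name are illustrative):

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE  # speech-pretrained SSL model
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("call.wav")   # hypothetical mono recording
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)  # per-layer (1, frames, dim) tensors
clip_embedding = features[-1].mean(dim=1)           # mean-pool the last layer for probing
```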
arXiv Detail & Related papers (2025-01-10T14:18:21Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- AudioProtoPNet: An interpretable deep learning model for bird sound classification [1.49199020343864]
This study introduces AudioProtoPNet, an adaptation of the Prototypical Part Network (ProtoPNet) for multi-label bird sound classification.
It is an inherently interpretable model that uses a ConvNeXt backbone to extract embeddings.
The model was trained on the BirdSet training dataset, which consists of 9,734 bird species and over 6,800 hours of recordings.
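The prototypical-part mechanism can be sketched as scoring backbone patches against learned prototypes; a simplified, hedged rendering (AudioProtoPNet's exact similarity function and training losses may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeLayer(nn.Module):
    """Scores a convolutional feature map against learned prototype parts."""
    def __init__(self, num_prototypes: int, dim: int, num_classes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.classifier = nn.Linear(num_prototypes, num_classes)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (batch, dim, H, W) from a ConvNeXt backbone
        patches = fmap.flatten(2).transpose(1, 2)   # (batch, H*W, dim)
        sim = F.cosine_similarity(                  # (batch, H*W, num_prototypes)
            patches.unsqueeze(2), self.prototypes[None, None], dim=-1)
        activation = sim.max(dim=1).values          # best-matching patch per prototype
        return self.classifier(activation)          # multi-label logits
```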
arXiv Detail & Related papers (2024-04-16T09:37:41Z)
- BirdSet: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics [2.2399415927517414]
BirdSet is a large-scale benchmark dataset for audio classification focusing on avian bioacoustics. We benchmark six well-known DL models in multi-label classification across three distinct training scenarios. We host our dataset on Hugging Face for easy accessibility and offer an extensive codebase to reproduce our results.
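Since the dataset is hosted on Hugging Face, loading a subset should look roughly like the following; the repository id and config name are assumptions to verify against the hub, not taken from the abstract:

```python
from datasets import load_dataset

# Assumed repository id and subset config; check the Hugging Face hub for exact names.
ds = load_dataset("DBD-research-group/BirdSet", "HSN", trust_remote_code=True)
print(ds["train"][0].keys())  # inspect fields, e.g. audio and multi-label targets
```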
arXiv Detail & Related papers (2024-03-15T15:10:40Z)
- Impact of Noisy Supervision in Foundation Model Learning [91.56591923244943]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets. We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Tuning In: Analysis of Audio Classifier Performance in Clinical Settings with Limited Data [3.0113849517062303]
This study assesses deep learning models for audio classification in a clinical setting with the constraint of small datasets.
We analyze CNNs, including DenseNet and ConvNeXt, alongside transformer models like ViT, SWIN, and AST.
Our method highlights the benefits of pre-training on large datasets before fine-tuning on specific clinical data.
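The pretrain-then-fine-tune recipe the study endorses can be sketched with timm (model choice, class count, and freezing policy are illustrative):

```python
import timm
import torch

# Load an ImageNet-pretrained ConvNeXt and replace its head for 5 clinical classes.
model = timm.create_model("convnext_tiny", pretrained=True, num_classes=5)

# Optionally freeze the backbone and train only the new head on the small dataset.
for name, param in model.named_parameters():
    if "head" not in name:
        param.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)
```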
arXiv Detail & Related papers (2024-02-07T16:41:11Z)
- Auto deep learning for bioacoustic signals [2.833479881983341]
This study investigates the potential of automated deep learning to enhance the accuracy and efficiency of multi-class classification of bird vocalizations.
Using the Western Mediterranean Wetland Birds dataset, we investigated the use of AutoKeras, an automated machine learning framework.
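AutoKeras reduces model selection to a bounded architecture search; a hedged sketch on spectrogram inputs (shapes and trial budget are illustrative placeholders, not the study's configuration):

```python
import autokeras as ak
import numpy as np

# Spectrograms as image-like arrays: (num_clips, mel_bins, time_frames, 1).
x_train = np.random.rand(100, 128, 256, 1).astype("float32")  # placeholder data
y_train = np.random.randint(0, 20, size=(100,))               # 20 hypothetical classes

clf = ak.ImageClassifier(max_trials=5, overwrite=True)  # searches candidate architectures
clf.fit(x_train, y_train, epochs=10)
best_model = clf.export_model()                         # a trained tf.keras model
```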
arXiv Detail & Related papers (2023-11-08T07:22:39Z)
- Exploring Meta Information for Audio-based Zero-shot Bird Classification [113.17261694996051]
This study investigates how meta-information can improve zero-shot audio classification.
We use bird species as an example case study due to the availability of rich and diverse meta-data.
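The underlying zero-shot mechanism is to score audio embeddings against class embeddings built from meta-information (text descriptions, functional traits, etc.); a minimal sketch in which both embedding spaces are assumed to be already aligned:

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(audio_emb: torch.Tensor, class_embs: torch.Tensor) -> torch.Tensor:
    """audio_emb: (batch, dim); class_embs: (num_classes, dim), one per unseen species,
    derived from meta-information. Returns predicted class indices."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    class_embs = F.normalize(class_embs, dim=-1)
    scores = audio_emb @ class_embs.T  # cosine similarity to every class
    return scores.argmax(dim=-1)       # nearest class wins, with no training on it
```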
arXiv Detail & Related papers (2023-09-15T13:50:16Z)
- ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
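Masked modelling combined with self-distillation can be sketched as a student matching an exponential-moving-average teacher on masked positions; a generic rendering (ASiT's group-masking scheme and exact objectives may differ):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, m: float = 0.999):
    """Advance the teacher as an exponential moving average of the student."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

def distillation_step(student, teacher, patches, mask):
    # patches: (batch, num_patches, dim); mask: (batch, num_patches) bool
    masked = patches.masked_fill(mask.unsqueeze(-1), 0.0)  # student sees masked input
    s_out = student(masked)
    with torch.no_grad():
        t_out = teacher(patches)                           # teacher sees the full input
    # Match the student to the teacher only at masked positions.
    return F.mse_loss(s_out[mask], t_out[mask])
```

The teacher starts as a copy of the student (e.g. via copy.deepcopy) and is advanced with ema_update after each optimizer step.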
arXiv Detail & Related papers (2022-11-23T18:21:09Z)
- SynBench: Task-Agnostic Benchmarking of Pretrained Representations using Synthetic Data [78.21197488065177]
Recent success in fine-tuning large models pretrained on broad data at scale has led to a significant paradigm shift in deep learning.
This paper proposes a new task-agnostic framework, SynBench, to measure the quality of pretrained representations using synthetic data.
arXiv Detail & Related papers (2022-10-06T15:25:00Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
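Encoding models of this kind typically regress model-layer features onto voxel responses; a hedged scikit-learn sketch with placeholder data (the paper's cross-validation and feature-alignment details are not reproduced):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# X: (timepoints, feature_dim) SSL features aligned to the stimulus;
# Y: (timepoints, voxels) measured cortical responses. Placeholder data:
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((500, 768)), rng.standard_normal((500, 100))

model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X, Y)
pred = model.predict(X)  # in practice, evaluate on held-out stimuli
# Per-voxel prediction correlation, the usual encoding-model score:
r = [np.corrcoef(pred[:, v], Y[:, v])[0, 1] for v in range(Y.shape[1])]
```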
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax [88.11979569564427]
We provide the first systematic analysis of the underperformance of state-of-the-art models on long-tailed distributions.
We propose a novel balanced group softmax (BAGS) module for balancing the classifiers within the detection frameworks through group-wise training.
Extensive experiments on the very recent long-tail large vocabulary object recognition benchmark LVIS show that our proposed BAGS significantly improves the performance of detectors.
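The group-wise softmax idea can be sketched as a separate cross-entropy within each instance-count-based class group (a simplified rendering: BAGS additionally reserves an "others" category per group and handles background proposals, which this sketch omits):

```python
import torch
import torch.nn.functional as F

def balanced_group_softmax_loss(logits, targets, groups):
    """logits: (batch, num_classes); targets: (batch,) class ids;
    groups: list of LongTensors partitioning class ids by training-instance count."""
    loss, terms = logits.new_zeros(()), 0
    for g in groups:
        in_group = (targets.unsqueeze(1) == g.unsqueeze(0)).any(dim=1)
        if not in_group.any():
            continue
        # Softmax is restricted to this group's classes, so frequent (head)
        # classes no longer suppress rare (tail) classes during training.
        group_logits = logits[in_group][:, g]
        group_targets = (targets[in_group].unsqueeze(1)
                         == g.unsqueeze(0)).float().argmax(dim=1)
        loss = loss + F.cross_entropy(group_logits, group_targets)
        terms += 1
    return loss / max(terms, 1)
```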
arXiv Detail & Related papers (2020-06-18T10:24:26Z)