AudioProtoPNet: An interpretable deep learning model for bird sound classification
- URL: http://arxiv.org/abs/2404.10420v3
- Date: Wed, 13 Nov 2024 16:42:16 GMT
- Title: AudioProtoPNet: An interpretable deep learning model for bird sound classification
- Authors: René Heinrich, Lukas Rauch, Bernhard Sick, Christoph Scholz,
- Abstract summary: This study introduces AudioProtoPNet, an adaptation of the Prototypical Part Network (ProtoPNet) for multi-label bird sound classification.
It is an inherently interpretable model that uses a ConvNeXt backbone to extract embeddings.
The model was trained on the BirdSet training dataset, which consists of 9,734 bird species and over 6,800 hours of recordings.
- Score: 1.49199020343864
- License:
- Abstract: Deep learning models have significantly advanced acoustic bird monitoring by being able to recognize numerous bird species based on their vocalizations. However, traditional deep learning models are black boxes that provide no insight into their underlying computations, limiting their usefulness to ornithologists and machine learning engineers. Explainable models could facilitate debugging, knowledge discovery, trust, and interdisciplinary collaboration. This study introduces AudioProtoPNet, an adaptation of the Prototypical Part Network (ProtoPNet) for multi-label bird sound classification. It is an inherently interpretable model that uses a ConvNeXt backbone to extract embeddings, with the classification layer replaced by a prototype learning classifier trained on these embeddings. The classifier learns prototypical patterns of each bird species' vocalizations from spectrograms of training instances. During inference, audio recordings are classified by comparing them to the learned prototypes in the embedding space, providing explanations for the model's decisions and insights into the most informative embeddings of each bird species. The model was trained on the BirdSet training dataset, which consists of 9,734 bird species and over 6,800 hours of recordings. Its performance was evaluated on the seven test datasets of BirdSet, covering different geographical regions. AudioProtoPNet outperformed the state-of-the-art model Perch, achieving an average AUROC of 0.90 and a cmAP of 0.42, with relative improvements of 7.1% and 16.7% over Perch, respectively. These results demonstrate that even for the challenging task of multi-label bird sound classification, it is possible to develop powerful yet inherently interpretable deep learning models that provide valuable insights for ornithologists and machine learning engineers.
Related papers
- BirdSet: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics [2.2399415927517414]
$texttBirdSet$ is a large-scale benchmark dataset for audio classification focusing on avian bioacoustics.
$texttBirdSet$ surpasses AudioSet with over 6,800 recording hours from nearly 10,000 classes.
We benchmark six well-known DL models in multi-label classification across three distinct training scenarios.
arXiv Detail & Related papers (2024-03-15T15:10:40Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z) - Self-Supervised Learning for Few-Shot Bird Sound Classification [10.395255631261458]
Self-supervised learning (SSL) in audio holds significant potential across various domains.
In this study, we demonstrate that SSL is capable of acquiring meaningful representations of bird sounds from audio recordings without the need for annotations.
arXiv Detail & Related papers (2023-12-25T22:33:45Z) - Exploring Meta Information for Audio-based Zero-shot Bird Classification [113.17261694996051]
This study investigates how meta-information can improve zero-shot audio classification.
We use bird species as an example case study due to the availability of rich and diverse meta-data.
arXiv Detail & Related papers (2023-09-15T13:50:16Z) - How Far Can Camels Go? Exploring the State of Instruction Tuning on Open
Resources [117.6496550359768]
This work explores recent advances in instruction-tuning language models on a range of open instruction-following datasets.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets.
We evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities.
arXiv Detail & Related papers (2023-06-07T19:59:23Z) - Machine Learning-based Classification of Birds through Birdsong [0.3908842679355254]
We apply Mel Frequency Cepstral Coefficients (MFCC) in combination with a range of machine learning models to identify Australian birds.
We achieve an overall accuracy of 91% for the top-5 birds from the 30 selected as the case study.
Applying the models to more challenging and diverse audio files comprising 152 bird species, we achieve an accuracy of 58%.
arXiv Detail & Related papers (2022-12-09T06:20:50Z) - Revisiting Classifier: Transferring Vision-Language Models for Video
Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize the well-pretrained language model to generate good semantic target for efficient transferring learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z) - Few-shot Long-Tailed Bird Audio Recognition [3.8073142980733]
We propose a sound detection and classification pipeline to analyze soundscape recordings.
Our solution achieved 18th place of 807 teams at the BirdCLEF 2022 Challenge hosted on Kaggle.
arXiv Detail & Related papers (2022-06-22T04:14:25Z) - Self-supervised models of audio effectively explain human cortical
responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
We show that these results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.