Related papers: Synthetic data enables context-aware bioacoustic sound event detection

URL: http://arxiv.org/abs/2503.00296v1
Date: Sat, 01 Mar 2025 02:03:22 GMT
Title: Synthetic data enables context-aware bioacoustic sound event detection
Authors: Benjamin Hoffman, David Robinson, Marius Miron, Vittorio Baglione, Daniela Canestrari, Damian Elias, Eva Trapote, Olivier Pietquin,
Abstract summary: We propose a methodology for training foundation models that enhances their in-context learning capabilities. We generate over 8.8 thousand hours of strongly-labeled audio and train a query-by-example, transformer-based model to perform few-shot bioacoustic sound event detection. We make our trained model available via an API, to provide ecologists and ethologists with a training-free tool for bioacoustic sound event detection.
Score: 18.158806322128527
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We propose a methodology for training foundation models that enhances their in-context learning capabilities within the domain of bioacoustic signal processing. We use synthetically generated training data, introducing a domain-randomization-based pipeline that constructs diverse acoustic scenes with temporally strong labels. We generate over 8.8 thousand hours of strongly-labeled audio and train a query-by-example, transformer-based model to perform few-shot bioacoustic sound event detection. Our second contribution is a public benchmark of 13 diverse few-shot bioacoustics tasks. Our model outperforms previously published methods by 49%, and we demonstrate that this is due to both model design and data scale. We make our trained model available via an API, to provide ecologists and ethologists with a training-free tool for bioacoustic sound event detection.

Related papers

Crossing the Species Divide: Transfer Learning from Speech to Animal Sounds [24.203596224724848]
Self-supervised speech models have demonstrated impressive performance in speech processing, but their effectiveness on non-speech data remains underexplored. We show that models such as HuBERT, WavLM, and XEUS can generate rich latent representations of animal sounds across taxa. Results are competitive with fine-tuned bioacoustic pre-trained models and show the impact of noise-robust pre-training setups.
arXiv Detail & Related papers (2025-09-04T12:39:05Z)
Learning Robust Spatial Representations from Binaural Audio through Feature Distillation [64.36563387033921]
We investigate the use of a pretraining stage based on feature distillation to learn a robust spatial representation of speech without the need for data labels. Our experiments demonstrate that the pretrained models show improved performance in noisy and reverberant environments.
arXiv Detail & Related papers (2025-08-28T15:43:15Z)
NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics [35.72581102737726]
We present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our training dataset consists of carefully curated text-audio pairs spanning bioacoustics, speech, and music. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks.
arXiv Detail & Related papers (2024-11-11T18:01:45Z)
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics [2.1019401515721583]
animal2vec is an interpretable large transformer model that learns from unlabeled audio and refines its understanding with labeled data. Meerkat Audio Transcripts is the largest labeled dataset on non-human terrestrial mammals. Our model outperforms existing methods on MeerKAT and the publicly available NIPS4Bplus birdsong dataset.
arXiv Detail & Related papers (2024-06-03T12:11:01Z)
Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities. RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process. During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
Transferable Models for Bioacoustics with Human Language Supervision [0.0]
BioLingual is a new model for bioacoustics based on contrastive language-audio pretraining. It can identify over a thousand species' calls across taxa, complete bioacoustic tasks zero-shot, and retrieve animal vocalization recordings from natural text queries.
arXiv Detail & Related papers (2023-08-09T14:22:18Z)
Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills. Recent developments have enabled the use of more naturalistic training data for computational models. It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot Detection [63.447493500066045]
This work proposes a data driven learning model for the synthesis of keystroke biometric data. The proposed method is compared with two statistical approaches based on Universal and User-dependent models. Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z)
Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data [24.608764078208953]
Subgraphs are constructed by sampling the entire pool of available training data to exploit the relationship between labelled and unlabeled audio samples. We evaluate our model on three benchmark audio databases, and two tasks: acoustic event detection and speech emotion recognition. Our model is compact (240k parameters), and can produce generalized audio representations that are robust to different types of signal noise.
arXiv Detail & Related papers (2022-01-31T21:32:22Z)
Metric-based multimodal meta-learning for human movement identification via footstep recognition [3.300376360949452]
We describe a novel metric-based learning approach that introduces a multimodal framework. We learn general-purpose representations from low multisensory data obtained from omnipresent sensing systems. Our results employ a metric-based contrastive learning approach for multi-sensor data to mitigate the impact of data scarcity.
arXiv Detail & Related papers (2021-11-15T18:46:14Z)
Discriminative Singular Spectrum Classifier with Applications on Bioacoustic Signal Recognition [67.4171845020675]
We present a bioacoustic signal classifier equipped with a discriminative mechanism to extract useful features for analysis and classification efficiently. Unlike current bioacoustic recognition methods, which are task-oriented, the proposed model relies on transforming the input signals into vector subspaces. The validity of the proposed method is verified using three challenging bioacoustic datasets containing anuran, bee, and mosquito species.
arXiv Detail & Related papers (2021-03-18T11:01:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.