Detection and classification of vocal productions in large scale audio
recordings
- URL: http://arxiv.org/abs/2302.07640v2
- Date: Fri, 11 Aug 2023 17:50:41 GMT
- Title: Detection and classification of vocal productions in large scale audio
recordings
- Authors: Guillem Bonafos, Pierre Pudlo, Jean-Marc Freyermuth, Thierry Legou,
Joël Fagot, Samuel Tronçon, Arnaud Rey
- Abstract summary: We propose an automatic data processing pipeline to extract vocal productions from large-scale natural audio recordings.
The pipeline is based on a deep neural network and addresses both issues simultaneously.
We test it on two different natural audio data sets, one from a group of Guinea baboons recorded from a primate research center and one from human babies recorded at home.
- Score: 0.12930503923129208
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an automatic data processing pipeline to extract vocal productions
from large-scale natural audio recordings and classify these vocal productions.
The pipeline is based on a deep neural network and addresses both issues
simultaneously. Through a series of computational steps (windowing, creation of
a noise class, data augmentation, re-sampling, transfer learning, Bayesian
optimisation), it automatically trains a neural network without requiring a
large sample of labeled data or substantial computing resources. Our end-to-end
methodology can handle noisy recordings made under different recording
conditions. We test it on two different natural audio data sets, one from a
group of Guinea baboons recorded from a primate research center and one from
human babies recorded at home. The pipeline trains a model on 72 and 77 minutes
of labeled audio recordings, with an accuracy of 94.58% and 99.76%. It is then
used to process 443 and 174 hours of natural continuous recordings and it
creates two new databases of 38.8 and 35.2 hours, respectively. We discuss the
strengths and limitations of this approach that can be applied to any massive
audio recording.
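The windowing step that opens the pipeline can be sketched as follows. This is a minimal illustration only: the window and hop durations are assumed values for demonstration, not the settings used in the paper.

```python
# Sketch of the windowing step: slide a fixed-length window over a long
# recording and collect candidate segments for later classification.
import numpy as np

def window_signal(signal, sr, win_s=0.5, hop_s=0.25):
    """Split a 1-D audio signal into overlapping fixed-length windows.

    Assumes len(signal) >= one window; win_s/hop_s are illustrative.
    """
    win = int(win_s * sr)
    hop = int(hop_s * sr)
    n = 1 + (len(signal) - win) // hop  # number of full windows
    return np.stack([signal[i * hop : i * hop + win] for i in range(n)])

# Example: 2 seconds of synthetic audio at 16 kHz
sr = 16000
sig = np.random.randn(2 * sr).astype(np.float32)
frames = window_signal(sig, sr)
print(frames.shape)  # (7, 8000): seven half-second windows, 50% overlap
```

Each resulting window would then be fed to the classifier (including the dedicated noise class), so that continuous recordings reduce to a sequence of per-window decisions.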
Related papers
- Contrastive and Transfer Learning for Effective Audio Fingerprinting through a Real-World Evaluation Protocol [1.8842532732272859]
Recent advances in song identification leverage deep neural networks to learn compact audio fingerprints directly from raw waveforms.
While these methods perform well under controlled conditions, their accuracy drops significantly in real-world scenarios where the audio is captured via mobile devices in noisy environments.
We generate three recordings of the same audio, each with increasing levels of noise, captured using a mobile device's microphone.
Our results reveal a substantial performance drop for two state-of-the-art CNN-based models under this protocol, compared to previously reported benchmarks.
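The increasing-noise idea behind that evaluation protocol can be approximated in software. The sketch below mixes white noise into a clean signal at progressively lower SNRs; the SNR values and the use of synthetic white noise are assumptions for illustration, since the paper's protocol uses real mobile-device recordings.

```python
# Sketch: create progressively noisier versions of a clean signal
# at target signal-to-noise ratios (in dB).
import numpy as np

def add_noise(clean, snr_db, rng=None):
    """Mix white Gaussian noise into a signal at a target SNR (dB)."""
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal(len(clean))
    p_sig = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that p_sig / p_scaled_noise == 10**(snr_db/10)
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

sr = 16000
clean = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of 440 Hz tone
versions = [add_noise(clean, snr, rng=0) for snr in (20, 10, 0)]
```

Running a fingerprinting model on `versions` in order would then expose how its accuracy degrades as noise increases.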
arXiv Detail & Related papers (2025-07-08T15:13:26Z) - pycnet-audio: A Python package to support bioacoustics data processing [0.0]
pycnet-audio is intended to provide a practical processing workflow for acoustic data.
pycnet-audio was originally developed by the U.S. Forest Service to support population monitoring of northern spotted owls.
arXiv Detail & Related papers (2025-06-17T17:40:21Z) - Synthetic data enables context-aware bioacoustic sound event detection [18.158806322128527]
We propose a methodology for training foundation models that enhances their in-context learning capabilities.
We generate over 8.8 thousand hours of strongly-labeled audio and train a query-by-example, transformer-based model to perform few-shot bioacoustic sound event detection.
We make our trained model available via an API, to provide ecologists and ethologists with a training-free tool for bioacoustic sound event detection.
arXiv Detail & Related papers (2025-03-01T02:03:22Z) - Taming Data and Transformers for Audio Generation [49.54707963286065]
AutoCap is a high-quality and efficient automatic audio captioning model.
GenAu is a scalable transformer-based audio generation architecture.
We compile 57M ambient audio clips, forming AutoReCap-XL, the largest available audio-text dataset.
arXiv Detail & Related papers (2024-06-27T17:58:54Z) - Large-scale unsupervised audio pre-training for video-to-speech
synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for
Binaural Audio Synthesis [129.86743102915986]
We formulate the synthesis process from a different perspective by decomposing the audio into a common part.
We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively.
Experimental results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
arXiv Detail & Related papers (2022-05-30T02:09:26Z) - Audio Interval Retrieval using Convolutional Neural Networks [0.0]
This article aims to investigate possible solutions to retrieve sound events based on a natural language query.
We specifically focus on the YamNet, AlexNet, and ResNet-50 pre-trained models to automatically classify audio samples.
Results show that the benchmarked models are comparable in terms of performance, with YamNet slightly outperforming the other two models.
arXiv Detail & Related papers (2021-09-21T01:32:18Z) - Artificially Synthesising Data for Audio Classification and Segmentation
to Improve Speech and Music Detection in Radio Broadcast [0.0]
We present a novel procedure that artificially synthesises data that resembles radio signals.
We trained a Convolutional Recurrent Neural Network (CRNN) on this synthesised data and outperformed state-of-the-art algorithms for music-speech detection.
arXiv Detail & Related papers (2021-02-19T14:47:05Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z) - Learning to Denoise Historical Music [30.165194151843835]
We propose an audio-to-audio neural network model that learns to denoise old music recordings.
The network is trained with both reconstruction and adversarial objectives on a noisy music dataset.
Our results show that the proposed method is effective in removing noise, while preserving the quality and details of the original music.
arXiv Detail & Related papers (2020-08-05T10:05:44Z) - Spot the conversation: speaker diarisation in the wild [108.61222789195209]
We propose an automatic audio-visual diarisation method for YouTube videos.
Second, we integrate our method into a semi-automatic dataset creation pipeline.
Third, we use this pipeline to create a large-scale diarisation dataset called VoxConverse.
arXiv Detail & Related papers (2020-07-02T15:55:54Z) - VGGSound: A Large-scale Audio-Visual Dataset [160.1604237188594]
We propose a scalable pipeline to create an audio dataset from open-source media.
We use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes.
The resulting dataset can be used for training and evaluating audio recognition models.
arXiv Detail & Related papers (2020-04-29T17:46:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.