On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis
- URL: http://arxiv.org/abs/2407.16417v2
- Date: Wed, 24 Jul 2024 11:19:22 GMT
- Title: On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis
- Authors: Eklavya Sarkar, Mathew Magimai.-Doss
- Abstract summary: This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz for marmoset call-type and caller classification tasks.
Results show that models with higher bandwidth improve performance, and pre-training on speech or general audio yields comparable results, improving over a spectral baseline.
- Score: 19.205671029694074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Marmoset monkeys encode vital information in their calls and serve as a surrogate model for neuro-biologists to understand the evolutionary origins of human vocal communication. Traditionally analyzed with signal processing-based features, recent approaches have utilized self-supervised models pre-trained on human speech for feature extraction, capitalizing on their ability to learn a signal's intrinsic structure independently of its acoustic domain. However, the utility of such foundation models remains unclear for marmoset call analysis in terms of multi-class classification, bandwidth, and pre-training domain. This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz for marmoset call-type and caller classification tasks. Results show that models with higher bandwidth improve performance, and pre-training on speech or general audio yields comparable results, improving over a spectral baseline.
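To make the recipe concrete, the sketch below shows one plausible version of the pipeline the abstract describes: embed marmoset call segments with a frozen speech foundation model pre-trained at 16 kHz, then train a linear classifier for call-type or caller labels. This is a minimal sketch under stated assumptions, not the paper's exact setup; the model choice (wav2vec 2.0 via torchaudio), the mean pooling over frames, and the `call_paths`/`labels` variables are all illustrative.

```python
# Minimal sketch: frozen SSL features + linear classifier for marmoset calls.
# Assumptions: wav2vec 2.0 as the 16 kHz foundation model, mean pooling of the
# last transformer layer, and hypothetical `call_paths` / `labels` inputs.
import torch
import torchaudio
from sklearn.linear_model import LogisticRegression

bundle = torchaudio.pipelines.WAV2VEC2_BASE      # pre-trained on 16 kHz speech
model = bundle.get_model().eval()

def embed(path: str) -> torch.Tensor:
    """Mean-pooled last-layer representation of one (mono) call recording."""
    waveform, sr = torchaudio.load(path)
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.no_grad():
        features, _ = model.extract_features(waveform)
    return features[-1].mean(dim=1).squeeze(0)   # (hidden_dim,)

# call_paths: list of WAV files; labels: call-type (or caller) IDs -- hypothetical
X = torch.stack([embed(p) for p in call_paths]).numpy()
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```

Swapping the bundle for a general-audio model, or pairing band-limited input with a matching 4 or 8 kHz encoder, would mirror the pre-training-domain and bandwidth axes the study compares.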
Related papers
- Advanced Framework for Animal Sound Classification With Features Optimization [35.2832738406242]
We propose an automated classification framework applicable to general animal sound classification.
Our approach consistently outperforms baseline methods by over 25% in precision, recall, and accuracy.
arXiv Detail & Related papers (2024-07-03T18:33:47Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first to comprehensively analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that affinely transforms the feature space to mitigate the malignant effect of noise and improve generalization.
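A minimal sketch of the general idea, assuming a frozen noisy pre-trained encoder with a learnable normalization-plus-affine head on top (the actual NMTune objective and regularizers are not reproduced here):

```python
# Hypothetical illustration of affine feature-space tuning on frozen features.
import torch
import torch.nn as nn

class AffineFeatureTuner(nn.Module):
    def __init__(self, dim: int, n_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)          # re-normalize the noisy features
        self.affine = nn.Linear(dim, dim)      # learnable affine transform
        self.head = nn.Linear(dim, n_classes)  # downstream classifier

    def forward(self, frozen_features: torch.Tensor) -> torch.Tensor:
        return self.head(self.affine(self.norm(frozen_features)))
```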
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- WhaleNet: a Novel Deep Learning Architecture for Marine Mammals Vocalizations on Watkins Marine Mammal Sound Database [49.1574468325115]
We introduce WhaleNet (Wavelet Highly Adaptive Learning Ensemble Network), a sophisticated deep ensemble architecture for the classification of marine mammal vocalizations.
We achieve an improvement in classification accuracy of 8-10% over existing architectures, corresponding to a classification accuracy of 97.61%.
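As a rough illustration of wavelet-based preprocessing in this spirit (the wavelet family, decomposition depth, and energy pooling below are assumptions, not WhaleNet's actual design):

```python
# Hypothetical sketch: per-band log-energies from a discrete wavelet transform.
import numpy as np
import pywt

def wavelet_band_energies(signal: np.ndarray, wavelet: str = "db4", level: int = 5):
    """Log-energy of each sub-band from a multi-level DWT of a vocalization."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([np.log1p(np.sum(c ** 2)) for c in coeffs])
```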
arXiv Detail & Related papers (2024-02-20T11:36:23Z)
- Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement [19.632358491434697]
Large, pre-trained representation models trained using self-supervised learning have gained popularity in various fields of machine learning.
In this paper, we investigate the feasibility of using pre-trained speech representation models for a downstream speech enhancement task.
Our proposed method enables significant improvements in speech quality compared to baselines when combined with various types of pre-trained speech models.
arXiv Detail & Related papers (2023-06-14T10:03:33Z)
- Can Self-Supervised Neural Representations Pre-Trained on Human Speech Distinguish Animal Callers? [23.041173892976325]
Self-supervised learning (SSL) models use only the intrinsic structure of a given signal, independent of its acoustic domain, to extract essential information from the input to an embedding space.
This paper explores the cross-transferability of SSL neural representations learned from human speech to analyze bio-acoustic signals.
arXiv Detail & Related papers (2023-05-23T13:06:14Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- Discriminative Singular Spectrum Classifier with Applications on Bioacoustic Signal Recognition [67.4171845020675]
We present a bioacoustic signal classifier equipped with a discriminative mechanism to extract useful features for analysis and classification efficiently.
Unlike current bioacoustic recognition methods, which are task-oriented, the proposed model relies on transforming the input signals into vector subspaces.
The validity of the proposed method is verified using three challenging bioacoustic datasets containing anuran, bee, and mosquito species.
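The subspace idea can be sketched with singular spectrum analysis: embed a signal into a Hankel (trajectory) matrix, keep its leading left-singular vectors as a class subspace, and score test frames by how much of their energy that subspace captures. The window length, rank, and scoring rule below are illustrative assumptions, not the paper's exact classifier.

```python
# Hedged sketch of a subspace representation via singular spectrum analysis.
import numpy as np

def ssa_subspace(signal: np.ndarray, window: int = 64, rank: int = 8) -> np.ndarray:
    """Orthonormal basis of the signal's dominant trajectory-matrix subspace."""
    n = len(signal) - window + 1
    hankel = np.stack([signal[i:i + window] for i in range(n)], axis=1)
    u, _, _ = np.linalg.svd(hankel, full_matrices=False)
    return u[:, :rank]                         # (window, rank)

def subspace_score(x: np.ndarray, basis: np.ndarray, window: int = 64) -> float:
    """Fraction of windowed-frame energy captured by a class subspace."""
    frames = np.stack([x[i:i + window] for i in range(len(x) - window + 1)])
    proj = frames @ basis                      # project frames onto the basis
    return float((proj ** 2).sum() / (frames ** 2).sum())
```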
arXiv Detail & Related papers (2021-03-18T11:01:21Z)
- Learning spectro-temporal representations of complex sounds with parameterized neural networks [16.270691619752288]
We propose a parametrized neural network layer that computes specific spectro-temporal modulations based on Gabor kernels (Learnable STRFs).
We evaluated the predictive capabilities of this layer on Speech Activity Detection, Speaker Verification, Urban Sound Classification, and Zebra Finch Call Type Classification.
As this layer is fully interpretable, we used quantitative measures to describe the distribution of the learned spectro-temporal modulations.
arXiv Detail & Related papers (2021-03-12T07:53:47Z)
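A minimal sketch of such a layer, assuming 2-D Gabor kernels (Gaussian envelope times cosine carrier) with learnable modulation frequencies and widths applied to a log-mel spectrogram; the exact parameterization of the published Learnable STRFs is not reproduced here:

```python
# Hypothetical Gabor-based spectro-temporal layer for (batch, 1, time, freq) inputs.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaborSTRF(nn.Module):
    def __init__(self, n_filters: int = 32, kernel_t: int = 11, kernel_f: int = 11):
        super().__init__()
        self.kt, self.kf = kernel_t, kernel_f
        # Learnable temporal/spectral modulation frequencies and envelope widths
        # (widths would typically be constrained positive in practice).
        self.omega_t = nn.Parameter(torch.rand(n_filters) * math.pi)
        self.omega_f = nn.Parameter(torch.rand(n_filters) * math.pi)
        self.sigma_t = nn.Parameter(torch.full((n_filters,), 2.0))
        self.sigma_f = nn.Parameter(torch.full((n_filters,), 2.0))

    def kernels(self) -> torch.Tensor:
        t = (torch.arange(self.kt) - self.kt // 2).float()[None, :, None]  # (1, kt, 1)
        f = (torch.arange(self.kf) - self.kf // 2).float()[None, None, :]  # (1, 1, kf)
        env = torch.exp(-(t / self.sigma_t[:, None, None]) ** 2
                        - (f / self.sigma_f[:, None, None]) ** 2)
        carrier = torch.cos(self.omega_t[:, None, None] * t
                            + self.omega_f[:, None, None] * f)
        return env * carrier                                # (n_filters, kt, kf)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        w = self.kernels().unsqueeze(1)                     # (n_filters, 1, kt, kf)
        return F.conv2d(spec, w, padding=(self.kt // 2, self.kf // 2))
```

Because each filter is determined by four interpretable parameters, the learned spectro-temporal modulation distribution can be read directly off the parameters, which is the property the summary highlights.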