Speech transformer models for extracting information from baby cries
- URL: http://arxiv.org/abs/2509.02259v1
- Date: Tue, 02 Sep 2025 12:34:33 GMT
- Title: Speech transformer models for extracting information from baby cries
- Authors: Guillem Bonafos, Jéremy Rouch, Lény Lego, David Reby, Hugues Patural, Nicolas Mathevon, Rémy Emonet
- Abstract summary: We evaluate five pre-trained speech models on eight baby cries datasets. For each dataset, we assess the latent representations of each model across all available classification tasks. Our results demonstrate that the latent representations of these models can effectively classify human baby cries.
- Score: 0.6822819361110412
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transfer learning using latent representations from pre-trained speech models achieves outstanding performance in tasks where labeled data is scarce. However, their applicability to non-speech data and the specific acoustic properties encoded in these representations remain largely unexplored. In this study, we investigate both aspects. We evaluate five pre-trained speech models on eight baby cries datasets, encompassing 115 hours of audio from 960 babies. For each dataset, we assess the latent representations of each model across all available classification tasks. Our results demonstrate that the latent representations of these models can effectively classify human baby cries and encode key information related to vocal source instability and identity of the crying baby. In addition, a comparison of the architectures and training strategies of these models offers valuable insights for the design of future models tailored to similar tasks, such as emotion detection.
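To make the probing setup described in the abstract concrete, here is a minimal sketch of the general recipe (frozen pre-trained speech model, pooled latent representations, lightweight classifier). It assumes the Hugging Face `transformers` WavLM checkpoint `microsoft/wavlm-base` and a scikit-learn classifier; the checkpoint, mean pooling, and logistic regression are illustrative choices, not the authors' exact pipeline.

```python
# Minimal sketch (not the authors' code): probe a frozen pre-trained speech
# model's latent representations with a simple classifier.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, WavLMModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base")
model = WavLMModel.from_pretrained("microsoft/wavlm-base").eval()

def embed(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Mean-pool the last hidden layer into one fixed-size vector per clip."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

# `cries` is a hypothetical list of (waveform, label) pairs, e.g. cry clips
# labeled by baby identity or a vocal-source attribute.
# X = np.stack([embed(w) for w, _ in cries])
# y = np.array([label for _, label in cries])
# scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
# print(scores.mean())
```

Keeping the embeddings frozen and the probe linear is deliberate: it measures what the representation already encodes (e.g., baby identity or vocal-source instability) rather than what fine-tuning could add.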
Related papers
- Crossing the Species Divide: Transfer Learning from Speech to Animal Sounds [24.203596224724848]
Self-supervised speech models have demonstrated impressive performance in speech processing, but their effectiveness on non-speech data remains underexplored.
We show that models such as HuBERT, WavLM, and XEUS can generate rich latent representations of animal sounds across taxa.
Results are competitive with fine-tuned bioacoustic pre-trained models and show the impact of noise-robust pre-training setups.
arXiv Detail & Related papers (2025-09-04T12:39:05Z)
- Learning Robust Spatial Representations from Binaural Audio through Feature Distillation [64.36563387033921]
We investigate the use of a pretraining stage based on feature distillation to learn a robust spatial representation of speech without the need for data labels.
Our experiments demonstrate that the pretrained models show improved performance in noisy and reverberant environments.
(A toy sketch of this feature-distillation recipe appears after this list.)
arXiv Detail & Related papers (2025-08-28T15:43:15Z)
- Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet [72.53502346791814]
We compare flat-start training across datasets, SSL representations (WavLM, XEUS), and decoder architectures.
SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases.
Age-related ASR and speaker verification analysis highlights the limitations of proprietary models.
arXiv Detail & Related papers (2025-08-22T17:59:35Z)
- Synthetic data enables context-aware bioacoustic sound event detection [18.158806322128527]
We propose a methodology for training foundation models that enhances their in-context learning capabilities.
We generate over 8.8 thousand hours of strongly-labeled audio and train a query-by-example, transformer-based model to perform few-shot bioacoustic sound event detection.
We make our trained model available via an API, to provide ecologists and ethologists with a training-free tool for bioacoustic sound event detection.
arXiv Detail & Related papers (2025-03-01T02:03:22Z)
- InfantCryNet: A Data-driven Framework for Intelligent Analysis of Infant Cries [24.06154195051215]
We present a novel data-driven framework, "InfantCryNet," for accomplishing these tasks.
We employ pre-trained audio models to incorporate prior knowledge into our model.
Experiments on real-life datasets demonstrate the superior performance of the proposed framework.
arXiv Detail & Related papers (2024-09-29T12:35:47Z)
- Measuring Sound Symbolism in Audio-visual Models [21.876743976994614]
This study investigates whether pre-trained audio-visual models demonstrate associations between sounds and visual representations.
Our findings reveal connections to human language processing, providing insights into cognitive architectures and machine learning strategies.
arXiv Detail & Related papers (2024-09-18T20:33:54Z)
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification with AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- XAI-based Comparison of Input Representations for Audio Event Classification [10.874097312428235]
We leverage eXplainable AI (XAI) to understand the underlying classification strategies of models trained on different input representations.
Specifically, we compare two model architectures with regard to relevant input features used for Audio Event Detection.
arXiv Detail & Related papers (2023-04-27T08:30:07Z)
- ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles the part of the speech signal relevant to transcription from the part that is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
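As referenced in the "Learning Robust Spatial Representations from Binaural Audio through Feature Distillation" entry above, feature distillation trains a student encoder to match a frozen teacher's features on perturbed inputs, with no labels involved. The toy PyTorch sketch below simplifies that paper's binaural, spatial setup to a single-channel case with additive noise; the encoder architecture, noise model, and hyperparameters are illustrative assumptions only.

```python
# Toy feature-distillation sketch (not the cited paper's code): a student
# encoder learns to reproduce a frozen teacher's features from noisy audio.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Conv1d(1, 64, 10, stride=5), nn.GELU(),
                        nn.Conv1d(64, 128, 3, stride=2)).eval()
student = nn.Sequential(nn.Conv1d(1, 64, 10, stride=5), nn.GELU(),
                        nn.Conv1d(64, 128, 3, stride=2))
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(student.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(clean_wave: torch.Tensor) -> float:
    """One distillation step: student sees a noisy copy, matches teacher features."""
    noisy_wave = clean_wave + 0.05 * torch.randn_like(clean_wave)
    with torch.no_grad():
        target = teacher(clean_wave)   # (batch, 128, frames)
    pred = student(noisy_wave)
    loss = loss_fn(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example: a batch of 4 one-second mono clips at 16 kHz.
# print(train_step(torch.randn(4, 1, 16000)))
```

Swapping the additive noise for reverberation or a second microphone channel recovers the spirit of learning representations that stay stable across acoustic conditions.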
This list is automatically generated from the titles and abstracts of the papers on this site.