Learning Marmoset Vocal Patterns with a Masked Autoencoder for Robust Call Segmentation, Classification, and Caller Identification
- URL: http://arxiv.org/abs/2410.23279v4
- Date: Tue, 12 Aug 2025 08:16:44 GMT
- Title: Learning Marmoset Vocal Patterns with a Masked Autoencoder for Robust Call Segmentation, Classification, and Caller Identification
- Authors: Bin Wu, Shinnosuke Takamichi, Sakriani Sakti, Satoshi Nakamura
- Abstract summary: Marmoset vocalizations are less structured, highly variable, and recorded in noisy, low-resource conditions. Previous CNNs handle local patterns but struggle with long-range temporal structure. We applied Transformers using self-attention for global dependencies. We pretrain Transformers with MAE -- a self-supervised method reconstructing masked segments from hundreds of hours of unannotated marmoset recordings.
- Score: 26.58036918301359
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The marmoset, a highly vocal primate, is a key model for studying social-communicative behavior. Unlike human speech, marmoset vocalizations are less structured, highly variable, and recorded in noisy, low-resource conditions. Learning marmoset communication requires joint call segmentation, classification, and caller identification -- challenging domain tasks. Previous CNNs handle local patterns but struggle with long-range temporal structure. We applied Transformers using self-attention for global dependencies. However, Transformers show overfitting and instability on small, noisy annotated datasets. To address this, we pretrain Transformers with MAE -- a self-supervised method reconstructing masked segments from hundreds of hours of unannotated marmoset recordings. The pretraining improved stability and generalization. Results show MAE-pretrained Transformers outperform CNNs, demonstrating modern self-supervised architectures effectively model low-resource non-human vocal communication.
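The central mechanism, masking most of a spectrogram and training a Transformer to reconstruct the hidden patches, can be illustrated with a short sketch. The following is a minimal, hypothetical rendering rather than the authors' implementation: the mel patching, 75% mask ratio, and layer counts are assumptions made for the example.

```python
# Minimal MAE-style pretraining sketch for marmoset spectrograms (PyTorch).
# All shapes and hyperparameters (n_mels, patch size, mask ratio, depths) are
# illustrative assumptions, not the paper's reported configuration.
import torch
import torch.nn as nn

class SpectrogramMAE(nn.Module):
    def __init__(self, n_mels=64, patch_frames=4, d_model=256, n_heads=4,
                 enc_layers=6, dec_layers=2, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        patch_dim = n_mels * patch_frames           # one patch = a few spectrogram frames
        self.embed = nn.Linear(patch_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), enc_layers)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), dec_layers)
        self.head = nn.Linear(d_model, patch_dim)   # predict raw patch values

    def forward(self, patches):
        # patches: (batch, n_patches, patch_dim); positional encodings are
        # omitted here for brevity, though a real MAE needs them.
        b, n, _ = patches.shape
        x = self.embed(patches)
        n_keep = max(1, int(n * (1.0 - self.mask_ratio)))
        perm = torch.rand(b, n, device=x.device).argsort(dim=1)
        keep = perm[:, :n_keep]                     # indices of visible patches
        visible = x.gather(1, keep.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        latent = self.encoder(visible)              # encode visible patches only
        # Rebuild the full sequence: learned mask token everywhere, then scatter
        # the encoded visible patches back into their original slots.
        full = self.mask_token.expand(b, n, -1).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, latent.size(-1)), latent)
        recon = self.head(self.decoder(full))
        # MSE computed on masked patches only, as in the original MAE objective.
        masked = torch.ones(b, n, device=x.device)
        masked.scatter_(1, keep, 0.0)
        loss = ((recon - patches) ** 2).mean(dim=-1)
        return (loss * masked).sum() / masked.sum().clamp(min=1.0)
```

In the standard MAE recipe this sketch follows, the decoder is discarded after pretraining and the encoder alone is fine-tuned on the small annotated set for the three downstream tasks: segmentation, call-type classification, and caller identification.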
Related papers
- Polynomial Mixing for Efficient Self-supervised Speech Encoders [50.58463928808225]
Polynomial Mixer (PoM) is a drop-in replacement for multi-head self-attention.
PoM matches self-attention's performance on downstream speech recognition tasks.
arXiv Detail & Related papers (2026-02-28T14:45:55Z) - Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information.
Our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z) - Multi Modal Information Fusion of Acoustic and Linguistic Data for Decoding Dairy Cow Vocalizations in Animal Welfare Assessment [0.0]
This study aims to decode dairy cow contact calls by employing multi-modal data fusion techniques.
We utilize natural language processing models to transcribe audio recordings of cow vocalizations into written form.
We categorized vocalizations into high frequency calls associated with distress or arousal, and low frequency calls linked to contentment or calmness.
arXiv Detail & Related papers (2024-11-01T09:48:30Z) - SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA results in segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z) - Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - Feature Representations for Automatic Meerkat Vocalization Classification [15.642602544201308]
This paper investigates feature representations for automatic meerkat vocalization analysis.
Call type classification studies conducted on two data sets reveal that feature extraction methods developed for human speech processing can be effectively employed for automatic meerkat call analysis.
arXiv Detail & Related papers (2024-08-27T10:51:51Z) - On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis [19.205671029694074]
This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz for marmoset call-type and caller classification tasks.
Results show that models with higher bandwidth improve performance, and pre-training on speech or general audio yields comparable results, improving over a spectral baseline.
arXiv Detail & Related papers (2024-07-23T12:00:44Z) - FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs [63.8261207950923]
FunAudioLLM is a model family designed to enhance natural voice interactions between humans and large language models (LLMs).
At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity.
The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub.
arXiv Detail & Related papers (2024-07-04T16:49:02Z) - Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model [2.827070255699381]
diffmotion-v2 is a speech-conditional, diffusion-based generative model built on the pre-trained WavLM model.
It can produce individual and stylized full-body co-speech gestures only using raw speech audio.
arXiv Detail & Related papers (2023-08-11T08:03:28Z) - Efficient Monaural Speech Enhancement using Spectrum Attention Fusion [15.8309037583936]
We present an improvement for speech enhancement models that maintains the expressiveness of self-attention while significantly reducing model complexity.
We construct a convolutional module to replace several self-attention layers in a speech Transformer, allowing the model to more efficiently fuse spectral features.
Our proposed model achieves comparable or better results than SOTA models, with significantly fewer parameters (0.58M), on the Voice Bank + DEMAND dataset; a rough sketch of this attention-to-convolution swap appears after this list.
arXiv Detail & Related papers (2023-08-04T11:39:29Z) - Transferring Annotator- and Instance-dependent Transition Matrix for Learning from Crowds [88.06545572893455]
In real-world crowd-sourcing scenarios, noise transition matrices are both annotator- and instance-dependent.
We first model the mixture of noise patterns by all annotators, and then transfer this modeling to individual annotators.
Experiments confirm the superiority of the proposed approach on synthetic and real-world crowd-sourcing data.
arXiv Detail & Related papers (2023-06-05T13:43:29Z) - Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z) - Affective social anthropomorphic intelligent system [1.7849339006560665]
This research proposes an anthropomorphic intelligent system that can hold a proper human-like conversation with emotion and personality.
A voice style transfer method is also proposed to map the attributes of a specific emotion.
arXiv Detail & Related papers (2023-04-19T18:24:57Z) - ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z) - Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - Exploring Self-Attention Mechanisms for Speech Separation [11.210834842425955]
This paper studies Transformers for speech separation in depth.
We extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets.
Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Longformers, and Reformers.
arXiv Detail & Related papers (2022-02-06T23:13:27Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers [40.991809705930955]
We train encoders whose lower layers use self-attention and whose upper layers are feed-forward, on Wall Street Journal and Switchboard.
Compared to baseline Transformers, we observe no performance drop and even minor gains.
We conclude the global view is unnecessary in training upper encoder layers.
arXiv Detail & Related papers (2020-11-08T16:01:38Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
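As a closing illustration of the Spectrum Attention Fusion entry above (replacing several self-attention layers in a speech Transformer with a convolutional module), a depthwise-separable temporal convolution wrapped in a residual, normalized block is one plausible stand-in. The module below is a hedged sketch with assumed sizes, not the paper's actual architecture.

```python
# Hedged sketch: a convolutional block standing in for a self-attention layer
# in a speech Transformer. Kernel size and model width are assumptions.
import torch.nn as nn

class ConvFusionBlock(nn.Module):
    """Fuses spectral features over time at linear (not quadratic) cost."""
    def __init__(self, d_model=256, kernel_size=31):
        super().__init__()
        self.fuse = nn.Sequential(
            # Depthwise conv: per-channel temporal context, few parameters.
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),
            # Pointwise conv: mixes information across feature channels.
            nn.Conv1d(d_model, d_model, 1),
            nn.GELU(),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (batch, time, d_model)
        y = self.fuse(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(x + y)           # residual connection, Transformer-style
```

Because the block keeps a Transformer layer's (batch, time, d_model) interface and residual structure, it can replace attention layers one-for-one in an existing stack, which is the kind of swap the abstract describes.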
This list is automatically generated from the titles and abstracts of the papers on this site.