Meta-Learning Approaches for Speaker-Dependent Voice Fatigue Models
- URL: http://arxiv.org/abs/2505.23378v2
- Date: Mon, 02 Jun 2025 10:11:58 GMT
- Title: Meta-Learning Approaches for Speaker-Dependent Voice Fatigue Models
- Authors: Roseline Polle, Agnes Norbury, Alexandra Livia Georgescu, Nicholas Cummins, Stefano Goria
- Abstract summary: We reformulate this task as a meta-learning problem and explore three approaches of increasing complexity. Using pre-trained speech embeddings, we evaluate these methods on a large longitudinal dataset of shift workers. Our results demonstrate that all meta-learning approaches tested outperformed both cross-sectional and conventional mixed-effects models.
- Score: 45.81793540247952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker-dependent modelling can substantially improve performance in speech-based health monitoring applications. While mixed-effect models are commonly used for such speaker adaptation, they require computationally expensive retraining for each new observation, making them impractical in a production environment. We reformulate this task as a meta-learning problem and explore three approaches of increasing complexity: ensemble-based distance models, prototypical networks, and transformer-based sequence models. Using pre-trained speech embeddings, we evaluate these methods on a large longitudinal dataset of shift workers (N=1,185, 10,286 recordings), predicting time since sleep from speech as a function of fatigue, a symptom commonly associated with ill-health. Our results demonstrate that all meta-learning approaches tested outperformed both cross-sectional and conventional mixed-effects models, with a transformer-based method achieving the strongest performance.
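The abstract's core idea, adapting a prototypical-network-style approach to regression over a speaker's own past recordings, can be sketched as distance-weighted prediction in embedding space. The following is a minimal, hypothetical illustration only (the paper's actual architectures are not specified here); the function name, the softmax-over-distances weighting, and the `temperature` parameter are assumptions for the sketch, not the authors' method.

```python
import numpy as np

def predict_query(support_emb, support_y, query_emb, temperature=1.0):
    """Distance-weighted regression over one speaker's support set.

    Hypothetical sketch: the prediction for a new recording (query) is a
    softmax-weighted average of the labels (e.g. time since sleep) of that
    speaker's previous recordings, weighted by negative squared distance
    in the pre-trained speech embedding space.
    """
    # Squared Euclidean distance from the query to each support embedding
    d2 = np.sum((support_emb - query_emb) ** 2, axis=1)
    # Closer support points get exponentially larger weights
    w = np.exp(-d2 / temperature)
    w /= w.sum()
    # Weighted average of support labels
    return float(w @ support_y)
```

Unlike a mixed-effects model, this kind of adaptation needs no retraining when a new observation arrives: the new recording is simply appended to the speaker's support set, which is the production advantage the abstract highlights.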
Related papers
- Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing [19.205671029694074]
Self-supervised learning (SSL) foundation models have emerged as powerful, domain-agnostic, general-purpose feature extractors. This paper investigates whether SSL models pre-trained directly on animal vocalizations offer a significant advantage over those pre-trained on speech.
arXiv Detail & Related papers (2025-01-10T14:18:21Z) - Adaptive Training Meets Progressive Scaling: Elevating Efficiency in Diffusion Models [52.1809084559048]
We propose a novel two-stage divide-and-conquer training strategy termed TDC Training. It groups timesteps based on task similarity and difficulty, assigning highly customized denoising models to each group, thereby enhancing the performance of diffusion models. While two-stage training avoids the need to train each model separately, the total training cost is even lower than training a single unified denoising model.
arXiv Detail & Related papers (2023-12-20T03:32:58Z) - Generative Pre-training for Speech with Flow Matching [81.59952572752248]
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
arXiv Detail & Related papers (2023-10-25T03:40:50Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining [0.0]
We revisit the performance comparison between two-stage and end-to-end models.
We find that audio-based language models pretrained using weak self-supervised objectives match or exceed the performance of similarly trained two-stage models.
arXiv Detail & Related papers (2023-09-08T17:12:14Z) - Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z) - DDKtor: Automatic Diadochokinetic Speech Analysis [13.68342426889044]
This paper presents two deep neural network models that automatically segment consonants and vowels from unannotated, untranscribed speech.
Results on a dataset of young, healthy individuals show that our LSTM model outperforms the current state-of-the-art systems.
The LSTM model also presents results comparable to trained human annotators when evaluated on an unseen dataset of older individuals with Parkinson's Disease.
arXiv Detail & Related papers (2022-06-29T13:34:03Z) - Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized and there is no need to further conduct a correction step.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z) - Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inferences based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z) - Self-Supervised Learning from Contrastive Mixtures for Personalized Speech Enhancement [19.645016575334786]
This work explores how self-supervised learning can be universally used to discover speaker-specific features.
We develop a simple contrastive learning procedure which treats the abundant noisy data as makeshift training targets.
arXiv Detail & Related papers (2020-11-06T15:21:00Z) - A Spectral Energy Distance for Parallel Speech Synthesis [29.14723501889278]
Speech synthesis is an important practical generative modeling problem.
We propose a new learning method that allows us to train highly parallel models of speech.
arXiv Detail & Related papers (2020-08-03T19:56:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.