Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations
- URL: http://arxiv.org/abs/2601.21084v1
- Date: Wed, 28 Jan 2026 22:13:05 GMT
- Title: Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations
- Authors: Amit Meghanani, Thomas Hain
- Abstract summary: Integrating front-end speech enhancement (SE) models with self-supervised learning (SSL)-based speech models is effective for downstream tasks in noisy conditions. MSE is prone to exploiting positional embeddings in SSL models, allowing the objective to be minimised through positional correlations instead of content-related information. This work frames the problem as a general limitation of self-supervised representation fine-tuning and investigates it through representation-guided SE.
- Score: 25.2377839206337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Integrating front-end speech enhancement (SE) models with self-supervised learning (SSL)-based speech models is effective for downstream tasks in noisy conditions. SE models are commonly fine-tuned using SSL representations with mean squared error (MSE) loss between enhanced and clean speech. However, MSE is prone to exploiting positional embeddings in SSL models, allowing the objective to be minimised through positional correlations instead of content-related information. This work frames the problem as a general limitation of self-supervised representation fine-tuning and investigates it through representation-guided SE. Two strategies are considered: (1) zero-padding, previously explored in SSL pre-training but here examined in the fine-tuning setting, and (2) speed perturbations with a soft-DTW loss. Experiments show that the soft-DTW-based approach achieves faster convergence and improved downstream performance, underscoring the importance of position-invariant fine-tuning in SSL-based speech modelling.
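The abstract's second strategy pairs speed-perturbed targets with a soft-DTW loss, which tolerates temporal misalignment between enhanced and clean representations and so cannot be minimised through positional correlations alone. As a minimal sketch of the soft-DTW recurrence only (not the paper's training pipeline; function names and the squared-Euclidean frame cost are illustrative choices here):

```python
import numpy as np

def softmin(a, b, c, gamma):
    # smoothed minimum: -gamma * log(exp(-a/g) + exp(-b/g) + exp(-c/g)),
    # computed with a max-shift for numerical stability
    vals = np.array([a, b, c]) / -gamma
    m = vals.max()
    return -gamma * (m + np.log(np.exp(vals - m).sum()))

def soft_dtw(X, Y, gamma=1.0):
    """Soft-DTW alignment cost between feature sequences X (n, d) and Y (m, d)."""
    n, m = len(X), len(Y)
    # pairwise squared-Euclidean frame costs
    D = np.array([[np.sum((X[i] - Y[j]) ** 2) for j in range(m)] for i in range(n)])
    # dynamic-programming table; R[i, j] = soft cost of aligning X[:i] with Y[:j]
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = D[i - 1, j - 1] + softmin(R[i - 1, j], R[i, j - 1],
                                                R[i - 1, j - 1], gamma)
    return R[n, m]
```

As gamma approaches zero the recurrence reduces to classic DTW; a small positive gamma keeps the loss differentiable so it can be backpropagated through the SE front-end.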
Related papers
- Self-Supervised Learning for Speaker Recognition: A study and review [0.0]
Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work aims to highlight recent trends and advancements, identifying current challenges in the field.
arXiv Detail & Related papers (2026-02-11T13:16:07Z) - Subspace Alignment for Vision-Language Model Test-time Adaptation [82.83192844597593]
Vision-language models (VLMs) are vulnerable to distribution shifts. Existing test-time adaptation methods rely on zero-shot predictions as pseudo-labels for self-training. We propose SubTTA, which aligns the semantic subspaces of both modalities to enhance zero-shot predictions.
arXiv Detail & Related papers (2026-01-13T02:02:41Z) - GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model [35.12859489567766]
We present GenTSE, a two-stage decoder-only generative LM approach for TSE. Separating semantics and acoustics stabilizes decoding and yields more faithful, content-aligned target speech. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.
arXiv Detail & Related papers (2025-12-24T06:13:02Z) - How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario [72.02391485962127]
Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR). In low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages. We extend a conventional efficient fine-tuning scheme based on the adapter to handle these issues.
arXiv Detail & Related papers (2024-11-27T10:51:00Z) - Communication-Efficient Personalized Federated Learning for Speech-to-Text Tasks [64.02867484165476]
To protect privacy and meet legal regulations, federated learning (FL) has gained significant attention for training speech-to-text (S2T) systems. The commonly used FL approach (i.e., FedAvg) in S2T tasks typically suffers from extensive communication overhead. We propose a personalized federated S2T framework that introduces FedLoRA, a lightweight LoRA module for client-side tuning and interaction with the server, and FedMem, a global model equipped with a $k$-near…
arXiv Detail & Related papers (2024-01-18T15:39:38Z) - Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition.
We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective.
Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z) - Improving generalizability of distilled self-supervised speech processing models under distorted settings [46.503354111827356]
Self-supervised learned (SSL) speech pre-trained models perform well across various speech processing tasks.
This paper proposes to apply Cross-Distortion Mapping and Domain Adversarial Training to SSL models during knowledge distillation.
arXiv Detail & Related papers (2022-10-14T17:17:45Z) - Improving Self-Supervised Learning by Characterizing Idealized Representations [155.1457170539049]
We prove necessary and sufficient conditions for any task invariant to given data augmentations.
For contrastive learning, our framework prescribes simple but significant improvements to previous methods.
For non-contrastive learning, we use our framework to derive a simple and novel objective.
arXiv Detail & Related papers (2022-09-13T18:01:03Z) - Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning [13.391307807956673]
We propose a novel automatic pronunciation assessment method based on self-supervised learning (SSL) models.
First, the proposed method fine-tunes the pre-trained SSL models with connectionist temporal classification to adapt the English pronunciation of English-as-a-second-language learners.
We show that the proposed SSL model-based methods outperform the baselines, in terms of the Pearson correlation coefficient, on datasets of Korean ESL learner children and Speechocean762.
arXiv Detail & Related papers (2022-04-08T06:13:55Z) - An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z) - Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
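The ILS-SSL entry above describes adding an extra SSL loss on intermediate layers so the model keeps content information throughout its depth. A schematic of that idea, not the paper's actual objective (ILS-SSL uses HuBERT-style masked-unit prediction, whereas MSE is used here only as a stand-in, and the layer indices and weight are illustrative):

```python
import numpy as np

def ils_style_loss(layer_outputs, targets, intermediate_layers=(3, 7), weight=0.5):
    # schematic only: combine the usual final-layer loss with auxiliary
    # losses computed on selected intermediate layers
    mse = lambda a, b: np.mean((a - b) ** 2)
    final_loss = mse(layer_outputs[-1], targets)
    aux = sum(mse(layer_outputs[i], targets) for i in intermediate_layers)
    return final_loss + weight * aux
```

The single scalar lets one optimiser update the whole stack, so gradients from the auxiliary terms reach the lower layers directly rather than only through the final layer.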
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.