Masked Autoencoders as Universal Speech Enhancer
- URL: http://arxiv.org/abs/2602.02413v1
- Date: Mon, 02 Feb 2026 18:13:59 GMT
- Title: Masked Autoencoders as Universal Speech Enhancer
- Authors: Rajalaxmi Rajagopalan, Ritwik Giri, Zhiqiang Tang, Kyu Han,
- Abstract summary: A masked-autoencoder-based universal speech enhancer is trained in a self-supervised manner. We show that the proposed method achieves state-of-the-art performance for both in-domain and out-of-domain evaluation datasets.
- Score: 5.670678893351032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Supervised speech enhancement methods have been very successful. In practical scenarios, however, clean reference speech is often unavailable, so self-supervised learning (SSL) based speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-related downstream applications are desirable. In this work, we develop a masked autoencoder based universal speech enhancer that is agnostic to the type of distortion affecting the speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. During pre-training, the masked autoencoder learns to remove the added distortions while reconstructing the masked regions of the spectrogram. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific downstream tasks. We evaluate the pre-trained features on denoising and dereverberation downstream tasks. We explore different augmentations (e.g., single- or multi-speaker) in the pre-training augmentation stack and the effect of different noisy input feature representations (e.g., $log1p$ compression) on the pre-trained embeddings and on downstream fine-tuning enhancement performance. We show that the proposed method not only outperforms the baseline but also achieves state-of-the-art performance for both in-domain and out-of-domain evaluation datasets.
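The abstract describes the pre-training recipe only at a high level. The sketch below illustrates what such an objective could look like in PyTorch, assuming a log1p-compressed magnitude spectrogram, a toy additive-noise augmentation stack, random time-frequency patch masking, a generic encoder-decoder, and an L1 reconstruction loss against the original noisy spectrogram; the actual augmentations, masking scheme, loss, and architecture in the paper may differ.

```python
# Minimal sketch of a masked-autoencoder pre-training step for speech enhancement.
# All hyperparameters and the choice of augmentation/loss are assumptions.
import torch
import torch.nn as nn

def log1p_spectrogram(wave: torch.Tensor, n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Magnitude STFT followed by log1p compression: log(1 + |X|)."""
    window = torch.hann_window(n_fft, device=wave.device)
    spec = torch.stft(wave, n_fft, hop_length=hop, window=window, return_complex=True)
    return torch.log1p(spec.abs())                      # (batch, freq, time)

def augment(wave: torch.Tensor, snr_db: float = 5.0) -> torch.Tensor:
    """Toy augmentation stack: add white noise at a fixed SNR, standing in for
    the paper's stack of extra distortions applied to the already noisy input."""
    noise = torch.randn_like(wave)
    scale = wave.norm() / (noise.norm() * 10 ** (snr_db / 20) + 1e-8)
    return wave + scale * noise

def random_mask(spec: torch.Tensor, ratio: float = 0.5, patch: int = 16) -> torch.Tensor:
    """Zero out a random subset of non-overlapping time-frequency patches."""
    _, f, t = spec.shape
    mask = torch.ones_like(spec)
    for i in range(0, f - patch + 1, patch):
        for j in range(0, t - patch + 1, patch):
            if torch.rand(()) < ratio:
                mask[:, i:i + patch, j:j + patch] = 0.0
    return mask

def pretrain_step(model: nn.Module, noisy_wave: torch.Tensor) -> torch.Tensor:
    """One self-supervised step: the model sees a masked, further-distorted
    spectrogram and is trained to reconstruct the original noisy spectrogram,
    i.e. to undo both the masking and the added distortions.
    `model` is any encoder-decoder mapping (batch, freq, time) to the same shape."""
    target = log1p_spectrogram(noisy_wave)              # reconstruction target
    distorted = log1p_spectrogram(augment(noisy_wave))  # augmentation stack output
    mask = random_mask(distorted)
    pred = model(distorted * mask)
    return nn.functional.l1_loss(pred, target)
```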
Related papers
- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine [16.046905753937384]
This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. The approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low-bitrate codecs (i.e., less than 200 bps) with minimal loss of downstream model performance (a minimal RVQ sketch appears after this list).
arXiv Detail & Related papers (2025-07-17T00:32:07Z) - Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on the adaptation data of target speakers instead of modifying the pre-trained model parameters (a minimal prompt-tuning sketch appears after this list).
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z) - SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z) - ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z) - Self-supervised Rewiring of Pre-trained Speech Encoders: Towards Faster Fine-tuning with Less Labels in Speech Processing [66.92823764664206]
We take a sober look into pre-trained speech encoders and rewire their representation space without requiring task-specific labels.
Our experiments on 6 speech processing tasks exhibit a significant convergence speedup during task fine-tuning as well as consistent task improvement.
arXiv Detail & Related papers (2022-10-24T08:27:09Z) - Speech Enhancement and Dereverberation with Diffusion-based Generative Models [40.239246150027235]
We present a detailed overview of the diffusion process that is based on a differential equation. We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates. In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z) - Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z) - Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation [60.26511271597065]
Speech distortions are a long-standing problem that degrades the performance of supervised speech processing models.
It is therefore important to enhance the robustness of speech processing models so that they perform well when encountering speech distortions.
arXiv Detail & Related papers (2022-03-30T07:25:52Z) - Data Augmentation based Consistency Contrastive Pre-training for Automatic Speech Recognition [18.303072203996347]
Self-supervised acoustic pre-training has achieved remarkable results on the automatic speech recognition (ASR) task.
Most of the successful acoustic pre-training methods use contrastive learning to learn the acoustic representations.
In this letter, we design a novel consistency contrastive learning (CCL) method by utilizing data augmentation for acoustic pre-training.
arXiv Detail & Related papers (2021-12-23T13:23:17Z) - Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z) - Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
In contrast, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective (a minimal score matching sketch appears after this list).
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements over state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
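The ACoM entry above relies on residual vector quantization (RVQ). The NumPy sketch below illustrates only the residual-quantization idea with fixed random codebooks; the codebook sizes, number of stages, training procedure, and task-specific loss guidance used in that paper are not shown.

```python
# Minimal RVQ sketch: each stage quantizes the residual left by the previous one.
import numpy as np

def rvq_encode(x: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Quantize vector x with a cascade of codebooks, returning one index per stage."""
    residual = x.copy()
    indices = []
    for cb in codebooks:                                # cb: (num_codes, dim)
        dists = np.linalg.norm(cb - residual, axis=1)   # distance to every code
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]                   # next stage quantizes what is left
    return indices

def rvq_decode(indices: list[int], codebooks: list[np.ndarray]) -> np.ndarray:
    """Reconstruct by summing the selected code vector from every stage."""
    return sum(cb[idx] for idx, cb in zip(indices, codebooks))

rng = np.random.default_rng(0)
dim, num_codes, stages = 8, 16, 4
codebooks = [rng.normal(size=(num_codes, dim)) for _ in range(stages)]
x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print(codes, float(np.linalg.norm(x - x_hat)))          # code indices, reconstruction error
```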
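The prompt-tuning entry above finetunes prompts instead of the pre-trained model weights. The sketch below assumes a generic transformer encoder over input embeddings and learnable prompt vectors prepended to the sequence; the actual VSR architecture and prompt placement in that paper may differ.

```python
# Minimal prompt-tuning sketch: freeze the backbone, train only the prompt vectors.
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    """Wrap a frozen encoder and prepend learnable prompt embeddings to each input."""
    def __init__(self, encoder: nn.Module, dim: int, num_prompts: int = 16):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():             # freeze the pre-trained model
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) input embeddings for one target speaker
        prompts = self.prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.encoder(torch.cat([prompts, x], dim=1))

# Example: adapt a toy transformer backbone to a new speaker by optimizing only the prompts.
dim = 256
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
model = PromptTunedEncoder(backbone, dim)
optimizer = torch.optim.Adam([model.prompts], lr=1e-3)
out = model(torch.randn(2, 50, dim))                    # (2, 16 + 50, 256)
```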
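The diffusion-based representation learning entry above builds on denoising score matching. The sketch below shows the standard single-noise-level denoising score matching loss with a generic score network; the noise schedule and latent-code conditioning introduced in that paper are not reproduced here.

```python
# Minimal denoising score matching (DSM) loss at a single noise level.
import torch
import torch.nn as nn

def denoising_score_matching_loss(score_net: nn.Module, x: torch.Tensor,
                                  sigma: float = 0.1) -> torch.Tensor:
    """Perturb x with Gaussian noise and train the network to predict the score of
    the perturbation kernel N(x_noisy; x, sigma^2 I), i.e. (x - x_noisy) / sigma**2."""
    noise = torch.randn_like(x)
    x_noisy = x + sigma * noise
    target = (x - x_noisy) / sigma ** 2                 # equals -noise / sigma
    pred = score_net(x_noisy)
    return 0.5 * ((pred - target) ** 2).sum(dim=-1).mean()

# Example with a toy MLP score network on 16-dimensional data.
score_net = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 16))
loss = denoising_score_matching_loss(score_net, torch.randn(32, 16))
loss.backward()
```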