MaskSR: Masked Language Model for Full-band Speech Restoration
- URL: http://arxiv.org/abs/2406.02092v1
- Date: Tue, 4 Jun 2024 08:23:57 GMT
- Title: MaskSR: Masked Language Model for Full-band Speech Restoration
- Authors: Xu Li, Qirui Wang, Xiaoyu Liu, et al.
- Abstract summary: Speech restoration aims at restoring high quality speech in the presence of a diverse set of distortions.
We propose MaskSR, a masked language model capable of restoring full-band 44.1 kHz speech jointly considering noise, reverb, clipping, and low bandwidth.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech restoration aims at restoring high quality speech in the presence of a diverse set of distortions. Although several deep learning paradigms have been studied for this task, the power of the recently emerging language models has not been fully explored. In this paper, we propose MaskSR, a masked language model capable of restoring full-band 44.1 kHz speech jointly considering noise, reverb, clipping, and low bandwidth. MaskSR works with discrete acoustic tokens extracted using a pre-trained neural codec. During training, MaskSR is optimized to predict randomly masked tokens extracted from the high quality target speech, conditioned on the corrupted speech with various distortions. During inference, MaskSR reconstructs the target speech tokens with efficient iterative sampling. Extensive experiments show that MaskSR obtains competitive results on both the full-band speech restoration task and its sub-tasks, compared with a wide range of models.
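The abstract describes both halves of the method: training predicts randomly masked codec tokens of the clean target conditioned on the corrupted speech, and inference fills in a fully masked sequence over a few confidence-ranked rounds. Below is a minimal PyTorch sketch of that masked generative token modeling recipe (MaskGIT-style); the `model(tokens, cond=...)` signature, the cosine schedule, and all sizes are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of masked token training and iterative sampling as the
# abstract describes. Module names and hyperparameters are assumptions.
import math
import torch
import torch.nn.functional as F

VOCAB = 1024          # assumed codec codebook size
MASK_ID = VOCAB       # extra token id reserved for [MASK]

def training_step(model, clean_tokens, cond):
    """Predict randomly masked clean-speech tokens, conditioned on an
    embedding `cond` of the corrupted input speech."""
    B, T = clean_tokens.shape
    ratio = torch.cos(0.5 * math.pi * torch.rand(B, 1))  # per-utterance mask ratio
    mask = torch.rand(B, T) < ratio                      # True = masked position
    inputs = clean_tokens.masked_fill(mask, MASK_ID)
    logits = model(inputs, cond=cond)                    # (B, T, VOCAB)
    # Cross-entropy only on the masked positions.
    return F.cross_entropy(logits[mask], clean_tokens[mask])

@torch.no_grad()
def iterative_sample(model, cond, T, steps=8):
    """Start fully masked; each round, commit the most confident
    predictions and re-mask the rest on a shrinking schedule."""
    B = cond.shape[0]
    tokens = torch.full((B, T), MASK_ID, dtype=torch.long)
    for s in range(steps):
        still_masked = tokens == MASK_ID
        conf, pred = model(tokens, cond=cond).softmax(-1).max(-1)
        tokens = torch.where(still_masked, pred, tokens)  # fill masked slots
        n_mask = int(math.cos(0.5 * math.pi * (s + 1) / steps) * T)
        if n_mask > 0:
            conf = conf.masked_fill(~still_masked, float("inf"))  # keep committed
            idx = conf.topk(n_mask, largest=False, dim=-1).indices
            tokens.scatter_(1, idx, MASK_ID)              # least confident retry
    return tokens
```

In practice, the completed token sequence would then be passed through the pre-trained codec's decoder to synthesize the restored 44.1 kHz waveform.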
Related papers
- EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning [46.66166658067071]
EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling) is a novel self-supervised learning approach for speech representation learning.
We introduce a novel selective and adaptive masking strategy for Masked Acoustic Modeling (MAM); a curriculum-masking sketch follows this entry.
EH-MAM outperforms several state-of-the-art baselines by 5%-10% across various low-resource speech recognition and SUPERB benchmarks.
arXiv Detail & Related papers (2024-10-17T02:59:22Z)
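The EH-MAM entry above hinges on choosing which frames to mask as training progresses. Below is a hedged sketch of one way an easy-to-hard masking curriculum could be implemented; the paper's actual selection rule may differ, and the `difficulty` scores (assumed to come from an auxiliary predictor) are an illustration.

```python
# Illustrative easy-to-hard masking curriculum (not the EH-MAM authors'
# code): frames judged easy (low difficulty) are masked early in
# training; harder frames are increasingly included later.
import torch

def select_masks(difficulty, progress, mask_ratio=0.4):
    """difficulty: (B, T) per-frame scores (higher = harder).
    progress: scalar in [0, 1], fraction of training completed."""
    B, T = difficulty.shape
    n_mask = int(mask_ratio * T)
    # Low temperature early on concentrates sampling on easy frames;
    # a higher temperature later flattens it so hard frames get masked too.
    temperature = 0.1 + 0.9 * progress
    weights = torch.softmax(-difficulty / temperature, dim=-1)
    idx = torch.multinomial(weights, n_mask)    # (B, n_mask), no replacement
    mask = torch.zeros(B, T, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask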
- Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility [15.463932957443973]
Speech restoration aims at restoring full-band speech with high quality and intelligibility, considering a diverse set of distortions.
MaskSR is a recently proposed generative model for this task.
We show that, with the same MaskSR model capacity and inference time, the proposed model, MaskSR2, significantly reduces the word error rate, a typical metric for intelligibility. A sketch of such a joint objective follows this entry.
arXiv Detail & Related papers (2024-09-14T08:09:55Z)
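The MaskSR2 entry pairs masked acoustic modeling with semantic knowledge distillation to improve intelligibility. A minimal sketch of what such a joint objective could look like, assuming a frozen self-supervised teacher run on the clean target speech; the projection layer and loss weighting are illustrative, not the paper's exact recipe.

```python
# Hedged sketch: masked-token loss plus a semantic distillation term.
# The teacher, projection, and alpha weighting are assumptions.
import torch
import torch.nn.functional as F

def joint_mam_distill_loss(token_logits, target_tokens, mask,
                           student_feats, teacher_feats, proj, alpha=1.0):
    """token_logits: (B, T, V) predictions for masked codec tokens.
    student_feats: (B, T, D_s) intermediate encoder features.
    teacher_feats: (B, T, D_t) features from a frozen semantic teacher.
    proj: nn.Linear(D_s, D_t) mapping student features to teacher space."""
    mam_loss = F.cross_entropy(token_logits[mask], target_tokens[mask])
    # Distillation: align projected student features with the teacher's.
    distill_loss = 1.0 - F.cosine_similarity(
        proj(student_feats), teacher_feats, dim=-1).mean()
    return mam_loss + alpha * distill_loss
```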
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
The proposed integration of TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations [51.89856133895233]
Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones.
In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application.
To make our SR model robust against various degradations, we use (i) a speech representation extracted from w2v-BERT as the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature. A conditioning sketch follows this entry.
arXiv Detail & Related papers (2023-03-03T01:57:16Z)
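The Miipher entry conditions restoration on both a speech representation (w2v-BERT) and a text representation (PnG-BERT). Below is a small PyTorch sketch of one plausible fusion, with speech frames cross-attending to the transcript features; the dimensions and attention layout are assumptions, not the paper's architecture.

```python
# Hedged sketch of fusing speech and text conditioning features.
import torch
import torch.nn as nn

class SpeechTextConditioner(nn.Module):
    def __init__(self, d_speech=1024, d_text=768, d_model=512, n_heads=8):
        super().__init__()
        self.speech_proj = nn.Linear(d_speech, d_model)  # speech (w2v-BERT-like)
        self.text_proj = nn.Linear(d_text, d_model)      # text (PnG-BERT-like)
        # Speech frames attend to the transcript representation.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)

    def forward(self, speech_feats, text_feats):
        q = self.speech_proj(speech_feats)   # (B, T_speech, d_model)
        kv = self.text_proj(text_feats)      # (B, T_text, d_model)
        fused, _ = self.cross_attn(q, kv, kv)
        return q + fused                     # conditioning for the decoder
```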
- Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach [29.962519978925236]
We propose two kinds of masking approaches: speech-level masking and phoneme-level masking.
We pre-trained the model with these two approaches and evaluated it on two downstream tasks: phoneme classification and speaker recognition. A masking sketch follows this entry.
arXiv Detail & Related papers (2022-10-25T07:26:47Z)
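The entry above contrasts speech-level masking (contiguous spans anywhere) with phoneme-level masking (whole phoneme segments masked together). A minimal sketch of the phoneme-level variant, assuming forced-alignment segments given as `(start, end)` frame ranges:

```python
# Sketch of phoneme-level masking: whole phoneme segments from a forced
# alignment are masked together, rather than fixed-length spans.
import torch

def phoneme_level_mask(T, segments, mask_prob=0.15):
    """T: number of frames. segments: list of (start, end) frame ranges,
    one per phoneme. Returns a (T,) boolean mask."""
    mask = torch.zeros(T, dtype=torch.bool)
    for start, end in segments:
        if torch.rand(()) < mask_prob:
            mask[start:end] = True    # mask the whole phoneme span
    return mask
```

Speech-level masking would instead sample spans without regard to phoneme boundaries.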
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best reported supervised approach with only 16% of the labeled data. A sketch of the joint objective follows this entry.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
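The entry above couples contrastive representation learning with a speech reconstruction objective so that features stay useful under noise. A hedged sketch of such a joint loss; the temperature, weighting, and decoder are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch: contrastive loss on noisy-input features plus a clean
# speech reconstruction term. Shapes and weighting are illustrative.
import torch
import torch.nn.functional as F

def joint_loss(context, clean_latents, negatives, recon, clean_wave,
               temperature=0.1, beta=1.0):
    """context: (B, T, D) encoder outputs on *noisy* input.
    clean_latents: (B, T, D) quantized targets from the clean signal.
    negatives: (B, T, K, D) distractor latents.
    recon / clean_wave: (B, L) decoder output and clean waveform."""
    pos = F.cosine_similarity(context, clean_latents, dim=-1)           # (B, T)
    neg = F.cosine_similarity(context.unsqueeze(2), negatives, dim=-1)  # (B, T, K)
    logits = torch.cat([pos.unsqueeze(-1), neg], dim=-1) / temperature
    targets = torch.zeros(logits.shape[:-1], dtype=torch.long)  # index 0 = positive
    contrastive = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                                  targets.reshape(-1))
    reconstruction = F.l1_loss(recon, clean_wave)
    return contrastive + beta * reconstruction
```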
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction. A feature-extraction example follows this entry.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
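For readers who want to try WavLM features directly, the released checkpoints are available through the Hugging Face transformers library. A short usage example; the checkpoint name is one of the public microsoft releases, and the 1-second random waveform is a placeholder.

```python
# Extract WavLM features with Hugging Face transformers.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
model.eval()

wave = torch.randn(16000)  # 1 s of 16 kHz audio (placeholder)
inputs = extractor(wave.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 768)
print(hidden.shape)
```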
- Multimodal Speech Recognition with Unstructured Audio Masking [49.01826387664443]
We simulate a more realistic masking scenario, called RandWordMask, during model training.
Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words.
Our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted.
arXiv Detail & Related papers (2020-10-16T21:49:20Z)