Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation
- URL: http://arxiv.org/abs/2406.02733v1
- Date: Tue, 4 Jun 2024 19:22:13 GMT
- Title: Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation
- Authors: Min-Jae Hwang, Ilia Kulikov, Benjamin Peloquin, Hongyu Gong, Peng-Jen Chen, Ann Lee
- Abstract summary: We propose a textless acoustic model with a self-supervised distillation strategy for noise-robust expressive speech-to-speech translation (S2ST).
Because the proposed method captures a noise-agnostic expressivity representation, it can generate high-quality speech even in noisy environments.
- Score: 29.789809751108304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a textless acoustic model with a self-supervised distillation strategy for noise-robust expressive speech-to-speech translation (S2ST). Recently proposed expressive S2ST systems have achieved impressive expressivity preservation by cascading a unit-to-speech (U2S) generator onto the speech-to-unit translation model. However, these systems are vulnerable to the presence of noise in input speech, a common condition in real-world translation scenarios. To address this limitation, we propose a U2S generator that incorporates a distillation with no labels (DINO) self-supervised training strategy into its pretraining process. Because the proposed method captures a noise-agnostic expressivity representation, it can generate high-quality speech even in noisy environments. Objective and subjective evaluation results verify that the proposed method significantly improves the performance of the expressive S2ST system in noisy environments while maintaining competitive performance in clean environments.
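The DINO-style pretraining the abstract describes pairs a student network, fed a noise-augmented view, with a momentum (EMA) teacher, fed the clean view. A minimal NumPy sketch of that teacher-student loop, with toy linear encoders standing in for the U2S pretraining encoder (all names, dimensions, and the noise level here are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, temp):
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / temp)
    return z / z.sum(axis=-1, keepdims=True)

class Encoder:
    """Toy linear encoder standing in for the expressivity encoder."""
    def __init__(self, dim_in=16, dim_out=8):
        self.w = rng.standard_normal((dim_in, dim_out)) * 0.1

    def __call__(self, x):
        return x @ self.w

student, teacher = Encoder(), Encoder()
teacher.w = student.w.copy()  # teacher starts as a copy of the student

clean = rng.standard_normal((4, 16))                      # clean-speech view
noisy = clean + 0.3 * rng.standard_normal(clean.shape)    # noise-augmented view

# DINO-style objective: the student, seeing the noisy view, is trained to
# match the (sharper, lower-temperature) teacher distribution on the clean
# view -- which pushes the representation toward noise-agnostic expressivity.
p_teacher = softmax(teacher(clean), temp=0.04)
p_student = softmax(student(noisy), temp=0.1)
loss = -np.mean(np.sum(p_teacher * np.log(p_student + 1e-9), axis=-1))

# The teacher receives no gradients; it tracks the student as an
# exponential moving average (EMA).
momentum = 0.996
teacher.w = momentum * teacher.w + (1 - momentum) * student.w
```

In a real training loop, `loss` would be backpropagated through the student only, and the EMA update applied after each optimizer step.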
Related papers
- TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
Speech Emotion Recognition (SER) is subject to ubiquitous environmental noise.
We introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge.
We show that TRNet substantially increases the system's robustness in both matched and unmatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z) - Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters [47.75276947690528]
The zero-shot text-to-speech (TTS) method can reproduce speaker characteristics very accurately.
However, this approach suffers from degradation in speech synthesis quality when the reference speech contains noise.
In this paper, we propose a noise-robust zero-shot TTS method.
arXiv Detail & Related papers (2024-01-10T12:21:21Z) - On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition [26.013815255299342]
We propose an efficient approach to noisy speech emotion recognition (NSER).
We adopt the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech.
Our experimental results show that 1) the proposed method achieves better NSER performance compared with the conventional noise reduction method, 2) outperforms self-supervised learning approaches, and 3) even outperforms text-based approaches using ASR transcription or the ground truth transcription of noisy speech.
arXiv Detail & Related papers (2023-11-13T05:45:55Z) - Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction [73.43534824551236]
We propose an efficient generative approach named the Diffusion Conditional Expectation Model (DCEM) for target speech extraction (TSE).
It can handle multi- and single-speaker scenarios in both noisy and clean conditions.
Our method outperforms conventional methods in terms of both intrusive and non-intrusive metrics.
arXiv Detail & Related papers (2023-09-25T04:58:38Z) - Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning [61.787485727134424]
We use a state variable to indicate the denoising process.
A UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process.
Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement.
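The state variable idea above can be sketched as an interpolation between the noisy input and a target that deliberately retains a small noise floor; a UNet-like estimator would then be trained to predict the signal at each state. A toy NumPy illustration (the signals, the 0.05 noise floor, and the linear interpolation are assumptions for illustration, not the paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(2)
clean = rng.standard_normal(256)   # toy clean speech frame
noise = rng.standard_normal(256)
noisy = clean + noise

# A state variable c in [0, 1] indexes the continuous denoising process:
# c = 0 is the noisy input, c = 1 the target. Keeping a small residual
# noise floor in the target reflects the abstract's observation that
# preserving a little noise benefits speech enhancement.
noise_floor = 0.05
def state_target(c):
    return (1 - c) * noisy + c * (clean + noise_floor * noise)

# Intermediate targets sampled along the continuous denoising trajectory.
targets = [state_target(c) for c in np.linspace(0.0, 1.0, 5)]
```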
arXiv Detail & Related papers (2023-09-17T13:27:11Z) - A Holistic Cascade System, benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation [45.47457657122893]
Expressive speech-to-speech translation (S2ST) aims to transfer prosodic attributes of source speech to target speech while maintaining translation accuracy.
Existing research in expressive S2ST is limited, typically focusing on a single expressivity aspect at a time.
We propose a holistic cascade system for expressive S2ST, combining multiple prosody transfer techniques previously considered only in isolation.
arXiv Detail & Related papers (2023-01-25T14:27:00Z) - Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
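The mask-and-predict loop summarized above can be sketched in a few lines: start from a fully masked unit sequence, predict all positions in parallel, then re-mask the least confident positions for the next iteration. A toy NumPy sketch with a random stand-in for the unit predictor (the vocabulary size, sequence length, iteration count, and `toy_model` are all illustrative assumptions; a real model conditions on the source speech):

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, LENGTH, ITERS = 50, 10, 4
MASK = -1  # sentinel for a masked position

def toy_model(units):
    """Stand-in for the non-autoregressive unit predictor: returns the
    argmax unit and a confidence score for every position in parallel."""
    logits = rng.standard_normal((len(units), VOCAB))
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs.max(axis=-1)

# Start fully masked; anneal the number of masked positions toward zero.
units = np.full(LENGTH, MASK)
for t in range(ITERS):
    pred, conf = toy_model(units)
    units = pred
    n_mask = int(LENGTH * (ITERS - 1 - t) / ITERS)
    if n_mask > 0:
        # Re-mask the least confident positions for the next iteration.
        units[np.argsort(conf)[:n_mask]] = MASK
```

Because every position is predicted in parallel each iteration, decoding cost scales with the (fixed) number of iterations rather than the sequence length, which is where the latency gain comes from.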
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.