PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2312.08571v1
- Date: Wed, 13 Dec 2023 23:46:26 GMT
- Title: PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition
- Authors: Chengxi Lei, Satwinder Singh, Feng Hou, Xiaoyun Jia, Ruili Wang
- Abstract summary: We propose a novel speech data augmentation method called PhasePerturbation.
PhasePerturbation operates dynamically on the phase spectrum of speech.
- Score: 22.322528334591134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most of the current speech data augmentation methods operate on either the
raw waveform or the amplitude spectrum of speech. In this paper, we propose a
novel speech data augmentation method called PhasePerturbation that operates
dynamically on the phase spectrum of speech. Instead of statically rotating a
phase by a constant degree, PhasePerturbation utilizes three dynamic phase
spectrum operations, i.e., a randomization operation, a frequency masking
operation, and a temporal masking operation, to enhance the diversity of speech
data. We conduct experiments on wav2vec2.0 pre-trained ASR models by
fine-tuning them with the PhasePerturbation augmented TIMIT corpus. The
experimental results demonstrate a 10.9% relative reduction in the word error
rate (WER) compared with the baseline model fine-tuned without any augmentation
operation. Furthermore, the proposed method achieves further WER reductions
(12.9% and 15.9%) when combined with Vocal Tract Length Perturbation (VTLP)
and SpecAug, respectively, both of which are amplitude spectrum-based
augmentation methods. These results highlight the capability of PhasePerturbation
to complement and improve current amplitude spectrum-based augmentation methods.
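The abstract describes the three phase operations only at a high level, so the Python sketch below is one plausible reading rather than the authors' implementation. The function name phase_perturb, the jitter range, and the mask widths are illustrative assumptions; the abstract does not state whether masked regions are zeroed or re-randomized.

```python
# Minimal sketch of phase-spectrum augmentation in the spirit of
# PhasePerturbation. All hyperparameters (jitter range, mask widths)
# are illustrative assumptions, not the paper's settings.
import numpy as np
from scipy.signal import stft, istft

def phase_perturb(wave, sr=16000, n_fft=512, hop=128, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    _, _, spec = stft(wave, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag, phase = np.abs(spec), np.angle(spec)
    n_freq, n_frames = phase.shape

    # 1) Randomization: jitter every phase value by a small random angle.
    phase = phase + rng.uniform(-0.5, 0.5, size=phase.shape)

    # 2) Frequency masking: re-randomize the phase in a random band of bins.
    width = int(rng.integers(1, 20))
    f0 = int(rng.integers(0, max(1, n_freq - width)))
    phase[f0:f0 + width, :] = rng.uniform(-np.pi, np.pi, size=(width, n_frames))

    # 3) Temporal masking: re-randomize the phase over a random span of frames.
    span = int(rng.integers(1, 20))
    t0 = int(rng.integers(0, max(1, n_frames - span)))
    phase[:, t0:t0 + span] = rng.uniform(-np.pi, np.pi, size=(n_freq, span))

    # Leave the amplitude spectrum untouched and resynthesize the waveform.
    _, out = istft(mag * np.exp(1j * phase), fs=sr,
                   nperseg=n_fft, noverlap=n_fft - hop)
    return out[:len(wave)].astype(wave.dtype)
```

Because only the phase is altered, an augmentation of this kind can in principle be stacked with amplitude-based methods such as VTLP and SpecAug, which is consistent with the complementary gains reported above.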
Related papers
- Stage-Wise and Prior-Aware Neural Speech Phase Prediction [28.422370098313788]
This paper proposes a novel Stage-wise and Prior-aware Neural Speech Phase Prediction (SP-NSPP) model.
In the initial prior-construction stage, we preliminarily predict a rough prior phase spectrum from the amplitude spectrum.
The subsequent refinement stage transforms the amplitude spectrum into a refined high-quality phase spectrum conditioned on the prior phase.
arXiv Detail & Related papers (2024-10-07T12:45:20Z) - GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer [26.567649613966974]
This paper introduces GLDiTalker, a novel speech-driven 3D facial animation model that employs a Graph Latent Diffusion Transformer.
The core idea behind GLDiTalker is that the audio-mesh modality misalignment can be resolved by diffusing the signal in a quantized spatio-temporal latent space.
arXiv Detail & Related papers (2024-08-03T17:18:26Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - M3ST: Mix at Three Levels for Speech Translation [66.71994367650461]
We propose the Mix at Three Levels for Speech Translation (M3ST) method to increase the diversity of the augmented training corpus.
In the first stage of fine-tuning, we mix the training corpus at three levels (word, sentence, and frame) and fine-tune the entire model on the mixed data.
Experiments and analysis on the MuST-C speech translation benchmark show that M3ST outperforms current strong baselines and achieves state-of-the-art results on eight translation directions with an average BLEU of 29.9.
arXiv Detail & Related papers (2022-12-07T14:22:00Z) - MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present the Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
In practice, MAST significantly outperforms AST, by an average of 3.4% in accuracy across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z) - Speech Enhancement with Perceptually-motivated Optimization and Dual
Transformations [5.4878772986187565]
We propose a sub-band based speech enhancement system with perceptually-motivated optimization and dual transformations, called PT-FSE.
Our proposed model not only achieves substantial improvements over its backbone, but also outperforms the current state-of-the-art while being 27% smaller than the SOTA model.
With an average NB-PESQ of 3.57 on the benchmark dataset, our system offers the best speech enhancement results reported to date.
arXiv Detail & Related papers (2022-09-24T02:33:40Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z) - Maximum Voiced Frequency Estimation: Exploiting Amplitude and Phase
Spectra [22.675699190161417]
This paper proposes a new approach for maximum voiced frequency (MVF) estimation which exploits both amplitude and phase spectra.
It is shown that phase conveys relevant information about the harmonicity of the voice signal, and that it can be jointly used with features derived from the amplitude spectrum.
The proposed technique is compared to two state-of-the-art methods and shows superior performance in both objective and subjective evaluations.
arXiv Detail & Related papers (2020-05-31T13:40:46Z) - Transforming Spectrum and Prosody for Emotional Voice Conversion with
Non-Parallel Training Data [91.92456020841438]
Many studies require parallel speech data between different emotional patterns, which is not practical in real life.
We propose a CycleGAN network to find an optimal pseudo pair from non-parallel training data.
We also study the use of the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech prosody at different time resolutions; a minimal sketch of such a decomposition follows this list.
arXiv Detail & Related papers (2020-02-01T12:36:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.