Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement
- URL: http://arxiv.org/abs/2506.16580v1
- Date: Thu, 19 Jun 2025 20:05:29 GMT
- Title: Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement
- Authors: Tuan-Nam Nguyen, Ngoc-Quan Pham, Seymanur Akti, Alexander Waibel
- Abstract summary: We propose a first streaming accent conversion model that transforms non-native speech into a native-like accent. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism.
- Score: 52.89324095217975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a first streaming accent conversion (AC) model that transforms non-native speech into a native-like accent while preserving speaker identity, prosody and improving pronunciation. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Our streaming AC model achieves comparable performance to the top AC models while maintaining stable latency, making it the first AC system capable of streaming.
Related papers
- Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model. The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z) - Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning [22.866607731480638]
This paper presents a novel model for converting speaker identity using non-parallel data. Deep learning techniques are used to enhance disentanglement completion and linguistic content preservation. The Stepback network's design offers a promising solution for advanced voice conversion tasks.
arXiv Detail & Related papers (2025-01-26T17:43:32Z) - CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2. Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens. We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z) - Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling [14.98368067290024]
Takin-VC is a novel expressive zero-shot voice conversion framework. We introduce an innovative hybrid content encoder that incorporates an adaptive fusion module. For timbre modeling, we propose advanced memory-augmented and context-aware modules.
arXiv Detail & Related papers (2024-10-02T09:07:33Z) - Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into a native accent to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z) - Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation [41.98697872087318]
We introduce Diff-HierVC, a hierarchical VC system based on two diffusion models.
Our model achieves a CER of 0.83% and EER of 3.29% in zero-shot VC scenarios.
arXiv Detail & Related papers (2023-11-08T14:02:53Z) - Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition [19.971343876930767]
We present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model.
Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified.
Experiments on the open 170-hour AISHELL-1 dataset show that the proposed method can unify the streaming and non-streaming models simply and efficiently.
arXiv Detail & Related papers (2020-12-10T06:54:54Z) - Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition [16.082949461807335]
We present a Transformer-Transducer model architecture and a training technique to unify streaming and non-streaming speech recognition models into one model.
We show that we can run this model in a Y-model architecture with the top layers running in parallel in low latency and high latency modes.
This allows us to have streaming speech recognition results with limited latency and delayed speech recognition results with large improvements in accuracy.
arXiv Detail & Related papers (2020-10-07T05:58:28Z) - Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z) - Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z) - Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech [23.30022534796909]
Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre.
We propose approaches to improving accent conversion applicability, as well as quality.
arXiv Detail & Related papers (2020-05-19T08:09:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.