Text-only Domain Adaptation using Unified Speech-Text Representation in
Transducer
- URL: http://arxiv.org/abs/2306.04076v1
- Date: Wed, 7 Jun 2023 00:33:02 GMT
- Title: Text-only Domain Adaptation using Unified Speech-Text Representation in
Transducer
- Authors: Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma
- Abstract summary: We present a method to learn a Unified Speech-Text Representation in a Conformer Transducer (USTR-CT) to enable fast domain adaptation using a text-only corpus.
Experiments on adapting LibriSpeech to SPGISpeech show the proposed method reduces the word error rate (WER) by a relative 44% on the target domain.
- Score: 12.417314740402587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Domain adaptation using a text-only corpus is challenging in end-to-end (E2E)
speech recognition, and adaptation by synthesizing audio from text through TTS is
resource-consuming. We present a method to learn a Unified Speech-Text
Representation in a Conformer Transducer (USTR-CT) to enable fast domain
adaptation using a text-only corpus. Unlike the previous textogram
method, our work introduces an extra text encoder to learn the text
representation; this encoder is removed during inference, so no modification is
needed for online deployment. To improve the efficiency of adaptation, single-step and
multi-step adaptations are also explored. Experiments on adapting
LibriSpeech to SPGISpeech show the proposed method reduces the word error
rate (WER) by a relative 44% on the target domain, outperforming both the
TTS and textogram methods. We also show the proposed method can be
combined with internal language model estimation (ILME) to further improve
performance.
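The train/inference split described in the abstract — an auxiliary text encoder that feeds the same shared representation space as speech during adaptation, then is dropped entirely at inference — can be sketched as follows. This is a minimal illustrative sketch, not the paper's code; all class and function names are hypothetical, and the encoders are stubs standing in for Conformer layers.

```python
class SharedEncoder:
    """Stand-in for the shared Conformer encoder: maps a feature
    sequence to a representation sequence (here, a simple scaling)."""
    def forward(self, feats):
        return [x * 0.5 for x in feats]


class TextEncoder:
    """Auxiliary text encoder used only during text-only adaptation:
    embeds token ids into the same feature space as speech features."""
    def __init__(self, vocab_size):
        self.table = {i: float(i) / vocab_size for i in range(vocab_size)}

    def forward(self, tokens):
        return [self.table[t] for t in tokens]


def adapt_step(shared, text_encoder, text_tokens):
    """Text-only adaptation path: route text through the text encoder
    into the shared encoder (transducer loss computation omitted)."""
    return shared.forward(text_encoder.forward(text_tokens))


def infer(shared, speech_feats):
    """Inference path: the text encoder is removed, so the deployed
    model is identical to a standard speech-only transducer."""
    return shared.forward(speech_feats)
```

Because `infer` never touches `TextEncoder`, the online model's graph and latency are unchanged after adaptation, which is the deployment property the abstract emphasizes.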
Related papers
- FluentEditor2: Text-based Speech Editing by Modeling Multi-Scale Acoustic and Prosody Consistency [40.95700389032375]
Text-based speech editing (TSE) allows users to edit speech by modifying the corresponding text directly without altering the original recording.
Current TSE techniques often focus on minimizing discrepancies between generated speech and reference within edited regions during training to achieve fluent TSE performance.
We propose a new fluent speech editing scheme based on our previous FluentEditor model, termed FluentEditor2.
arXiv Detail & Related papers (2024-09-28T10:18:35Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency [44.7425844190807]
Text-based speech editing (TSE) techniques are designed to enable users to edit the output audio by modifying the input text transcript instead of the audio itself.
We propose a fluent speech editing model, termed FluentEditor, which incorporates a fluency-aware training criterion into TSE training.
Subjective and objective experimental results on VCTK demonstrate that FluentEditor outperforms all advanced baselines in terms of naturalness and fluency.
arXiv Detail & Related papers (2023-09-21T01:58:01Z)
- Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation [67.98338382984556]
Mapping the two modalities, speech and text, into a shared representation space is an established approach to using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains.
In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align it with the text modality.
Our ASR model can learn unified representations from both modalities better, allowing for domain adaptation using text-only data of the target domain.
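The down-sampling idea above — shortening the acoustic frame sequence so its length is comparable to the text token sequence before matching the two representations — can be illustrated with a simple average-pooling sketch. This is an assumption-laden toy example, not the paper's method; the pooling factor and function name are hypothetical.

```python
def downsample(acoustic_frames, factor):
    """Average-pool consecutive acoustic frames so the sequence length
    shrinks by roughly `factor`, moving it toward text-token granularity."""
    return [
        sum(acoustic_frames[i:i + factor]) / len(acoustic_frames[i:i + factor])
        for i in range(0, len(acoustic_frames), factor)
    ]
```

With the two sequences at similar lengths, a frame-to-token alignment loss between the pooled acoustic representation and the text representation becomes straightforward to compute.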
arXiv Detail & Related papers (2023-09-04T08:52:59Z)
- M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [66.92823764664206]
We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text.
While shrinking the speech sequence, M-Adapter produces features desired for speech-to-text translation.
Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
arXiv Detail & Related papers (2022-07-03T04:26:53Z)
- Text Revision by On-the-Fly Representation Optimization [76.11035270753757]
Current state-of-the-art methods formulate text revision tasks as sequence-to-sequence learning problems.
We present an iterative in-place editing approach for text revision, which requires no parallel data.
It achieves competitive and even better performance than state-of-the-art supervised methods on text simplification.
arXiv Detail & Related papers (2022-04-15T07:38:08Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.