Text-Only Domain Adaptation for End-to-End Speech Recognition through
Down-Sampling Acoustic Representation
- URL: http://arxiv.org/abs/2309.02459v1
- Date: Mon, 4 Sep 2023 08:52:59 GMT
- Title: Text-Only Domain Adaptation for End-to-End Speech Recognition through
Down-Sampling Acoustic Representation
- Authors: Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You,
Dan Su, Dong Yu, Helen Meng
- Abstract summary: Mapping the two modalities, speech and text, into a shared representation space is a line of research on using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains.
In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align it with the text modality.
Our ASR model can better learn unified representations from both modalities, allowing domain adaptation using text-only data from the target domain.
- Score: 67.98338382984556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mapping the two modalities, speech and text, into a shared representation space
is a line of research on using text-only data to improve end-to-end automatic
speech recognition (ASR) performance in new domains. However, the lengths of the
speech representation and the text representation are inconsistent. Although
previous methods up-sample the text representation to align it with the acoustic
modality, the result may not match the actual expected durations. In this paper, we
propose a novel representation-matching strategy that down-samples the acoustic
representation to align it with the text modality. By introducing a continuous
integrate-and-fire (CIF) module that generates acoustic representations consistent
with the token length, our ASR model can better learn unified representations from both
modalities, allowing domain adaptation using text-only data from the
target domain. Experimental results on new-domain data demonstrate the
effectiveness of the proposed method.
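The CIF mechanism mentioned in the abstract is essentially an accumulate-and-emit loop over encoder frames: a predicted per-frame firing weight is summed, and each time the sum crosses a threshold the frames integrated so far are emitted as one token-level acoustic vector, so the down-sampled sequence has roughly one vector per output token. The sketch below is only a minimal, illustrative reading of that idea in PyTorch; the function name, shapes, and threshold handling are assumptions, not the authors' implementation.

```python
# Minimal, illustrative CIF down-sampling sketch (assumed shapes and names;
# not the authors' implementation).
# frames: (T, D) encoder outputs for one utterance.
# alphas: (T,) non-negative firing weights, typically rescaled during training
#         so that they sum to the number of target tokens.
import torch

def cif_downsample(frames: torch.Tensor, alphas: torch.Tensor,
                   threshold: float = 1.0) -> torch.Tensor:
    tokens = []
    accum = 0.0                               # firing weight accumulated so far
    integrated = torch.zeros_like(frames[0])  # weighted sum of the current token's frames
    for t in range(frames.size(0)):
        a = float(alphas[t])                  # firing weight of this frame
        if accum + a < threshold:
            # keep integrating this frame into the current token
            integrated = integrated + a * frames[t]
            accum += a
        else:
            # fire: spend only the weight needed to reach the threshold,
            # emit one token-level vector, and carry the leftover weight
            # into the next token (assumes a frame never fires twice)
            used = threshold - accum
            tokens.append(integrated + used * frames[t])
            leftover = a - used
            integrated = leftover * frames[t]
            accum = leftover
    if not tokens:
        return frames.new_zeros(0, frames.size(1))
    return torch.stack(tokens)                # (U, D), roughly one vector per token
```

Because the down-sampled acoustic sequence now tracks the token length, it can be matched against text embeddings of the same length in a shared space, which is what makes adaptation with text-only target-domain data possible.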
Related papers
- ASTRA: Aligning Speech and Text Representations for Asr without Sampling [20.925353958092874]
ASTRA is a novel method for improving Automatic Speech Recognition (ASR) through text injection.
Unlike prevailing techniques, ASTRA eliminates the need for sampling to match sequence lengths between speech and text modalities.
arXiv Detail & Related papers (2024-06-10T15:39:04Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Augmenting text for spoken language understanding with Large Language Models [13.240782495441275]
We show how to use transcript-semantic parse data (unpaired text) without corresponding speech.
We propose to prompt Large Language Models (LLMs) to generate unpaired text for existing and new domains.
Experiments show that unpaired text from existing and new domains improves performance by 2% and 30% in absolute Exact Match (EM), respectively.
arXiv Detail & Related papers (2023-09-17T22:25:34Z) - Improving Joint Speech-Text Representations Without Alignment [92.60384956736536]
We show that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length.
We argue that consistency losses could forgive length differences and simply assume the best alignment.
arXiv Detail & Related papers (2023-08-11T13:28:48Z) - Zero-shot Domain-sensitive Speech Recognition with Prompt-conditioning
Fine-tuning [11.585880477614495]
We show that our model can gain a Word Error Rate (WER) reduction of up to 33% on unseen datasets from various domains.
We extend our method to text-only fine-tuning to achieve domain sensitivity as well as domain adaptation.
arXiv Detail & Related papers (2023-07-18T06:45:43Z) - Text-only Domain Adaptation using Unified Speech-Text Representation in
Transducer [12.417314740402587]
We present a method to learn a Unified Speech-Text Representation in Conformer Transducer (USTR-CT) to enable fast domain adaptation using a text-only corpus.
Experiments on adapting LibriSpeech to SPGISpeech show that the proposed method reduces the word error rate (WER) by a relative 44% on the target domain.
arXiv Detail & Related papers (2023-06-07T00:33:02Z) - A Simple Baseline for Domain Adaptation in End to End ASR Systems Using
Synthetic Data [1.14219428942199]
We propose a simple baseline technique for domain adaptation in end-to-end speech recognition models.
We convert the text-only corpus to audio data using a single-speaker Text-to-Speech (TTS) engine.
We show that single-speaker synthetic TTS data, coupled with fine-tuning only the final dense layer, provides reasonable improvements in word error rate (a sketch of this recipe follows this list).
arXiv Detail & Related papers (2022-06-22T12:07:38Z) - Text Revision by On-the-Fly Representation Optimization [76.11035270753757]
Current state-of-the-art methods formulate these tasks as sequence-to-sequence learning problems.
We present an iterative in-place editing approach for text revision, which requires no parallel data.
It achieves competitive and even better performance than state-of-the-art supervised methods on text simplification.
arXiv Detail & Related papers (2022-04-15T07:38:08Z) - Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [52.86058031919856]
We propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition.
A global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission.
Results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method.
arXiv Detail & Related papers (2020-03-27T09:19:25Z) - Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting [49.768327669098674]
We propose an end-to-end trainable text spotting approach named Text Perceptron.
It first employs an efficient segmentation-based text detector that learns the latent text reading order and boundary information.
Then a novel Shape Transform Module (abbr. STM) is designed to transform the detected feature regions into regular morphologies.
arXiv Detail & Related papers (2020-02-17T08:07:19Z)
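For the synthetic-data baseline above (A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data), the sketch below illustrates the final-dense-layer-only fine-tuning step under stated assumptions: the output layer is assumed to be named `output_proj`, and the optimizer and learning rate are placeholders rather than the paper's settings. The synthetic (audio, text) pairs produced by the single-speaker TTS engine would then be fed to this optimizer as ordinary training batches.

```python
# Illustrative sketch of final-dense-layer-only fine-tuning (assumed layer
# name "output_proj", optimizer, and learning rate; not the paper's code).
import torch

def freeze_all_but_output(model: torch.nn.Module,
                          output_layer_name: str = "output_proj") -> torch.optim.Optimizer:
    # Freeze every parameter except those of the (assumed) final dense layer.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(output_layer_name)
    # Hand only the still-trainable parameters to the optimizer, so updates
    # from the synthetic TTS batches touch the final dense layer alone.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4)
```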