Speech-text based multi-modal training with bidirectional attention for improved speech recognition
- URL: http://arxiv.org/abs/2211.00325v1
- Date: Tue, 1 Nov 2022 08:25:11 GMT
- Title: Speech-text based multi-modal training with bidirectional attention for improved speech recognition
- Authors: Yuhang Yang, Haihua Xu, Hao Huang, Eng Siong Chng, Sheng Li
- Abstract summary: We propose to employ a novel bidirectional attention mechanism (BiAM) to jointly learn both the ASR encoder (bottom layers) and the text encoder with a multi-modal learning method.
BiAM facilitates feature sampling rate exchange, so that the quality of the transformed features of one modality can be measured in the space of the other.
Experimental results on the Librispeech corpus show up to 6.15% word error rate reduction (WERR) with paired data only, and up to 9.23% WERR when additional unpaired text data is employed.
- Score: 26.47071418582507
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For a state-of-the-art end-to-end ASR model to benefit from data efficiency,
as well as from much larger amounts of unpaired text data through multi-modal
training, two problems need to be addressed: 1) the synchronization of feature
sampling rates between speech and language (i.e., text) data; 2) the homogeneity
of the representations learned by the two encoders. In this paper we propose to
employ a novel bidirectional attention mechanism (BiAM) to jointly learn both the
ASR encoder (bottom layers) and the text encoder with a multi-modal learning
method. The BiAM facilitates feature sampling rate exchange, so that the quality
of the transformed features of one modality can be measured in the space of the
other, under diversified objective functions. As a result, the speech
representations are enriched with more linguistic information, while the
representations generated by the text encoder become more similar to the
corresponding speech ones, making the shared ASR model more amenable to
pretraining on unpaired text data. To validate the efficacy of the proposed
method, we perform two categories of experiments, with and without extra
unpaired text data. Experimental results on the Librispeech corpus show that the
method achieves up to 6.15% word error rate reduction (WERR) with paired data
only, and up to 9.23% WERR when additional unpaired text data is employed.
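As a rough illustration of the bidirectional attention idea described in the abstract, the sketch below cross-attends a speech encoder output and a text encoder output in both directions, so each modality is re-expressed at the other modality's sampling rate. The layer sizes, the single attention block, and the L2 matching loss are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of bidirectional cross-attention between speech and text
# encoder outputs (assumed shapes and dimensions; not the released BiAM code).
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Cross-attends speech and text encoder outputs in both directions."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.speech_query_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_query_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, speech_feats: torch.Tensor, text_feats: torch.Tensor):
        # speech_feats: (B, T_speech, d_model), text_feats: (B, T_text, d_model)
        # Speech frames query the text tokens -> text content at the speech frame rate.
        text_at_speech_rate, _ = self.speech_query_attn(speech_feats, text_feats, text_feats)
        # Text tokens query the speech frames -> speech content at the text token rate.
        speech_at_text_rate, _ = self.text_query_attn(text_feats, speech_feats, speech_feats)
        return text_at_speech_rate, speech_at_text_rate

if __name__ == "__main__":
    biam = BidirectionalCrossAttention()
    speech = torch.randn(2, 200, 256)  # 2 utterances, 200 acoustic frames
    text = torch.randn(2, 30, 256)     # 2 transcripts, 30 tokens
    text_sr, speech_tr = biam(speech, text)
    # One plausible objective: measure the transformed text features in the
    # speech space (and vice versa), e.g. with an L2 match.
    loss = torch.nn.functional.mse_loss(text_sr, speech)
    print(text_sr.shape, speech_tr.shape, loss.item())
```

Because each transformed sequence has the length of the other modality, losses of either kind (reconstruction, contrastive, CTC on the shared layers) can be applied in whichever space is convenient, which is the "measured in another space" property the abstract refers to.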
Related papers
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
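A minimal sketch of the joint-space idea, assuming an utterance-level symmetric InfoNCE objective between the two encoders' outputs; the actual CTAP training (pairing granularity, pooling, temperature) may differ.

```python
# Illustrative contrastive loss pulling paired phoneme and speech embeddings
# into a joint space (assumed utterance-level setup, not the CTAP recipe).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(speech_emb, phoneme_emb, temperature=0.07):
    # speech_emb, phoneme_emb: (B, d) embeddings from the two encoders
    speech_emb = F.normalize(speech_emb, dim=-1)
    phoneme_emb = F.normalize(phoneme_emb, dim=-1)
    logits = speech_emb @ phoneme_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(speech_emb.size(0), device=speech_emb.device)
    # Symmetric InfoNCE: each speech item matches its paired phoneme sequence and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    s = torch.randn(8, 256)  # 8 paired utterances
    p = torch.randn(8, 256)
    print(contrastive_alignment_loss(s, p).item())
```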
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Text Injection for Capitalization and Turn-Taking Prediction in Speech Models [45.94388391693112]
This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model.
We show results demonstrating that our text injection method boosts capitalization performance for long-tail data.
arXiv Detail & Related papers (2023-08-14T18:28:04Z)
- Improving Deliberation by Text-Only and Semi-Supervised Training [42.942428288428836]
We propose incorporating text-only and semi-supervised training into an attention-based deliberation model.
We achieve 4%-12% WER reduction for various tasks compared to the baseline deliberation.
We show that the deliberation model also achieves a positive human side-by-side evaluation.
arXiv Detail & Related papers (2022-06-29T15:30:44Z)
- MAESTRO: Matched Speech Text Representations through Modality Matching [35.566604806335626]
Maestro is a self-supervised training method to unify representations learnt from speech and text modalities.
We establish a new state-of-the-art (SOTA) on VoxPopuli multilingual ASR with an 11% relative reduction in Word Error Rate (WER).
We establish a new state-of-the-art (SOTA) on CoVoST 2 with an improvement of 2.8 BLEU averaged over 21 languages.
arXiv Detail & Related papers (2022-04-07T12:48:16Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech-pseudo-label pairs and synthesized-audio-text pairs, we propose a complementary joint training (CJT) method.
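A minimal sketch of what such a complementary training step could look like, with the ASR model, loss, and weighting left as placeholders; this is an assumption-based illustration, not the paper's recipe.

```python
# Illustrative joint update over the two complementary pair types.
import torch

def cjt_step(model, asr_loss, optimizer,
             real_speech, pseudo_labels,   # unpaired speech + labels from a seed ASR model
             synth_speech, real_text,      # unpaired text + TTS-synthesized audio
             synth_weight=1.0):
    optimizer.zero_grad()
    loss_speech_side = asr_loss(model(real_speech), pseudo_labels)
    loss_text_side = asr_loss(model(synth_speech), real_text)
    loss = loss_speech_side + synth_weight * loss_text_side
    loss.backward()
    optimizer.step()
    return loss.detach()

if __name__ == "__main__":
    # Toy stand-ins just to show the call pattern; a real setup would use an
    # encoder-decoder ASR model and a sequence loss such as CTC.
    toy_model = torch.nn.Linear(16, 16)
    opt = torch.optim.SGD(toy_model.parameters(), lr=0.1)
    mse = torch.nn.functional.mse_loss
    x1, y1 = torch.randn(4, 16), torch.randn(4, 16)
    x2, y2 = torch.randn(4, 16), torch.randn(4, 16)
    print(cjt_step(toy_model, mse, opt, x1, y1, x2, y2).item())
```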
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
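A minimal sketch of the target-switching idea, with the wav2vec 2.0 contrastive objective replaced by a simplified stand-in (no negative sampling or masking); the helper function and weighting are illustrative assumptions.

```python
# Illustrative "switched targets" loss: the contextual output of the original
# utterance must also predict the quantized targets of its noisy copy, and
# vice versa, encouraging noise-invariant representations.
import torch
import torch.nn.functional as F

def frame_contrastive_loss(context, targets):
    # Simplified stand-in for the wav2vec 2.0 contrastive objective:
    # cosine similarity between each frame and its target, no negatives.
    return (1 - F.cosine_similarity(context, targets, dim=-1)).mean()

def switched_contrastive_loss(c_orig, c_noisy, q_orig, q_noisy, switch_weight=1.0):
    # c_*: contextual encoder outputs, q_*: quantized targets, all (B, T, d)
    standard = frame_contrastive_loss(c_orig, q_orig) + frame_contrastive_loss(c_noisy, q_noisy)
    switched = frame_contrastive_loss(c_orig, q_noisy) + frame_contrastive_loss(c_noisy, q_orig)
    return standard + switch_weight * switched

if __name__ == "__main__":
    B, T, d = 2, 50, 64
    c1, c2, q1, q2 = (torch.randn(B, T, d) for _ in range(4))
    print(switched_contrastive_loss(c1, c2, q1, q2).item())
```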
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Learning to Match Jobs with Resumes from Sparse Interaction Data using Multi-View Co-Teaching Network [83.64416937454801]
Job-resume interaction data is sparse and noisy, which affects the performance of job-resume match algorithms.
We propose a novel multi-view co-teaching network from sparse interaction data for job-resume matching.
Our model is able to outperform state-of-the-art methods for job-resume matching.
arXiv Detail & Related papers (2020-09-25T03:09:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.