Transfer Learning from Pre-trained Language Models Improves End-to-End
Speech Summarization
- URL: http://arxiv.org/abs/2306.04233v1
- Date: Wed, 7 Jun 2023 08:23:58 GMT
- Title: Transfer Learning from Pre-trained Language Models Improves End-to-End
Speech Summarization
- Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka,
Takatomo Kano, Atsunori Ogawa, Marc Delcroix
- Abstract summary: End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and to output unnatural sentences.
We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
- Score: 48.35495352015281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end speech summarization (E2E SSum) directly summarizes input speech
into easy-to-read short sentences with a single model. This approach is
promising because, in contrast to the conventional cascade approach, it can
utilize the full acoustic information and mitigate the propagation of
transcription errors. However, due to the high cost of collecting
speech-summary pairs, an E2E SSum model tends to suffer from training data
scarcity and to output unnatural sentences. To overcome this drawback, we propose
for the first time to integrate a pre-trained language model (LM), which is
highly capable of generating natural sentences, into the E2E SSum decoder via
transfer learning. In addition, to reduce the gap between the independently
pre-trained encoder and decoder, we also propose to transfer the baseline E2E
SSum encoder instead of the commonly used automatic speech recognition encoder.
Experimental results show that the proposed model outperforms the baseline and
data-augmented models.
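To make the transfer-learning recipe above concrete, here is a minimal PyTorch sketch: the decoder is warm-started from a pre-trained LM checkpoint and the encoder from a baseline E2E SSum checkpoint. The module layout, checkpoint paths, and state-dict key prefixes are all assumptions for illustration; this is not the authors' released code.

```python
# Minimal sketch (not the authors' code) of the transfer-learning setup
# described in the abstract: the text decoder is initialized from a
# pre-trained LM, the speech encoder from a baseline E2E SSum model.
# Checkpoint paths and key prefixes below are hypothetical.
import torch
import torch.nn as nn

class E2ESSum(nn.Module):
    def __init__(self, d_model=512, vocab_size=8000):
        super().__init__()
        # Speech encoder (placeholder for the baseline E2E SSum encoder).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=12)
        # Text decoder (the part warm-started from the pre-trained LM).
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats, summary_tokens):
        memory = self.encoder(speech_feats)              # (B, T, d_model)
        L = summary_tokens.size(1)
        # Additive causal mask for teacher-forced decoding.
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        hidden = self.decoder(self.embed(summary_tokens), memory,
                              tgt_mask=causal)
        return self.out_proj(hidden)                     # (B, L, vocab_size)

def transfer_init(model, lm_ckpt="lm.pt", ssum_ckpt="baseline_ssum.pt"):
    """Warm-start the decoder from an LM and the encoder from a baseline
    E2E SSum checkpoint. strict=False leaves parameters the LM does not
    have (e.g. cross-attention blocks) at their random initialization."""
    lm_state = torch.load(lm_ckpt, map_location="cpu")
    model.decoder.load_state_dict(
        {k.removeprefix("decoder."): v for k, v in lm_state.items()
         if k.startswith("decoder.")}, strict=False)
    ssum_state = torch.load(ssum_ckpt, map_location="cpu")
    model.encoder.load_state_dict(
        {k.removeprefix("encoder."): v for k, v in ssum_state.items()
         if k.startswith("encoder.")})
```

Only the key mapping differs between the two weight sources; freezing or gradually unfreezing the transferred decoder during fine-tuning is a common variant of this recipe.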
Related papers
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance the model performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% compared with the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement [27.87144563354033]
We propose a novel regularization method for model training to improve the agreement of dual-path decomposition within triplet data.
Experiments on the MuST-C benchmark demonstrate that our proposed approach significantly outperforms state-of-the-art E2E-ST baselines.
arXiv Detail & Related papers (2021-12-21T05:24:01Z)
- Speech Summarization using Restricted Self-Attention [79.89680891246827]
We introduce a single model optimized end-to-end for speech summarization.
We demonstrate that the proposed model learns to directly summarize speech on the How-2 corpus of instructional videos.
arXiv Detail & Related papers (2021-10-12T18:21:23Z)
- Speech-language Pre-training for End-to-end Spoken Language Understanding [18.548949994603213]
We propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder.
The experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method.
arXiv Detail & Related papers (2021-02-11T21:55:48Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models; a hedged code sketch of this fusion idea follows this list.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
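As a second illustration, the encoder-fusion idea in the entry above ("Efficiently Fusing Pretrained Acoustic and Linguistic Encoders...") can be sketched in a few lines with standard Hugging Face checkpoints. The linear bridge, the CTC head, and the choice to run BERT over projected acoustic frames are assumptions made for illustration; the paper's exact fusion may differ.

```python
# Hedged illustration of fusing a pre-trained acoustic encoder (wav2vec 2.0)
# with a pre-trained linguistic encoder (BERT) for end-to-end ASR.
# The linear bridge and CTC head are hypothetical additions, not the
# referenced paper's exact wiring.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

class FusedASR(nn.Module):
    def __init__(self, vocab_size=32):
        super().__init__()
        self.acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.linguistic = BertModel.from_pretrained("bert-base-uncased")
        # Map 768-d wav2vec 2.0 frames into BERT's embedding space.
        self.bridge = nn.Linear(768, 768)
        self.ctc_head = nn.Linear(768, vocab_size)   # blank + characters

    def forward(self, waveform):                     # (B, num_samples)
        frames = self.acoustic(waveform).last_hidden_state   # (B, T', 768)
        # Run BERT's stack over the projected acoustic frames.
        fused = self.linguistic(
            inputs_embeds=self.bridge(frames)).last_hidden_state
        return self.ctc_head(fused)                  # per-frame CTC logits

model = FusedASR()
logits = model(torch.randn(1, 16000))  # one second of 16 kHz audio
```

Note that BERT's 512-position limit caps how long an utterance this particular wiring can handle; real systems typically downsample or chunk the acoustic frames.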