Leveraging Large Text Corpora for End-to-End Speech Summarization
- URL: http://arxiv.org/abs/2303.00978v1
- Date: Thu, 2 Mar 2023 05:19:49 GMT
- Title: Leveraging Large Text Corpora for End-to-End Speech Summarization
- Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka,
Atsunori Ogawa, Marc Delcroix, Ryo Masumura
- Abstract summary: End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech.
We present two novel methods that leverage a large amount of external text summarization data for E2E SSum training.
- Score: 58.673480990374635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end speech summarization (E2E SSum) is a technique to directly
generate summary sentences from speech. Compared with the cascade approach,
which combines automatic speech recognition (ASR) and text summarization
models, the E2E approach is more promising because it mitigates ASR errors,
incorporates nonverbal information, and simplifies the overall system. However,
since collecting a large amount of paired data (i.e., speech and summary) is
difficult, the training data is usually insufficient to train a robust E2E SSum
system. In this paper, we present two novel methods that leverage a large
amount of external text summarization data for E2E SSum training. The first
technique is to utilize a text-to-speech (TTS) system to generate synthesized
speech, which is used for E2E SSum training with the text summary. The second
is a TTS-free method that directly inputs a phoneme sequence instead of
synthesized speech to the E2E SSum model. Experiments show that our proposed
TTS- and phoneme-based methods improve several metrics on the How2 dataset. In
particular, our best system outperforms a previous state-of-the-art one by a
large margin (i.e., METEOR score improvements of more than 6 points). To the
best of our knowledge, this is the first work to use external language
resources for E2E SSum. Moreover, we report a detailed analysis of the How2
dataset to confirm the validity of our proposed E2E SSum system.
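
As a concrete illustration of the abstract, the sketch below shows how external text-only summarization pairs (source document, summary) could be converted into additional E2E SSum training examples, either by synthesizing speech for the source text (method 1) or by replacing speech with a phoneme sequence (method 2). This is a minimal sketch rather than the authors' implementation: `synthesize_speech`, `text_to_phonemes`, and `SSumExample` are hypothetical placeholders, not components described in the paper.

```python
# A minimal sketch (not the authors' released code) of the two strategies for
# turning external text-only summarization pairs into extra E2E SSum training
# examples. `synthesize_speech` and `text_to_phonemes` are hypothetical
# placeholders for a TTS system and a grapheme-to-phoneme tool.
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class SSumExample:
    # Either a synthesized waveform (method 1) or a phoneme token sequence
    # (method 2), paired with the reference text summary as the target.
    source: Union[List[float], List[str]]
    summary: str


def synthesize_speech(text: str) -> List[float]:
    """Placeholder: run a TTS system on the source document and return audio."""
    raise NotImplementedError("plug in a real TTS model here")


def text_to_phonemes(text: str) -> List[str]:
    """Placeholder: convert the source document into a phoneme sequence."""
    raise NotImplementedError("plug in a real grapheme-to-phoneme converter here")


def make_tts_examples(text_pairs: List[Tuple[str, str]]) -> List[SSumExample]:
    # Method 1: synthesize speech for each source document, keep the summary.
    return [SSumExample(synthesize_speech(doc), summ) for doc, summ in text_pairs]


def make_phoneme_examples(text_pairs: List[Tuple[str, str]]) -> List[SSumExample]:
    # Method 2 (TTS-free): feed phoneme sequences to the E2E SSum model directly.
    return [SSumExample(text_to_phonemes(doc), summ) for doc, summ in text_pairs]
```

In either case the augmented examples would be mixed with the real speech-summary pairs during E2E SSum training; the sketch does not specify a mixing ratio or training schedule.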
Related papers
- Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation [44.332577357986324]
Sen-SSum generates text summaries from a spoken document in a sentence-by-sentence manner.
We present two datasets for Sen-SSum: Mega-SSum and CSJ-SSum.
arXiv Detail & Related papers (2024-08-01T00:18:21Z)
- Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis [17.604583337593677]
Training a high-performance end-to-end (E2E) speech processing model requires an enormous amount of labeled speech data.
We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech processing models.
arXiv Detail & Related papers (2023-10-09T03:10:49Z)
- Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences.
We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
- Towards End-to-end Speech-to-text Summarization [0.0]
Speech-to-text (S2T) summarization is a time-saving technique for filtering and keeping up with broadcast news uploaded online daily.
End-to-end (E2E) modelling of S2T abstractive summarization is a promising approach that offers the possibility of generating rich latent representations.
We model S2T summarization with both a cascade and an E2E system for a corpus of broadcast news in French.
arXiv Detail & Related papers (2023-06-06T15:22:16Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems [2.4909170697740963]
We propose a contextual density ratio approach for both training a context-aware E2E model and adapting the language model to named entities.
Our proposed technique achieves a relative improvement of up to 46.5% on named entities over an E2E baseline without degrading the overall recognition accuracy of the whole test set.
arXiv Detail & Related papers (2022-06-29T13:12:46Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Have best of both worlds: two-pass hybrid and E2E cascading framework for speech recognition [71.30167252138048]
Hybrid and end-to-end (E2E) systems have different error patterns in the speech recognition results.
This paper proposes a two-pass hybrid and E2E cascading (HEC) framework to combine the hybrid and E2E models.
We show that the proposed system achieves 8-10% relative word error rate reduction with respect to each individual system.
arXiv Detail & Related papers (2021-10-10T20:11:38Z)
- Exploring Transfer Learning For End-to-End Spoken Language Understanding [8.317084844841323]
An end-to-end (E2E) system that goes directly from speech to a hypothesis is a more attractive option.
We propose an E2E system that is designed to jointly train on multiple speech-to-text tasks.
We show that it beats the performance of E2E models trained on individual tasks.
arXiv Detail & Related papers (2020-12-15T19:02:15Z)
- End-to-end Named Entity Recognition from English Speech [51.22888702264816]
We introduce the first publicly available NER-annotated dataset for English speech and present an E2E approach that jointly optimizes the ASR and NER tagger components.
We also discuss how NER from speech can be used to handle out-of-vocabulary (OOV) words in an ASR system.
arXiv Detail & Related papers (2020-05-22T13:39:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.