Attention-based Multi-hypothesis Fusion for Speech Summarization
- URL: http://arxiv.org/abs/2111.08201v1
- Date: Tue, 16 Nov 2021 03:00:29 GMT
- Title: Attention-based Multi-hypothesis Fusion for Speech Summarization
- Authors: Takatomo Kano, Atsunori Ogawa, Marc Delcroix, and Shinji Watanabe
- Abstract summary: Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS)
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
- Score: 83.04957603852571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech summarization, which generates a text summary from speech, can be
achieved by combining automatic speech recognition (ASR) and text summarization
(TS). With this cascade approach, we can exploit state-of-the-art models and
large training datasets for both subtasks, i.e., Transformer for ASR and
Bidirectional Encoder Representations from Transformers (BERT) for TS. However,
ASR errors directly affect the quality of the output summary in the cascade
approach. We propose a cascade speech summarization model that is robust to ASR
errors and that exploits multiple hypotheses generated by ASR to attenuate the
effect of ASR errors on the summary. We investigate several schemes to combine
ASR hypotheses. First, we propose using the sum of sub-word embedding vectors
weighted by their posterior values provided by an ASR system as an input to a
BERT-based TS system. Then, we introduce a more general scheme that uses an
attention-based fusion module added to a pre-trained BERT module to align and
combine several ASR hypotheses. Finally, we perform speech summarization
experiments on the How2 dataset and a newly assembled TED-based dataset that we
will release with this paper. These experiments show that retraining the
BERT-based TS system with these schemes can improve summarization performance
and that the attention-based fusion module is particularly effective.
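Below is a minimal sketch of the first fusion scheme described in the abstract: feeding a BERT-based summarizer the sum of sub-word embedding vectors weighted by ASR posterior probabilities instead of hard token ids. The confusion-network-style input, the variable names, and the use of Hugging Face `transformers` are illustrative assumptions, not the paper's implementation.

```python
# Sketch (assumption, not the paper's code): posterior-weighted sub-word
# embeddings fed to a BERT encoder via inputs_embeds instead of token ids.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
embedding_table = bert.get_input_embeddings()  # nn.Embedding: vocab_size x hidden

# Toy "confusion network": at each position, ASR candidate words with posteriors.
# In practice these would come from the ASR lattice or aligned n-best hypotheses.
positions = [
    [("speech", 0.9), ("peach", 0.1)],
    [("summarization", 0.7), ("summation", 0.3)],
]

fused_vectors = []
for candidates in positions:
    vec = torch.zeros(embedding_table.embedding_dim)
    for word, posterior in candidates:
        # A word may split into several sub-word pieces; average their embeddings.
        piece_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        piece_emb = embedding_table(torch.tensor(piece_ids)).mean(dim=0)
        vec = vec + posterior * piece_emb
    fused_vectors.append(vec)

inputs_embeds = torch.stack(fused_vectors).unsqueeze(0)  # (1, seq_len, hidden)
outputs = bert(inputs_embeds=inputs_embeds)
print(outputs.last_hidden_state.shape)  # (1, 2, 768)
```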
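For the second scheme, the following is a minimal sketch of an attention-based fusion module that aligns and combines several ASR hypotheses on top of pre-trained BERT encodings. The specific structure shown (cross-attention from the 1-best hypothesis encoding to the other hypotheses, with a residual connection) is an assumption for illustration, not the exact architecture in the paper.

```python
# Sketch (assumption): cross-attention fusion over multiple hypothesis encodings.
import torch
import torch.nn as nn

class HypothesisFusion(nn.Module):
    def __init__(self, hidden_size: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, primary: torch.Tensor, others: torch.Tensor) -> torch.Tensor:
        # primary: (batch, len_1, hidden)    encoding of the 1-best hypothesis
        # others:  (batch, len_rest, hidden) concatenated encodings of other hypotheses
        fused, _ = self.cross_attn(query=primary, key=others, value=others)
        # Residual connection keeps the pre-trained BERT representation intact.
        return self.norm(primary + fused)

# Toy usage: a 20-token 1-best hypothesis attending to two 20-token alternatives.
fusion = HypothesisFusion()
primary = torch.randn(1, 20, 768)
others = torch.randn(1, 40, 768)
print(fusion(primary, others).shape)  # torch.Size([1, 20, 768])
```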
Related papers
- Crossmodal ASR Error Correction with Discrete Speech Units [16.58209270191005]
We propose a post-ASR processing approach for ASR Error Correction (AEC)
We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon.
We propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality.
arXiv Detail & Related papers (2024-05-26T19:58:38Z)
- Factual Consistency Oriented Speech Recognition [23.754107608608106]
The proposed framework optimizes the ASR model to maximize an expected factual consistency score between ASR hypotheses and ground-truth transcriptions.
It is shown that training the ASR models with the proposed framework improves the speech summarization quality as measured by the factual consistency of meeting conversation summaries.
arXiv Detail & Related papers (2023-02-24T00:01:41Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU)
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- RED-ACE: Robust Error Detection for ASR using Confidence Embeddings [5.4693121539705984]
We propose to utilize the ASR system's word-level confidence scores for improving AED performance.
We add an ASR Confidence Embedding layer to the AED model's encoder, allowing us to jointly encode the confidence scores and the transcribed text into a contextualized representation.
arXiv Detail & Related papers (2022-03-14T15:13:52Z)
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
- N-Best ASR Transformer: Enhancing SLU Performance using Multiple ASR Hypotheses [0.0]
Spoken Language Understanding (SLU) parses speech into semantic structures like dialog acts and slots.
We show that our approach significantly outperforms the prior state-of-the-art when subjected to the low data regime.
arXiv Detail & Related papers (2021-06-11T17:29:00Z)
- Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling [76.43479696760996]
We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition.
We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR.
arXiv Detail & Related papers (2020-10-12T21:12:56Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR)
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU)
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)