Speech Summarization using Restricted Self-Attention
- URL: http://arxiv.org/abs/2110.06263v1
- Date: Tue, 12 Oct 2021 18:21:23 GMT
- Title: Speech Summarization using Restricted Self-Attention
- Authors: Roshan Sharma, Shruti Palaskar, Alan W Black and Florian Metze
- Abstract summary: We introduce a single model optimized end-to-end for speech summarization.
We demonstrate that the proposed model learns to directly summarize speech for the How-2 corpus of instructional videos.
- Score: 79.89680891246827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech summarization is typically performed by using a cascade of speech
recognition and text summarization models. End-to-end modeling of speech
summarization is challenging due to memory and compute constraints
arising from long input audio sequences. Recent work in document summarization
has inspired methods to reduce the complexity of self-attention, enabling
transformer models to handle long sequences. In this work, we introduce a
single model optimized end-to-end for speech summarization. We apply the
restricted self-attention technique from text-based models to speech models to
address the memory and compute constraints. We demonstrate that the proposed
model learns to directly summarize speech for the How-2 corpus of instructional
videos. The proposed end-to-end model outperforms the previously proposed
cascaded model by 3 points absolute on ROUGE. Further, we consider the spoken
language understanding task of predicting concepts from speech inputs and show
that the proposed end-to-end model outperforms the cascade model by 4 points
absolute F-1.
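To make the restriction concrete, below is a minimal sketch of restricted (windowed) self-attention, assuming a PyTorch implementation; the function name `restricted_self_attention` and the window size are illustrative and not taken from the paper's code. For clarity, the sketch applies a dense band mask rather than the memory-efficient banded computation an actual implementation would use.

```python
# Minimal sketch of restricted (windowed) self-attention.
# Assumptions: PyTorch, a single attention head, and a fixed window;
# all names here are illustrative, not the authors' implementation.
import torch
import torch.nn.functional as F

def restricted_self_attention(q, k, v, window):
    """Scaled dot-product attention where position i attends only to
    positions j with |i - j| <= window. q, k, v: (batch, T, d)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5       # (batch, T, T)
    t = torch.arange(q.size(1))
    band = (t[None, :] - t[:, None]).abs() <= window  # (T, T) band mask
    scores = scores.masked_fill(~band, float("-inf"))
    return F.softmax(scores, dim=-1) @ v              # (batch, T, d)

# Example: 2 utterances, 1000 encoder frames, 64-dim features,
# each frame attending within +/- 50 frames.
x = torch.randn(2, 1000, 64)
out = restricted_self_attention(x, x, x, window=50)
print(out.shape)  # torch.Size([2, 1000, 64])
```

Note that this dense-mask version still materializes the full T x T score matrix; the memory savings of restricted self-attention come from computing only the scores inside the band, which cuts the cost from O(T^2) to O(T * window) for long input sequences.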
Related papers
- BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval [3.347768376390811]
We investigate whether non-textual information, which is overlooked by pipeline-based models, can be leveraged to improve speech-image matching performance.
Our approach achieves a substantial performance gain over the previous state-of-the-art by leveraging strong pretrained models, a prompting mechanism and a bifurcated design.
arXiv Detail & Related papers (2024-08-19T19:56:10Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
- Model Extraction Attack against Self-supervised Speech Models [52.81330435990717]
Self-supervised learning (SSL) speech models generate meaningful representations of input speech clips.
Model extraction attack (MEA) often refers to an adversary stealing the functionality of the victim model with only query access.
We study the MEA problem against SSL speech models with a small number of queries.
arXiv Detail & Related papers (2022-11-29T09:28:05Z)
- End-to-end model for named entity recognition from speech without paired training data [12.66131972249388]
We propose an approach to build an end-to-end neural model to extract semantic information.
Our approach is based on the use of an external model trained to generate a sequence of vectorial representations from text.
Experiments on named entity recognition, carried out on the QUAERO corpus, show that this approach is very promising.
arXiv Detail & Related papers (2022-04-02T08:14:27Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.