BASS: Block-wise Adaptation for Speech Summarization
- URL: http://arxiv.org/abs/2307.08217v1
- Date: Mon, 17 Jul 2023 03:31:36 GMT
- Title: BASS: Block-wise Adaptation for Speech Summarization
- Authors: Roshan Sharma, Kenneth Zheng, Siddhant Arora, Shinji Watanabe, Rita
Singh, Bhiksha Raj
- Abstract summary: We develop a method that allows one to train summarization models on very long sequences in an incremental manner.
Speech summarization is realized as a streaming process, where hypothesis summaries are updated every block.
Experiments on the How2 dataset demonstrate that the proposed block-wise training method improves by 3 points absolute on ROUGE-L over a truncated input baseline.
- Score: 47.518484305407185
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: End-to-end speech summarization has been shown to improve performance over
cascade baselines. However, such models are difficult to train on very large
inputs (dozens of minutes or hours) owing to compute restrictions and are hence
trained with truncated model inputs. Truncation leads to poorer models, and a
solution to this problem rests in block-wise modeling, i.e., processing a
portion of the input frames at a time. In this paper, we develop a method that
allows one to train summarization models on very long sequences in an
incremental manner. Speech summarization is realized as a streaming process,
where hypothesis summaries are updated every block based on new acoustic
information. We devise and test strategies to pass semantic context across the
blocks. Experiments on the How2 dataset demonstrate that the proposed
block-wise training method improves by 3 points absolute on ROUGE-L over a
truncated input baseline.
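The block-wise process described in the abstract can be sketched as a simple streaming loop: the long input is consumed one block of frames at a time, and a semantic context carried across blocks lets each step refine the hypothesis summary. This is a minimal illustration, not the paper's actual model; `encode`, `summarize`, and the form of the carried context are hypothetical placeholders.

```python
# Hypothetical sketch of block-wise streaming summarization.
# `encode` and `summarize` stand in for the model components; the
# paper's actual architecture and context-passing strategies differ.

def blockwise_summarize(frames, block_size, encode, summarize):
    """Process a long input in fixed-size blocks, updating the
    hypothesis summary after every block.

    frames     : sequence of acoustic feature frames
    block_size : number of frames processed per block
    encode     : block -> block-level representation (placeholder)
    summarize  : (context, representation) -> (summary, new context)
    """
    context = None      # semantic context passed across blocks
    hypothesis = ""
    for start in range(0, len(frames), block_size):
        block = frames[start:start + block_size]
        rep = encode(block)
        # Update the hypothesis summary from new acoustic information
        # plus the context accumulated over previous blocks.
        hypothesis, context = summarize(context, rep)
    return hypothesis
```

Because each block only sees its own frames plus the carried context, memory use per training step depends on the block size rather than the full input length, which is what makes training on dozens of minutes of audio feasible.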
Related papers
- Exploring Efficient Foundational Multi-modal Models for Video Summarization [15.418001616659808]
Video foundation models perform pre-training by aligning outputs from each modality-specific model into a shared embedding space.
We propose a plug-and-play video language model that feeds texts generated from each input modality into the language model.
We compare performance against computational cost for our plug-and-play method and baseline tuning methods.
arXiv Detail & Related papers (2024-10-09T20:07:06Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z) - Scalable Learning of Latent Language Structure With Logical Offline
Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- SegPrompt: Using Segmentation Map as a Better Prompt to Finetune Deep Models for Kidney Stone Classification [62.403510793388705]
Deep learning has produced encouraging results for kidney stone classification using endoscope images.
The shortage of annotated training data poses a severe problem in improving the performance and generalization ability of the trained model.
We propose SegPrompt to alleviate the data shortage problems by exploiting segmentation maps from two aspects.
arXiv Detail & Related papers (2023-03-15T01:30:48Z)
- Latent Iterative Refinement for Modular Source Separation [44.78689915209527]
Traditional source separation approaches train deep neural network models end-to-end with all the data available at once.
We argue that we can significantly increase resource efficiency during both training and inference stages.
arXiv Detail & Related papers (2022-11-22T00:02:57Z)
- Blockwise Sequential Model Learning for Partially Observable Reinforcement Learning [14.642266310020505]
This paper proposes a new sequential model learning architecture to solve partially observable Markov decision problems.
The proposed architecture generates a latent variable in each data block with a length of multiple timesteps and passes the most relevant information to the next block for policy optimization.
Numerical results show that the proposed method significantly outperforms previous methods in various partially observable environments.
arXiv Detail & Related papers (2021-12-10T05:38:24Z)
- Speech Summarization using Restricted Self-Attention [79.89680891246827]
We introduce a single model optimized end-to-end for speech summarization.
We demonstrate that the proposed model learns to directly summarize speech for the How-2 corpus of instructional videos.
arXiv Detail & Related papers (2021-10-12T18:21:23Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- PGT: A Progressive Method for Training Models on Long Videos [45.935259079953255]
The mainstream method is to split a raw video into clips, which leads to an incomplete temporal information flow.
Inspired by natural language processing techniques dealing with long sentences, we propose to treat videos as serial fragments satisfying Markov property.
We empirically demonstrate that it yields significant performance improvements on different models and datasets.
arXiv Detail & Related papers (2021-03-21T06:15:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.