Monotonic segmental attention for automatic speech recognition
- URL: http://arxiv.org/abs/2210.14742v1
- Date: Wed, 26 Oct 2022 14:21:23 GMT
- Title: Monotonic segmental attention for automatic speech recognition
- Authors: Albert Zeyer, Robin Schmitt, Wei Zhou, Ralf Schlüter, Hermann Ney
- Abstract summary: We introduce a novel segmental-attention model for automatic speech recognition.
We compare global-attention and different segmental-attention modeling variants.
We observe that the segmental model generalizes much better to long sequences of up to several minutes.
- Score: 45.036436385637295
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce a novel segmental-attention model for automatic speech
recognition. We restrict the decoder attention to segments to avoid quadratic
runtime of global attention, better generalize to long sequences, and
eventually enable streaming. We directly compare global-attention and different
segmental-attention modeling variants. We develop and compare two separate
time-synchronous decoders, one specifically taking the segmental nature into
account, yielding further improvements. Using time-synchronous decoding for
segmental models is novel and a step towards streaming applications. Our
experiments show the importance of a length model to predict the segment
boundaries. The final best segmental-attention model using segmental decoding
performs better than global-attention, in contrast to other monotonic attention
approaches in the literature. Further, we observe that the segmental model
generalizes much better to long sequences of up to several minutes.
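The restriction of decoder attention to segments can be illustrated with a minimal sketch. It is not the authors' implementation. The dot-product scoring, the function names, and the assumption that segment boundaries are supplied as inclusive end-frame indices are all hypothetical choices made for illustration; in the paper the boundaries are predicted by a length model.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, frames):
    # Dot-product attention of one decoder query over a list of encoder frames.
    # (Dot-product scoring is an assumption made for this sketch.)
    scores = [sum(q * h for q, h in zip(query, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    context = [sum(w * frame[d] for w, frame in zip(weights, frames))
               for d in range(dim)]
    return context, weights


def global_attention(queries, encoder):
    # Every query scores every encoder frame: S * T score computations.
    return [attend(q, encoder) for q in queries]


def segmental_attention(queries, encoder, boundaries):
    # boundaries[s] is the inclusive last frame index of segment s (hypothetical
    # input format). Segments are contiguous and monotonic, so each frame is
    # scored by exactly one query: T score computations in total.
    outputs = []
    start = 0
    for query, end in zip(queries, boundaries):
        outputs.append(attend(query, encoder[start:end + 1]))
        start = end + 1
    return outputs


encoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0], [2.0, 0.0]]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
segmental = segmental_attention(queries, encoder, [1, 3, 5])
print([len(weights) for _, weights in segmental])
```

With three queries and six frames, each segmental attention distribution covers only the two frames of its own segment, while global attention would produce three distributions over all six frames.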
Related papers
- Linguistically Motivated Sign Language Segmentation [51.06873383204105]
We consider two kinds of segmentation: segmentation into individual signs and segmentation into phrases.
Our method is motivated by linguistic cues observed in sign language corpora.
We replace the predominant IO tagging scheme with BIO tagging to account for continuous signing.
arXiv Detail & Related papers (2023-10-21T10:09:34Z)
- Temporal Segment Transformer for Action Segmentation [54.25103250496069]
We propose an attention-based approach, which we call temporal segment transformer, for joint segment relation modeling and denoising.
The main idea is to denoise segment representations using attention between segment and frame representations, and also use inter-segment attention to capture temporal correlations between segments.
We show that this novel architecture achieves state-of-the-art accuracy on the popular 50Salads, GTEA and Breakfast benchmarks.
arXiv Detail & Related papers (2023-02-25T13:05:57Z)
- Smart Speech Segmentation using Acousto-Linguistic Features with look-ahead [3.579111205766969]
We present a hybrid approach that leverages both acoustic and language information to improve segmentation.
On average, our models improve segmentation-F0.5 score by 9.8% over baseline.
For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
arXiv Detail & Related papers (2022-10-26T03:36:31Z)
- Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring [23.822788597966646]
Recent work on unsupervised speech segmentation has used self-supervised models with a phone segmentation module and a word segmentation module that are trained jointly.
This paper compares this joint methodology with an older idea: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units.
I specifically describe a duration-penalized dynamic programming (DPDP) procedure that can be used for either phone or word segmentation by changing the self-supervised scoring network that gives segment costs.
arXiv Detail & Related papers (2022-02-24T07:02:56Z)
- Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together.
We show that our per-frame computation model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z)
- A study of latent monotonic attention variants [65.73442960456013]
End-to-end models reach state-of-the-art performance for speech recognition, but global soft attention is not monotonic.
We present a mathematically clean solution to introduce monotonicity, by introducing a new latent variable.
We show that our monotonic models perform as well as the global soft attention model.
arXiv Detail & Related papers (2021-03-30T22:35:56Z)
- GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.