Variable Attention Masking for Configurable Transformer Transducer
Speech Recognition
- URL: http://arxiv.org/abs/2211.01438v2
- Date: Tue, 18 Apr 2023 09:59:52 GMT
- Title: Variable Attention Masking for Configurable Transformer Transducer
Speech Recognition
- Authors: Pawel Swietojanski, Stefan Braun, Dogan Can, Thiago Fraga da Silva,
Arnab Ghoshal, Takaaki Hori, Roger Hsiao, Henry Mason, Erik McDermott, Honza
Silovsky, Ruchir Travadi, Xiaodan Zhuang
- Abstract summary: We study the use of attention masking in transformer transducer based speech recognition.
We show that chunked masking achieves a better accuracy vs latency trade-off compared to fixed masking.
We also show that variable masking improves the accuracy by up to 8% relative in the acoustic re-scoring scenario.
- Score: 23.546294634238677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies the use of attention masking in transformer transducer
based speech recognition for building a single configurable model for different
deployment scenarios. We present a comprehensive set of experiments comparing
fixed masking, where the same attention mask is applied at every frame, with
chunked masking, where the attention mask for each frame is determined by chunk
boundaries, in terms of recognition accuracy and latency. We then explore the
use of variable masking, where the attention masks are sampled from a target
distribution at training time, to build models that can work in different
configurations. Finally, we investigate how a single configurable model can be
used to perform both first pass streaming recognition and second pass acoustic
rescoring. Experiments show that chunked masking achieves a better accuracy vs
latency trade-off compared to fixed masking, both with and without FastEmit. We
also show that variable masking improves the accuracy by up to 8% relative in
the acoustic re-scoring scenario.
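The masking strategies compared in the paper can be pictured as boolean encoder self-attention masks. Below is a minimal PyTorch sketch of a chunked mask (function name and arguments are illustrative, not from the paper): every frame may attend to all frames in its own chunk plus a fixed number of chunks to the left, so the mask is determined by chunk boundaries rather than by a per-frame sliding window as in fixed masking.

```python
import torch

def chunked_attention_mask(num_frames: int, chunk_size: int,
                           left_chunks: int) -> torch.Tensor:
    """Boolean (T, T) mask; True = query frame may attend to key frame."""
    # Chunk index of each frame, e.g. chunk_size=4 -> [0,0,0,0,1,1,1,1,...]
    chunk_idx = torch.arange(num_frames) // chunk_size
    q = chunk_idx.unsqueeze(1)  # (T, 1) query-frame chunks
    k = chunk_idx.unsqueeze(0)  # (1, T) key-frame chunks
    # Attend within the current chunk and up to `left_chunks` chunks back;
    # no future chunks, so streaming look-ahead is bounded by the chunk end.
    return (k <= q) & (k >= q - left_chunks)

# Variable masking, as described in the abstract, would instead sample the
# mask configuration (e.g. chunk size and left context) from a target
# distribution at training time rather than fixing it.
mask = chunked_attention_mask(num_frames=12, chunk_size=4, left_chunks=1)
```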
Related papers
- Mask-Weighted Spatial Likelihood Coding for Speaker-Independent Joint Localization and Mask Estimation [14.001679439460359]
Time-frequency masks and the speakers' directions relative to a fixed spatial grid can be used to estimate the beamformer's parameters.
We analyze how to encode both mask and positioning into such a grid to enable joint estimation of both quantities.
arXiv Detail & Related papers (2024-10-25T14:43:32Z)
- Pluralistic Salient Object Detection [108.74650817891984]
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image.
We present two new SOD datasets "DUTS-MM" and "DUS-MQ", along with newly designed evaluation metrics.
arXiv Detail & Related papers (2024-09-04T01:38:37Z)
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
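As a rough illustration of the ColorMAE idea, filtered noise can be thresholded into a binary patch mask. This sketch is an assumption: it substitutes a simple Gaussian low-pass for the paper's specific "color" noise filters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def noise_mask(grid: int = 14, mask_ratio: float = 0.75,
               sigma: float = 1.5, seed: int = 0) -> np.ndarray:
    """Data-independent binary patch mask from filtered random noise.
    True = patch is masked. gaussian_filter stands in for the paper's
    noise-filtering step."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((grid, grid))
    filtered = gaussian_filter(noise, sigma=sigma)
    k = int(mask_ratio * grid * grid)           # number of masked patches
    thresh = np.partition(filtered.ravel(), -k)[-k]
    return filtered >= thresh
```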
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- Rethinking Remote Sensing Change Detection With A Mask View [6.3921187411592655]
Remote sensing change detection aims to compare two or more images of the same area taken at different time stamps to assess changes in geographical entities and environmental factors.
To address the shortcomings of existing approaches, this paper rethinks change detection from a mask view and proposes two components: 1) the meta-architecture CDMask and 2) the instance network CDMaskFormer.
arXiv Detail & Related papers (2024-06-21T17:27:58Z)
- MaskCD: A Remote Sensing Change Detection Network Based on Mask Classification [29.15203530375882]
Change detection (CD) from remote sensing (RS) images using deep learning has been widely investigated in the literature.
We propose MaskCD to detect changed areas by adaptively generating categorized masks from input image pairs.
It reconstructs the desired changed objects by decoding the pixel-wise representations into learnable mask proposals.
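The mask-classification decoding the summary refers to follows the familiar MaskFormer pattern: each learnable query proposal becomes a mask via a dot product with per-pixel embeddings. A generic sketch (shapes and names are assumptions, not MaskCD's actual API):

```python
import torch

def decode_masks(queries: torch.Tensor, pixel_emb: torch.Tensor) -> torch.Tensor:
    """queries: (N, D) learnable mask proposals; pixel_emb: (D, H, W)
    per-pixel features. Returns (N, H, W) soft masks, one per proposal."""
    return torch.einsum('nd,dhw->nhw', queries, pixel_emb).sigmoid()
```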
arXiv Detail & Related papers (2024-04-18T11:05:15Z)
- Variance-insensitive and Target-preserving Mask Refinement for Interactive Image Segmentation [68.16510297109872]
Point-based interactive image segmentation can ease the burden of mask annotation in applications such as semantic segmentation and image editing.
We introduce a novel method, Variance-Insensitive and Target-Preserving Mask Refinement to enhance segmentation quality with fewer user inputs.
Experiments on GrabCut, Berkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art performance in interactive image segmentation.
arXiv Detail & Related papers (2023-12-22T02:31:31Z)
- M2T: Masking Transformers Twice for Faster Decoding [39.6722311745861]
We show how bidirectional transformers trained for masked token prediction can be applied to neural image compression.
We demonstrate that predefined, deterministic schedules perform as well or better for image compression.
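A predefined, deterministic unmasking schedule of the kind the summary mentions can be as simple as a fixed cosine plan for how many tokens to reveal per decoding step. This is a generic MaskGIT-style sketch under that assumption; M2T's exact schedule is not specified here.

```python
import math

def tokens_per_step(num_tokens: int, steps: int) -> list[int]:
    """Cosine schedule: reveal few tokens early, more later. The plan is
    fixed in advance, with no confidence-based selection at decode time."""
    revealed, plan = 0, []
    for s in range(1, steps + 1):
        target = round(num_tokens * (1.0 - math.cos(math.pi * s / (2 * steps))))
        plan.append(target - revealed)
        revealed = target
    return plan

assert sum(tokens_per_step(256, 8)) == 256  # all tokens revealed by the last step
```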
arXiv Detail & Related papers (2023-04-14T14:25:44Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
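The Gumbel-Softmax link named in the summary is what lets mask-sampling gradients flow back into an adversarially trained generator. A minimal sketch of differentiable per-patch mask sampling (the surrounding generator and losses are omitted; details are assumptions):

```python
import torch
import torch.nn.functional as F

def sample_mask(patch_logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """patch_logits: (N, 2) keep/mask scores per patch. Returns an (N,)
    straight-through one-hot mask (1.0 = masked) that stays differentiable
    with respect to the logits."""
    onehot = F.gumbel_softmax(patch_logits, tau=tau, hard=True)
    return onehot[:, 1]
```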
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- Masked Frequency Modeling for Self-Supervised Visual Pre-Training [102.89756957704138]
We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models.
MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum.
For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token.
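The masking step MFM's summary describes operates on the 2D spectrum rather than on pixels. A minimal PyTorch sketch of the low-frequency variant (the radius, shapes, and function name are illustrative assumptions):

```python
import torch

def mask_low_frequencies(img: torch.Tensor, radius: float = 0.1) -> torch.Tensor:
    """Zero out spectral components within `radius` (normalized frequency)
    of the origin and return the corrupted image; a model would then be
    trained to predict the missing frequencies."""
    spec = torch.fft.fftshift(torch.fft.fft2(img))   # centered (H, W) spectrum
    h, w = img.shape[-2:]
    fy = torch.linspace(-0.5, 0.5, h).unsqueeze(1)   # (H, 1) vertical freqs
    fx = torch.linspace(-0.5, 0.5, w).unsqueeze(0)   # (1, W) horizontal freqs
    keep = ((fy ** 2 + fx ** 2).sqrt() > radius).to(spec.dtype)
    return torch.fft.ifft2(torch.fft.ifftshift(spec * keep)).real

corrupted = mask_low_frequencies(torch.rand(64, 64))
```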
arXiv Detail & Related papers (2022-06-15T17:58:30Z)
- SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation [149.242230059447]
We propose a fast single-stage instance segmentation method called SipMask.
It preserves instance-specific spatial information by separating the mask prediction of an instance into different sub-regions of the detected bounding box.
In terms of real-time capabilities, SipMask outperforms YOLACT with an absolute gain of 3.0% (mask AP) under similar settings.
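The sub-region idea can be sketched as predicting a separate set of mask coefficients per quadrant of a detected box instead of one set for the whole box. This is a hedged sketch; the layer sizes and names are assumptions, not SipMask's actual head.

```python
import torch
import torch.nn as nn

class SubRegionMaskCoeffs(nn.Module):
    """Predict `regions` separate coefficient sets per spatial location, so
    each sub-region of a box combines mask bases with its own weights and
    keeps its own spatial information."""
    def __init__(self, in_ch: int, num_bases: int = 32, regions: int = 4):
        super().__init__()
        self.regions, self.num_bases = regions, num_bases
        self.coef = nn.Conv2d(in_ch, regions * num_bases, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, _, h, w = feat.shape   # detection-head features (B, C, H, W)
        return self.coef(feat).view(b, self.regions, self.num_bases, h, w)
```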
arXiv Detail & Related papers (2020-07-29T12:21:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.