Enhancing Speech Recognition Decoding via Layer Aggregation
- URL: http://arxiv.org/abs/2203.11325v1
- Date: Mon, 21 Mar 2022 20:28:06 GMT
- Title: Enhancing Speech Recognition Decoding via Layer Aggregation
- Authors: Tomer Wullach, Shlomo E. Chazan
- Abstract summary: We show that logits predicted using the top layers may prevent beam search from achieving optimal results.
We propose a prediction method that aggregates the top M layers, potentially leveraging useful information encoded in intermediate layers and relaxing model confidence.
- Score: 7.056222499095849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently proposed speech recognition systems are designed to predict using
representations generated by their top layers, employing greedy decoding which
isolates each timestep from the rest of the sequence. Aiming for improved
performance, a beam search algorithm is frequently utilized and a language
model is incorporated to assist with ranking the top candidates. In this work,
we experiment with several speech recognition models and find that logits
predicted using the top layers may prevent beam search from achieving optimal
results. Specifically, we show that fine-tuned Wav2Vec 2.0 and HuBERT yield
highly confident predictions, and hypothesize that the predictions are based on
local information and may not take full advantage of the information encoded in
intermediate layers. To this end, we perform a layer analysis to reveal and
visualize how predictions evolve throughout the inference flow. We then propose
a prediction method that aggregates the top M layers, potentially leveraging
useful information encoded in intermediate layers and relaxing model
confidence. We showcase the effectiveness of our approach via beam search
decoding, conducting our experiments on the LibriSpeech test and dev sets and
achieving WER and CER reductions of up to 10% and 22%, respectively.
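To make the aggregation step concrete, here is a minimal sketch in PyTorch with Hugging Face Transformers. It applies the fine-tuned model's CTC head to each of the top M hidden states and averages the per-layer logits. Plain averaging, the checkpoint name, and M = 4 are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative sketch: aggregate logits from the top M transformer layers
# of a fine-tuned Wav2Vec 2.0 CTC model instead of using the final layer alone.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

M = 4  # number of top layers to aggregate (assumed hyperparameter)

@torch.no_grad()
def aggregated_logits(input_values: torch.Tensor) -> torch.Tensor:
    outputs = model(input_values, output_hidden_states=True)
    # hidden_states is a tuple of (num_layers + 1) tensors of shape
    # (batch, time, hidden); the last one normally feeds the CTC head.
    top_m = outputs.hidden_states[-M:]
    # Apply the shared CTC projection (lm_head) to each selected layer
    # and average, softening the peaked final-layer distribution.
    return torch.stack([model.lm_head(h) for h in top_m]).mean(dim=0)
```

The averaged logits would then replace the final-layer logits as input to a beam search decoder (for example, a CTC beam search with an external language model), which is the setting in which the reported WER and CER gains are measured.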
Related papers
- Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models [55.45444773200529]
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination.
Recent work has focused on decoding techniques to improve factuality during inference.
arXiv Detail & Related papers (2024-04-14T19:45:35Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- Don't Be So Sure! Boosting ASR Decoding via Confidence Relaxation [7.056222499095849]
Beam search seeks the transcript with the greatest likelihood computed using the predicted distribution.
We show that recently proposed Self-Supervised Learning (SSL)-based ASR models tend to yield exceptionally confident predictions.
We propose a decoding procedure that improves the performance of fine-tuned ASR models.
arXiv Detail & Related papers (2022-12-27T06:42:26Z)
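Since this follow-up paper targets the same overconfidence problem, a toy way to see the idea is temperature scaling, which raises the entropy of a peaked CTC distribution before beam search. The scaling form and the temperature value below are assumptions for illustration, not the paper's actual procedure.

```python
import torch
import torch.nn.functional as F

def relax_confidence(logits: torch.Tensor, temperature: float = 2.0) -> torch.Tensor:
    # Temperature scaling with T > 1 flattens a peaked distribution,
    # giving beam search a wider set of plausible candidates per timestep.
    # Illustrative only; the paper's relaxation procedure may differ.
    return F.log_softmax(logits / temperature, dim=-1)

# Entropy check: the relaxed distribution is strictly less confident.
logits = torch.tensor([[8.0, 1.0, 0.5, 0.1]])
for t in (1.0, 2.0):
    p = F.softmax(logits / t, dim=-1)
    print(t, -(p * p.log()).sum().item())  # entropy grows with temperature
```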
- Exploring and Exploiting Multi-Granularity Representations for Machine Reading Comprehension [13.191437539419681]
We propose a novel approach called the Adaptive Bidirectional Attention-Capsule Network (ABA-Net).
ABA-Net adaptively feeds source representations of different levels to the predictor.
We set new state-of-the-art performance on the SQuAD 1.0 dataset.
arXiv Detail & Related papers (2022-08-18T10:14:32Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method significantly outperforms HuBERT.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
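The mechanism summarized above, an extra loss applied to intermediate layers, can be sketched generically. The layer indices, weighting, and shared head below are assumptions for illustration, not ILS-SSL's actual configuration.

```python
import torch
import torch.nn as nn

def intermediate_supervision_loss(hidden_states, head: nn.Module,
                                  targets: torch.Tensor,
                                  criterion: nn.Module,
                                  layers=(4, 8), weight: float = 0.3):
    # Generic sketch: apply the same training loss to selected intermediate
    # layers in addition to the top layer, pushing earlier layers toward
    # content information. Layer indices and weight are assumed values.
    main_loss = criterion(head(hidden_states[-1]), targets)
    aux_loss = sum(criterion(head(hidden_states[i]), targets) for i in layers)
    return main_loss + weight * aux_loss
```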
- Representation Learning for Sequence Data with Deep Autoencoding Predictive Components [96.42805872177067]
We propose a self-supervised representation learning method for sequence data, based on the intuition that useful representations of sequence data should exhibit a simple structure in the latent space.
We encourage this latent structure by maximizing an estimate of predictive information of latent feature sequences, which is the mutual information between past and future windows at each time step.
We demonstrate that our method recovers the latent space of noisy dynamical systems, extracts predictive features for forecasting tasks, and improves automatic speech recognition when used to pretrain the encoder on large amounts of unlabeled data.
arXiv Detail & Related papers (2020-10-07T03:34:01Z)
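The predictive-information objective described above has a closed form under a Gaussian assumption: I(past; future) = 0.5 * (log det Σ_past + log det Σ_future - log det Σ_joint). The sketch below uses that standard estimator to illustrate the quantity being maximized; it is not necessarily the paper's exact estimator.

```python
import torch

def predictive_information(z: torch.Tensor, window: int, eps: float = 1e-4) -> torch.Tensor:
    # Gaussian estimate of the mutual information between past and future
    # windows of a latent sequence z of shape (T, d). Sketch only; assumes
    # jointly Gaussian windows and a small diagonal regularizer eps.
    T, d = z.shape
    pairs = torch.stack([
        torch.cat([z[t - window:t].reshape(-1), z[t:t + window].reshape(-1)])
        for t in range(window, T - window + 1)
    ])                                        # (N, 2 * window * d)
    x = pairs - pairs.mean(dim=0, keepdim=True)
    cov = x.T @ x / (x.shape[0] - 1) + eps * torch.eye(2 * window * d)
    k = window * d
    return 0.5 * (torch.logdet(cov[:k, :k]) + torch.logdet(cov[k:, k:])
                  - torch.logdet(cov))
```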
- Improved Speech Representations with Multi-Target Autoregressive Predictive Coding [23.424410568555547]
We extend the hypothesis that hidden states that can accurately predict future frames are a useful representation for many downstream tasks.
We propose an auxiliary objective that serves as a regularization to improve generalization of the future frame prediction task.
arXiv Detail & Related papers (2020-04-11T01:09:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.