Enhancing Speech Recognition Decoding via Layer Aggregation
- URL: http://arxiv.org/abs/2203.11325v1
- Date: Mon, 21 Mar 2022 20:28:06 GMT
- Title: Enhancing Speech Recognition Decoding via Layer Aggregation
- Authors: Tomer Wullach, Shlomo E. Chazan
- Abstract summary: We show that logits predicted using the top layers may prevent beam search from achieving optimal results.
We propose a prediction method that aggregates the top M layers, potentially leveraging useful information encoded in intermediate layers and relaxing model confidence.
- Score: 7.056222499095849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently proposed speech recognition systems are designed to predict using
representations generated by their top layers, employing greedy decoding which
isolates each timestep from the rest of the sequence. Aiming for improved
performance, a beam search algorithm is frequently utilized and a language
model is incorporated to assist with ranking the top candidates. In this work,
we experiment with several speech recognition models and find that logits
predicted using the top layers may prevent beam search from achieving optimal
results. Specifically, we show that fine-tuned Wav2Vec 2.0 and HuBERT yield
highly confident predictions, and hypothesize that the predictions are based on
local information and may not take full advantage of the information encoded in
intermediate layers. To this end, we perform a layer analysis to reveal and
visualize how predictions evolve throughout the inference flow. We then propose
a prediction method that aggregates the top M layers, potentially leveraging
useful information encoded in intermediate layers and relaxing model
confidence. We showcase the effectiveness of our approach via beam search
decoding, conducting our experiments on the LibriSpeech test and dev sets and
achieving WER and CER reductions of up to 10% and 22%, respectively.
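To make the aggregation step concrete, here is a minimal sketch in PyTorch with Hugging Face Transformers. It applies the fine-tuned model's CTC head to each of the top M hidden states and averages the per-layer logits. Plain averaging, the checkpoint name, and M = 4 are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative sketch: aggregate logits from the top M transformer layers
# of a fine-tuned Wav2Vec 2.0 CTC model instead of using the final layer alone.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

M = 4  # number of top layers to aggregate (assumed hyperparameter)

@torch.no_grad()
def aggregated_logits(input_values: torch.Tensor) -> torch.Tensor:
    outputs = model(input_values, output_hidden_states=True)
    # hidden_states is a tuple of (num_layers + 1) tensors of shape
    # (batch, time, hidden); the last one normally feeds the CTC head.
    top_m = outputs.hidden_states[-M:]
    # Apply the shared CTC projection (lm_head) to each selected layer
    # and average, softening the peaked final-layer distribution.
    return torch.stack([model.lm_head(h) for h in top_m]).mean(dim=0)
```

The averaged logits would then replace the final-layer logits as input to a beam search decoder (for example, a CTC beam search with an external language model), which is the setting in which the reported WER and CER gains are measured.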
Related papers
- Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models [55.45444773200529]
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination.
Recent work has focused on decoding techniques to improve factuality during inference.
arXiv Detail & Related papers (2024-04-14T19:45:35Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- Don't Be So Sure! Boosting ASR Decoding via Confidence Relaxation [7.056222499095849]
Beam search seeks the transcript with the greatest likelihood computed using the predicted distribution.
We show that recently proposed Self-Supervised Learning (SSL)-based ASR models tend to yield exceptionally confident predictions.
We propose a decoding procedure that improves the performance of fine-tuned ASR models.
arXiv Detail & Related papers (2022-12-27T06:42:26Z)
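Since this follow-up paper targets the same overconfidence problem, a toy way to see the idea is temperature scaling, which raises the entropy of a peaked CTC distribution before beam search. The scaling form and the temperature value below are assumptions for illustration, not the paper's actual procedure.

```python
import torch
import torch.nn.functional as F

def relax_confidence(logits: torch.Tensor, temperature: float = 2.0) -> torch.Tensor:
    # Temperature scaling with T > 1 flattens a peaked distribution,
    # giving beam search a wider set of plausible candidates per timestep.
    # Illustrative only; the paper's relaxation procedure may differ.
    return F.log_softmax(logits / temperature, dim=-1)

# Entropy check: the relaxed distribution is strictly less confident.
logits = torch.tensor([[8.0, 1.0, 0.5, 0.1]])
for t in (1.0, 2.0):
    p = F.softmax(logits / t, dim=-1)
    print(t, -(p * p.log()).sum().item())  # entropy grows with temperature
```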
- Exploring and Exploiting Multi-Granularity Representations for Machine Reading Comprehension [13.191437539419681]
We propose a novel approach called the Adaptive Bidirectional Attention-Capsule Network (ABA-Net).
ABA-Net adaptively feeds source representations of different levels to the predictor.
We set new state-of-the-art performance on the SQuAD 1.0 dataset.
arXiv Detail & Related papers (2022-08-18T10:14:32Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method significantly outperforms HuBERT.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
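The mechanism summarized above, an extra loss applied to intermediate layers, can be sketched generically. The layer indices, weighting, and shared head below are assumptions for illustration, not ILS-SSL's actual configuration.

```python
import torch
import torch.nn as nn

def intermediate_supervision_loss(hidden_states, head: nn.Module,
                                  targets: torch.Tensor,
                                  criterion: nn.Module,
                                  layers=(4, 8), weight: float = 0.3):
    # Generic sketch: apply the same training loss to selected intermediate
    # layers in addition to the top layer, pushing earlier layers toward
    # content information. Layer indices and weight are assumed values.
    main_loss = criterion(head(hidden_states[-1]), targets)
    aux_loss = sum(criterion(head(hidden_states[i]), targets) for i in layers)
    return main_loss + weight * aux_loss
```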
- Representation Learning for Sequence Data with Deep Autoencoding Predictive Components [96.42805872177067]
We propose a self-supervised representation learning method for sequence data, based on the intuition that useful representations of sequence data should exhibit a simple structure in the latent space.
We encourage this latent structure by maximizing an estimate of predictive information of latent feature sequences, which is the mutual information between past and future windows at each time step.
We demonstrate that our method recovers the latent space of noisy dynamical systems, extracts predictive features for forecasting tasks, and improves automatic speech recognition when used to pretrain the encoder on large amounts of unlabeled data.
arXiv Detail & Related papers (2020-10-07T03:34:01Z)
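The predictive-information objective described above has a closed form under a Gaussian assumption: I(past; future) = 0.5 * (log det Σ_past + log det Σ_future - log det Σ_joint). The sketch below uses that standard estimator to illustrate the quantity being maximized; it is not necessarily the paper's exact estimator.

```python
import torch

def predictive_information(z: torch.Tensor, window: int, eps: float = 1e-4) -> torch.Tensor:
    # Gaussian estimate of the mutual information between past and future
    # windows of a latent sequence z of shape (T, d). Sketch only; assumes
    # jointly Gaussian windows and a small diagonal regularizer eps.
    T, d = z.shape
    pairs = torch.stack([
        torch.cat([z[t - window:t].reshape(-1), z[t:t + window].reshape(-1)])
        for t in range(window, T - window + 1)
    ])                                        # (N, 2 * window * d)
    x = pairs - pairs.mean(dim=0, keepdim=True)
    cov = x.T @ x / (x.shape[0] - 1) + eps * torch.eye(2 * window * d)
    k = window * d
    return 0.5 * (torch.logdet(cov[:k, :k]) + torch.logdet(cov[k:, k:])
                  - torch.logdet(cov))
```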
- Improved Speech Representations with Multi-Target Autoregressive Predictive Coding [23.424410568555547]
We extend the hypothesis that hidden states that can accurately predict future frames are a useful representation for many downstream tasks.
We propose an auxiliary objective that serves as a regularization to improve generalization of the future frame prediction task.
arXiv Detail & Related papers (2020-04-11T01:09:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.