Internal language model estimation through explicit context vector
learning for attention-based encoder-decoder ASR
- URL: http://arxiv.org/abs/2201.11627v1
- Date: Wed, 26 Jan 2022 07:47:27 GMT
- Title: Internal language model estimation through explicit context vector
learning for attention-based encoder-decoder ASR
- Authors: Yufei Liu, Rao Ma, Haihua Xu, Yi He, Zejun Ma, Weibin Zhang
- Abstract summary: We propose two novel approaches to estimate the biased ILM based on Listen-Attend-Spell (LAS) models.
Experiments show that the ILMs estimated by the proposed methods achieve the lowest perplexity.
- Score: 19.233720469733797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An end-to-end (E2E) speech recognition model implicitly learns a biased
internal language model (ILM) during training. To fused an external LM during
inference, the scores produced by the biased ILM need to be estimated and
subtracted. In this paper we propose two novel approaches to estimate the
biased ILM based on Listen-Attend-Spell (LAS) models. The simpler method is to
replace the context vector of the LAS decoder at every time step with a
learnable vector. The other more advanced method is to use a simple
feed-forward network to directly map query vectors to context vectors, making
the generation of the context vectors independent of the LAS encoder. Both the
learnable vector and the mapping network are trained on the transcriptions of
the training data to minimize the perplexity while all the other parameters of
the LAS model is fixed. Experiments show that the ILMs estimated by the
proposed methods achieve the lowest perplexity. In addition, they also
significantly outperform the shallow fusion method and two previously proposed
Internal Language Model Estimation (ILME) approaches on multiple datasets.
Related papers
- A Case Study on Context-Aware Neural Machine Translation with Multi-Task Learning [49.62044186504516]
In document-level neural machine translation (DocNMT), multi-encoder approaches are common in encoding context and source sentences.
Recent studies have shown that the context encoder generates noise and makes the model robust to the choice of context.
This paper further investigates this observation by explicitly modelling context encoding through multi-task learning (MTL) to make the model sensitive to the choice of context.
arXiv Detail & Related papers (2024-07-03T12:50:49Z) - Effective internal language model training and fusion for factorized transducer model [26.371223360905557]
Internal language model (ILM) of the neural transducer has been widely studied.
We propose a novel ILM training and decoding strategy for factorized transducer models.
arXiv Detail & Related papers (2024-04-02T08:01:05Z) - Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of generative large language models (LLM) generated context information.
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z) - Prompt Optimization via Adversarial In-Context Learning [51.18075178593142]
adv-ICL is implemented as a two-player game between a generator and a discriminator.
The generator tries to generate realistic enough output to fool the discriminator.
We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques.
arXiv Detail & Related papers (2023-12-05T09:44:45Z) - LLM-augmented Preference Learning from Natural Language [19.700169351688768]
Large Language Models (LLMs) are equipped to deal with larger context lengths.
LLMs can consistently outperform the SotA when the target text is large.
Few-shot learning yields better performance than zero-shot learning.
arXiv Detail & Related papers (2023-10-12T17:17:27Z) - Evaluating and Explaining Large Language Models for Code Using Syntactic
Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z) - Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z) - Extracting Latent Steering Vectors from Pretrained Language Models [14.77762401765532]
We show that latent vectors can be extracted directly from language model decoders without fine-tuning.
Experiments show that there exist steering vectors, which, when added to the hidden states of the language model, generate a target sentence nearly perfectly.
We find that distances between steering vectors reflect sentence similarity when evaluated on a textual similarity benchmark.
arXiv Detail & Related papers (2022-05-10T19:04:37Z) - Investigating Methods to Improve Language Model Integration for
Attention-based Encoder-Decoder ASR Models [107.86965028729517]
Attention-based encoder-decoder (AED) models learn an implicit internal language model (ILM) from the training transcriptions.
We propose several novel methods to estimate the ILM directly from the AED model.
arXiv Detail & Related papers (2021-04-12T15:16:03Z) - A Convolutional Deep Markov Model for Unsupervised Speech Representation
Learning [32.59760685342343]
Probabilistic Latent Variable Models provide an alternative to self-supervised learning approaches for linguistic representation learning from speech.
In this work, we propose ConvDMM, a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks.
When trained on a large scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extracting methods.
arXiv Detail & Related papers (2020-06-03T21:50:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.