Lego-Features: Exporting modular encoder features for streaming and
deliberation ASR
- URL: http://arxiv.org/abs/2304.00173v1
- Date: Fri, 31 Mar 2023 23:33:21 GMT
- Title: Lego-Features: Exporting modular encoder features for streaming and
deliberation ASR
- Authors: Rami Botros, Rohit Prabhavalkar, Johan Schalkwyk, Ciprian Chelba, Tara
N. Sainath, Françoise Beaufays
- Abstract summary: We build on work that has begun to explore building encoders with modular encoded representations.
Our framework builds on top of existing encoded representations, converting them to modular features, dubbed Lego-Features.
Though sparse, the Lego-Features remain powerful when tested with RNN-T or LAS decoders.
- Score: 34.23347991756358
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In end-to-end (E2E) speech recognition models, a representational
tight-coupling inevitably emerges between the encoder and the decoder. We build
upon recent work that has begun to explore building encoders with modular
encoded representations, such that encoders and decoders from different models
can be stitched together in a zero-shot manner without further fine-tuning.
While previous research only addresses full-context speech models, we explore
the problem in a streaming setting as well. Our framework builds on top of
existing encoded representations, converting them to modular features, dubbed
Lego-Features, without modifying the pre-trained model. The features remain
interchangeable when the model is retrained with distinct initializations.
Though sparse, the Lego-Features prove powerful when tested with RNN-T or
LAS decoders, maintaining high-quality downstream performance. They
are also rich enough to represent the first-pass prediction during two-pass
deliberation. In this scenario, they outperform the N-best hypotheses, since
they do not need to be supplemented with acoustic features to deliver the best
results. Moreover, generating the Lego-Features does not require beam search or
auto-regressive computation. Overall, they present a modular, powerful and
cheap alternative to the standard encoder output, as well as the N-best
hypotheses.
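The abstract does not spell out how the encoder outputs are made modular, so the following is only a hedged sketch: it assumes the sparse Lego-Features are produced by discretizing frozen encoder frames with a small vector-quantization bottleneck, a common way to obtain sparse, interchangeable representations. All names (LegoFeatureExtractor, num_codes) are illustrative, not from the paper.

    import torch
    import torch.nn as nn

    class LegoFeatureExtractor(nn.Module):
        """Hypothetical sketch: discretize frozen encoder frames into sparse codes."""
        def __init__(self, enc_dim: int = 512, num_codes: int = 256):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, enc_dim)  # learned code vectors

        def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
            # enc_out: (batch, frames, enc_dim) from a frozen, pre-trained encoder.
            # Squared distance of every frame to every code: (B, T, num_codes).
            dists = (enc_out.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
            codes = dists.argmin(dim=-1)                      # (B, T) sparse ids
            quantized = self.codebook(codes)                  # (B, T, enc_dim)
            # Straight-through estimator: downstream decoders see the discrete
            # features, while gradients still flow to whatever produced enc_out.
            return enc_out + (quantized - enc_out).detach()

    # Any compatible decoder (RNN-T, LAS) consumes the quantized frames, so
    # encoders and decoders trained separately can be stitched together.
    lego = LegoFeatureExtractor()
    features = lego(torch.randn(2, 100, 512))   # (2, 100, 512) modular features

Note that generating such features is a single forward pass, consistent with the abstract's point that no beam search or auto-regressive computation is needed.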
Related papers
- Decoder-Only LLMs are Better Controllers for Diffusion Models [63.22040456010123]
We propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from large language models.
Our adapter module is superior to the state-of-the-art models in terms of text-to-image generation quality and reliability.
arXiv Detail & Related papers (2025-02-06T12:17:35Z)
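The entry above says only that an adapter injects LLM semantics into a diffusion model, so here is a hedged sketch of one plausible shape for such an adapter: project the hidden states of a frozen decoder-only LLM into the conditioning space a diffusion backbone cross-attends over. Module names and dimensions are assumptions, not the paper's.

    import torch
    import torch.nn as nn

    class LLMConditionAdapter(nn.Module):
        """Hypothetical adapter: map frozen decoder-only LLM hidden states to
        the text-conditioning vectors a diffusion backbone cross-attends over."""
        def __init__(self, llm_dim: int = 4096, cond_dim: int = 768):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(llm_dim, cond_dim),
                nn.GELU(),
                nn.Linear(cond_dim, cond_dim),
            )

        def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
            # llm_hidden: (batch, tokens, llm_dim) from the LLM's last layer.
            return self.proj(llm_hidden)          # (batch, tokens, cond_dim)

    # Only the adapter is trained; LLM and diffusion backbone stay frozen.
    adapter = LLMConditionAdapter()
    cond = adapter(torch.randn(1, 77, 4096))      # drop-in for text-encoder output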
- Are Decoder-Only Large Language Models the Silver Bullet for Code Search? [32.338318300589776]
This study presents the first systematic exploration of decoder-only large language models for code search.
We evaluate nine state-of-the-art decoder-only models using two fine-tuning methods, two datasets, and three model sizes.
Our findings reveal that fine-tuned CodeGemma significantly outperforms encoder-only models like UniXcoder.
arXiv Detail & Related papers (2024-10-29T17:05:25Z)
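The snippet above reports findings rather than a recipe; as a hedged illustration of how a decoder-only model is typically used for code search, the sketch below pools the last non-padding token of a causal LM into a retrieval embedding (the fine-tuning objective, e.g. contrastive, is omitted). Function names are hypothetical.

    import torch
    import torch.nn.functional as F

    def embed_last_token(hidden: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim) last-layer states of a decoder-only LM.
        # Under causal attention, only the final non-padding token has seen the
        # whole input, so it serves as a pooled embedding for retrieval.
        last_idx = attn_mask.sum(dim=1) - 1                      # (batch,)
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]  # (batch, dim)
        return F.normalize(pooled, dim=-1)

    # Retrieval score between a natural-language query and a code candidate:
    q = embed_last_token(torch.randn(1, 16, 1024), torch.ones(1, 16, dtype=torch.long))
    c = embed_last_token(torch.randn(1, 64, 1024), torch.ones(1, 64, dtype=torch.long))
    score = (q @ c.T).item()   # cosine similarity; higher = better match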
- Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition [42.04873382667665]
We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks.
A special end-of-chunk symbol advances from one chunk to the next chunk, effectively replacing the conventional end-of-sequence symbol.
We find that our model maintains competitive performance compared to the non-streamable variant and generalizes very well to long-form speech.
arXiv Detail & Related papers (2023-09-15T14:36:24Z)
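A hedged sketch of the chunk-advancing mechanism described above: a toy greedy loop in which the decoder consumes one fixed-size encoder chunk at a time and a special end-of-chunk symbol, rather than end-of-sequence, moves it forward. The decoder_step callable and the symbol id are stand-ins, not the paper's implementation.

    import torch

    EOC = 0   # stand-in id for the special end-of-chunk symbol

    def chunked_greedy_decode(enc_chunks, decoder_step, state, max_per_chunk=20):
        """Toy greedy search over pre-defined fixed-size encoder chunks."""
        hyp = []
        for chunk in enc_chunks:          # stream: later chunks not yet visible
            for _ in range(max_per_chunk):
                token, state = decoder_step(chunk, state)
                if token == EOC:          # advance to the next chunk; EOC itself
                    break                 # is never emitted into the hypothesis
                hyp.append(token)
        return hyp                        # decoding ends after the final chunk

    # Minimal usage with a dummy step that emits one label then EOC per chunk:
    def dummy_step(chunk, state):
        return (EOC if state["emitted"] else 5), {"emitted": not state["emitted"]}

    print(chunked_greedy_decode([torch.zeros(8, 4)] * 3, dummy_step,
                                {"emitted": False}))   # -> [5, 5, 5]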
- Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims at generating the target sequence based on the given input source sequence.
Traditionally, most seq2seq tasks are solved with an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, a number of new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z)
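As a hedged sketch of the decoder-only route to seq2seq mentioned above: source and target are concatenated into one left-to-right sequence, and the loss is taken only on target positions, so the same causal stack plays both encoder and decoder. The lm callable is a stand-in for any causal LM returning per-position logits.

    import torch
    import torch.nn.functional as F

    def decoder_only_seq2seq_loss(lm, src_ids, tgt_ids):
        # Read [source ; target] left to right; the source acts as a prefix.
        inp = torch.cat([src_ids, tgt_ids], dim=1)       # (batch, S+T)
        logits = lm(inp)                                 # (batch, S+T, vocab)
        # Position i predicts token i+1, so positions S-1 .. S+T-2 predict the
        # target; only those positions contribute to the loss.
        S = src_ids.size(1)
        pred = logits[:, S - 1 : -1]
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)), tgt_ids.reshape(-1))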
- Inflected Forms Are Redundant in Question Generation Models [27.49894653349779]
We propose an approach to enhance the performance of Question Generation using an encoder-decoder framework.
Firstly, we identify the inflected forms of words in the encoder input and replace them with their root words.
Secondly, we propose to adapt QG as a combination of the following actions in the encoder-decoder framework: generating a question word, copying a word from the source sequence, or generating a word transformation type.
arXiv Detail & Related papers (2023-01-01T13:08:11Z)
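A hedged illustration of the first step described above: replace inflected forms with root words before they reach the encoder, recording a transformation the decoder could later emit. NLTK's WordNetLemmatizer is used purely for illustration (its wordnet data must be downloaded); the paper's own lemmatization procedure is not specified here.

    from nltk.stem import WordNetLemmatizer   # needs nltk 'wordnet' data

    def to_root_forms(tokens):
        # Strip inflection from the encoder input; keep a record so that
        # generation can restore it via a "word transformation type" action.
        lemmatizer = WordNetLemmatizer()
        roots, transforms = [], []
        for tok in tokens:
            root = lemmatizer.lemmatize(tok.lower(), pos="v")
            roots.append(root)
            transforms.append("IDENTITY" if root == tok.lower() else f"{tok}->{root}")
        return roots, transforms

    # "walked" -> root "walk" plus a transformation record for the decoder.
    print(to_root_forms(["She", "walked", "home"]))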
- LegoNN: Building Modular Encoder-Decoder Models [117.47858131603112]
State-of-the-art encoder-decoder models are constructed and trained end-to-end as an atomic unit.
No component of the model can be (re-)used without the others, making it impossible to share parts.
We describe LegoNN, a procedure for building encoder-decoder architectures so that their parts can be applied to other tasks without the need for fine-tuning.
arXiv Detail & Related papers (2022-06-07T14:08:07Z)
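LegoNN's actual interfaces are more involved than this summary conveys, so the following is only a minimal sketch of the stitching idea: encoder and decoder communicate solely through a pre-agreed interface vocabulary distribution, never through private hidden states, which is what lets parts be swapped without fine-tuning. All module names are hypothetical.

    import torch
    import torch.nn as nn

    class InterfaceEncoder(nn.Module):
        """Ends in a distribution over a shared interface vocabulary, so any
        decoder trained against that vocabulary can be plugged in unchanged."""
        def __init__(self, dim=256, iface_vocab=1000):
            super().__init__()
            self.body = nn.GRU(dim, dim, batch_first=True)
            self.to_iface = nn.Linear(dim, iface_vocab)

        def forward(self, x):
            h, _ = self.body(x)
            return self.to_iface(h).softmax(dim=-1)      # (B, T, iface_vocab)

    class InterfaceDecoder(nn.Module):
        """Consumes only the interface distribution, never encoder internals."""
        def __init__(self, iface_vocab=1000, dim=256, out_vocab=32):
            super().__init__()
            self.from_iface = nn.Linear(iface_vocab, dim)
            self.out = nn.Linear(dim, out_vocab)

        def forward(self, iface_dist):
            return self.out(torch.tanh(self.from_iface(iface_dist)))

    # Any (encoder, decoder) pair agreeing on the interface can be stitched:
    enc, dec = InterfaceEncoder(), InterfaceDecoder()
    logits = dec(enc(torch.randn(2, 50, 256)))           # (2, 50, 32)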
- ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We fine-tune a pretrained encoder-decoder model on document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
arXiv Detail & Related papers (2022-04-25T06:26:29Z)
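A hedged sketch of the inference-time idea above: after document-to-query training, each candidate document is scored by the log-likelihood the model assigns to the query, and because the query side is decoder-only, the document-side computation can be prepared offline. The lm callable is a stand-in returning per-position logits.

    import torch
    import torch.nn.functional as F

    def rerank_score(lm, doc_ids, query_ids):
        # Score = log P(query | document) under a document->query generator.
        inp = torch.cat([doc_ids, query_ids], dim=1)      # (1, D+Q)
        logits = lm(inp)                                  # (1, D+Q, vocab)
        D = doc_ids.size(1)
        logp = F.log_softmax(logits[:, D - 1 : -1], dim=-1)
        tok_lp = logp.gather(-1, query_ids.unsqueeze(-1)).squeeze(-1)
        return tok_lp.sum().item()   # higher = document explains query better

    # At serving time, rank documents by rerank_score for the incoming query;
    # per-document prefix states can be cached, which yields the speedup.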
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
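As a hedged sketch of the mechanism in the entry above: the second-pass decoder attends to both the acoustic frames and a bidirectional encoding of the first-pass hypotheses, instead of text hypotheses alone. Dimensions and module layout are illustrative, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class DeliberationAttender(nn.Module):
        """Second pass: attend to acoustics AND encoded first-pass hypotheses."""
        def __init__(self, dim=256):
            super().__init__()
            self.hyp_encoder = nn.LSTM(dim, dim // 2, batch_first=True,
                                       bidirectional=True)   # context from pass 1
            self.attn_acoustic = nn.MultiheadAttention(dim, 4, batch_first=True)
            self.attn_hyp = nn.MultiheadAttention(dim, 4, batch_first=True)

        def forward(self, dec_state, acoustic, first_pass_emb):
            hyp_ctx, _ = self.hyp_encoder(first_pass_emb)
            a, _ = self.attn_acoustic(dec_state, acoustic, acoustic)
            h, _ = self.attn_hyp(dec_state, hyp_ctx, hyp_ctx)
            return a + h   # fused context drives the rescoring LAS decoder

    att = DeliberationAttender()
    ctx = att(torch.randn(1, 10, 256),    # decoder queries
              torch.randn(1, 200, 256),   # acoustic encoder frames
              torch.randn(1, 12, 256))    # embedded first-pass hypothesis tokens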
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences arising from its use.