LegoNN: Building Modular Encoder-Decoder Models
- URL: http://arxiv.org/abs/2206.03318v2
- Date: Tue, 11 Jul 2023 17:43:57 GMT
- Title: LegoNN: Building Modular Encoder-Decoder Models
- Authors: Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji
Watanabe, Florian Metze, Luke Zettlemoyer, and Abdelrahman Mohamed
- Abstract summary: State-of-the-art encoder-decoder models are constructed and trained end-to-end as an atomic unit.
No component of the model can be (re-)used without the others, making it impossible to share parts.
We describe LegoNN, a procedure for building encoder-decoder architectures so that their parts can be applied to other tasks without fine-tuning.
- Score: 117.47858131603112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art encoder-decoder models (e.g. for machine translation (MT) or
automatic speech recognition (ASR)) are constructed and trained end-to-end as
an atomic unit. No component of the model can be (re-)used without the others,
making it impossible to share parts, e.g. a high-resource decoder, across
tasks. We describe LegoNN, a procedure for building encoder-decoder
architectures so that their parts can be applied to other tasks without
any fine-tuning. To achieve this reusability, the interface
between encoder and decoder modules is grounded to a sequence of marginal
distributions over a pre-defined discrete vocabulary. We present two approaches
for ingesting these marginals; one is differentiable, allowing the flow of
gradients across the entire network, and the other is gradient-isolating. To
enable the portability of decoder modules between MT tasks for different source
languages and across other tasks like ASR, we introduce a modality-agnostic
encoder with a length-control mechanism that dynamically adapts the encoder's
output length to match the expected input length range of
pre-trained decoders. We present several experiments to demonstrate the
effectiveness of LegoNN models: a trained language generation LegoNN decoder
module from a German-English (De-En) MT task can be reused without any
fine-tuning for the Europarl English ASR and the Romanian-English (Ro-En) MT
tasks, matching or beating the baselines' performance. After fine-tuning,
LegoNN models improve the Ro-En MT task by 1.5 BLEU points and achieve 12.5%
relative WER reduction on the Europarl ASR task. To show how the approach
generalizes, we compose a LegoNN ASR model from three modules -- each learned
within a different end-to-end trained model on a different dataset
-- achieving an overall WER reduction of 19.5%.
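The key idea in the abstract is that the encoder-decoder interface is a sequence of marginal distributions over a shared discrete vocabulary, which the decoder can ingest either differentiably or with gradients cut. The sketch below is a minimal, hypothetical PyTorch illustration of that interface, not the authors' released code; the names (ToyEncoder, MarginalIngestor, feat_dim, vocab_size) are invented for the example, and the length-control mechanism described in the abstract is omitted for brevity.

```python
# Hedged sketch of a LegoNN-style modular interface (assumed, simplified):
# the encoder emits per-position marginals over a fixed vocabulary, and the
# decoder side consumes an expected embedding of those marginals.
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Maps input features to marginal distributions over a shared vocabulary."""

    def __init__(self, feat_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) -> marginals: (batch, time, vocab_size)
        return torch.softmax(self.proj(self.net(x)), dim=-1)


class MarginalIngestor(nn.Module):
    """Turns vocabulary marginals into decoder input embeddings.

    differentiable=True lets gradients flow from the decoder back into the
    encoder; differentiable=False detaches the marginals (gradient-isolating),
    so modules can be trained separately and swapped without fine-tuning.
    """

    def __init__(self, vocab_size: int, embed_dim: int, differentiable: bool = True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.differentiable = differentiable

    def forward(self, marginals: torch.Tensor) -> torch.Tensor:
        if not self.differentiable:
            marginals = marginals.detach()
        # Expected embedding under the marginal distribution at each position.
        return marginals @ self.embed.weight


if __name__ == "__main__":
    enc = ToyEncoder(feat_dim=80, hidden_dim=256, vocab_size=1000)
    ingest = MarginalIngestor(vocab_size=1000, embed_dim=256, differentiable=False)
    feats = torch.randn(2, 50, 80)      # e.g. 50 frames of 80-dim features
    marginals = enc(feats)              # (2, 50, 1000), rows sum to 1
    dec_inputs = ingest(marginals)      # (2, 50, 256), fed to any compatible decoder
    print(dec_inputs.shape)
```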
Related papers
- 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders [53.297697898510194]
We propose a joint modeling scheme where four decoders share the same encoder -- we refer to this as 4D modeling.
To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning.
In addition, we propose three novel one-pass beam search algorithms by combining three decoders.
arXiv Detail & Related papers (2024-06-05T05:18:20Z) - Low-resource speech recognition and dialect identification of Irish in a multi-task framework [7.981589711420179]
This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (Inter CTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID)
Results are compared to the current best performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN)
arXiv Detail & Related papers (2024-05-02T13:54:39Z) - U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF [10.81723269312202]
Mixture-of-Experts (MoE) has been proposed as an energy-efficient path to larger and more capable language models.
We benchmark our proposed model on a large scale inner-source dataset (160k hours)
arXiv Detail & Related papers (2024-04-25T08:34:21Z) - Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z) - Deformable Mixer Transformer with Gating for Multi-Task Learning of
Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL)
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z) - CodeT5+: Open Code Large Language Models for Code Understanding and
Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z) - Lego-Features: Exporting modular encoder features for streaming and
deliberation ASR [34.23347991756358]
We build on work that has begun to explore building encoders with modular encoded representations.
Our framework builds on top of existing encoded representations, converting them to modular features, dubbed as Lego-Features.
Though sparse, we show that the Lego-Features are powerful when tested with RNN-T or LAS decoders.
arXiv Detail & Related papers (2023-03-31T23:33:21Z) - Improving Zero-shot Neural Machine Translation on Language-specific
Encoders-Decoders [19.44855809470709]
Recently, universal neural machine translation (NMT) with shared encoder-decoder gained good performance on zero-shot translation.
Unlike universal NMT, jointly trained language-specific encoders-decoders aim to achieve universal representation across non-shared modules.
We study zero-shot translation using language-specific encoders-decoders.
arXiv Detail & Related papers (2021-02-12T15:36:33Z) - Dual-decoder Transformer for Joint Automatic Speech Recognition and
Multilingual Speech Translation [71.54816893482457]
We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST)
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST)
arXiv Detail & Related papers (2020-11-02T04:59:50Z) - Encoder-Decoder Based Convolutional Neural Networks with
Multi-Scale-Aware Modules for Crowd Counting [6.893512627479196]
We propose two modified neural networks for accurate and efficient crowd counting.
The first model is named M-SFANet, which is attached with atrous spatial pyramid pooling (ASPP) and context-aware module (CAN)
The second model is called M-SegNet, produced by replacing the bilinear upsampling in SFANet with the max unpooling used in SegNet.
arXiv Detail & Related papers (2020-03-12T03:00:26Z)