LegoNN: Building Modular Encoder-Decoder Models
- URL: http://arxiv.org/abs/2206.03318v2
- Date: Tue, 11 Jul 2023 17:43:57 GMT
- Title: LegoNN: Building Modular Encoder-Decoder Models
- Authors: Siddharth Dalmia, Dmytro Okhonko, Mike Lewis, Sergey Edunov, Shinji
Watanabe, Florian Metze, Luke Zettlemoyer, and Abdelrahman Mohamed
- Abstract summary: State-of-the-art encoder-decoder models are constructed and trained end-to-end as an atomic unit.
No component of the model can be (re-)used without the others, making it impossible to share parts.
We describe LegoNN, a procedure for building encoder-decoder architectures so that their parts can be applied to other tasks without fine-tuning.
- Score: 117.47858131603112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art encoder-decoder models (e.g. for machine translation (MT) or
automatic speech recognition (ASR)) are constructed and trained end-to-end as
an atomic unit. No component of the model can be (re-)used without the others,
making it impossible to share parts, e.g. a high-resource decoder, across
tasks. We describe LegoNN, a procedure for building encoder-decoder
architectures so that their parts can be applied to other tasks without
any fine-tuning. To achieve this reusability, the interface
between encoder and decoder modules is grounded to a sequence of marginal
distributions over a pre-defined discrete vocabulary. We present two approaches
for ingesting these marginals; one is differentiable, allowing the flow of
gradients across the entire network, and the other is gradient-isolating. To
enable the portability of decoder modules between MT tasks for different source
languages and across other tasks like ASR, we introduce a modality-agnostic
encoder with a length-control mechanism that dynamically adapts the encoder's
output length to match the expected input length range of
pre-trained decoders. We present several experiments to demonstrate the
effectiveness of LegoNN models: a trained language generation LegoNN decoder
module from a German-English (De-En) MT task can be reused without any
fine-tuning for the Europarl English ASR and the Romanian-English (Ro-En) MT
tasks, matching or beating the baselines' performance. After fine-tuning,
LegoNN models improve the Ro-En MT task by 1.5 BLEU points and achieve 12.5%
relative WER reduction on the Europarl ASR task. To show how the approach
generalizes, we compose a LegoNN ASR model from three modules -- each learned
within a different end-to-end trained model on a different dataset
-- achieving an overall WER reduction of 19.5%.
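The key idea in the abstract is that the encoder-decoder interface is a sequence of marginal distributions over a shared discrete vocabulary, which the decoder can ingest either differentiably or with gradients cut. The sketch below is a minimal, hypothetical PyTorch illustration of that interface, not the authors' released code; the names (ToyEncoder, MarginalIngestor, feat_dim, vocab_size) are invented for the example, and the length-control mechanism described in the abstract is omitted for brevity.

```python
# Hedged sketch of a LegoNN-style modular interface (assumed, simplified):
# the encoder emits per-position marginals over a fixed vocabulary, and the
# decoder side consumes an expected embedding of those marginals.
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Maps input features to marginal distributions over a shared vocabulary."""

    def __init__(self, feat_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) -> marginals: (batch, time, vocab_size)
        return torch.softmax(self.proj(self.net(x)), dim=-1)


class MarginalIngestor(nn.Module):
    """Turns vocabulary marginals into decoder input embeddings.

    differentiable=True lets gradients flow from the decoder back into the
    encoder; differentiable=False detaches the marginals (gradient-isolating),
    so modules can be trained separately and swapped without fine-tuning.
    """

    def __init__(self, vocab_size: int, embed_dim: int, differentiable: bool = True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.differentiable = differentiable

    def forward(self, marginals: torch.Tensor) -> torch.Tensor:
        if not self.differentiable:
            marginals = marginals.detach()
        # Expected embedding under the marginal distribution at each position.
        return marginals @ self.embed.weight


if __name__ == "__main__":
    enc = ToyEncoder(feat_dim=80, hidden_dim=256, vocab_size=1000)
    ingest = MarginalIngestor(vocab_size=1000, embed_dim=256, differentiable=False)
    feats = torch.randn(2, 50, 80)      # e.g. 50 frames of 80-dim features
    marginals = enc(feats)              # (2, 50, 1000), rows sum to 1
    dec_inputs = ingest(marginals)      # (2, 50, 256), fed to any compatible decoder
    print(dec_inputs.shape)
```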
Related papers
- 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders [53.297697898510194]
We propose a joint modeling scheme where four decoders share the same encoder -- we refer to this as 4D modeling.
To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning.
In addition, we propose three novel one-pass beam search algorithms by combining three decoders.
arXiv Detail & Related papers (2024-06-05T05:18:20Z) - Low-resource speech recognition and dialect identification of Irish in a multi-task framework [7.981589711420179]
This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (Inter CTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID)
Results are compared to the current best performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN)
arXiv Detail & Related papers (2024-05-02T13:54:39Z) - U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF [10.81723269312202]
Mixture-of-Experts (MoE) has been proposed as an energy-efficient path to larger and more capable language models.
We benchmark our proposed model on a large scale inner-source dataset (160k hours)
arXiv Detail & Related papers (2024-04-25T08:34:21Z) - Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z) - Deformable Mixer Transformer with Gating for Multi-Task Learning of
Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL)
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z) - CodeT5+: Open Code Large Language Models for Code Understanding and
Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z) - Lego-Features: Exporting modular encoder features for streaming and
deliberation ASR [34.23347991756358]
We build on work that has begun to explore building encoders with modular encoded representations.
Our framework builds on top of existing encoded representations, converting them to modular features, dubbed as Lego-Features.
Though sparse, we show that the Lego-Features are powerful when tested with RNN-T or LAS decoders.
arXiv Detail & Related papers (2023-03-31T23:33:21Z) - Improving Zero-shot Neural Machine Translation on Language-specific
Encoders-Decoders [19.44855809470709]
Recently, universal neural machine translation (NMT) with shared encoder-decoder gained good performance on zero-shot translation.
Unlike universal NMT, jointly trained language-specific encoders-decoders aim to achieve universal representation across non-shared modules.
We study zero-shot translation using language-specific encoders-decoders.
arXiv Detail & Related papers (2021-02-12T15:36:33Z) - Dual-decoder Transformer for Joint Automatic Speech Recognition and
Multilingual Speech Translation [71.54816893482457]
We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST)
Our models are based on the original Transformer architecture but consist of two decoders, each responsible for one task (ASR or ST)
arXiv Detail & Related papers (2020-11-02T04:59:50Z) - Encoder-Decoder Based Convolutional Neural Networks with
Multi-Scale-Aware Modules for Crowd Counting [6.893512627479196]
We propose two modified neural networks for accurate and efficient crowd counting.
The first model is named M-SFANet, which is attached with atrous spatial pyramid pooling (ASPP) and context-aware module (CAN)
The second model is called M-SegNet, produced by replacing the bilinear upsampling in SFANet with the max unpooling used in SegNet.
arXiv Detail & Related papers (2020-03-12T03:00:26Z)