Branchformer: Parallel MLP-Attention Architectures to Capture Local and
Global Context for Speech Recognition and Understanding
- URL: http://arxiv.org/abs/2207.02971v1
- Date: Wed, 6 Jul 2022 21:08:10 GMT
- Title: Branchformer: Parallel MLP-Attention Architectures to Capture Local and
Global Context for Speech Recognition and Understanding
- Authors: Yifan Peng, Siddharth Dalmia, Ian Lane, Shinji Watanabe
- Abstract summary: Conformer has proven to be effective in many speech processing tasks.
Inspired by this, we propose a more flexible, interpretable and customizable encoder alternative, Branchformer.
- Score: 41.928263518867816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conformer has proven to be effective in many speech processing tasks. It
combines the benefits of extracting local dependencies using convolutions and
global dependencies using self-attention. Inspired by this, we propose a more
flexible, interpretable and customizable encoder alternative, Branchformer,
with parallel branches for modeling various ranged dependencies in end-to-end
speech processing. In each encoder layer, one branch employs self-attention or
its variant to capture long-range dependencies, while the other branch utilizes
an MLP module with convolutional gating (cgMLP) to extract local relationships.
We conduct experiments on several speech recognition and spoken language
understanding benchmarks. Results show that our model outperforms both
Transformer and cgMLP. It also matches or outperforms state-of-the-art
results achieved by Conformer. Furthermore, we show various strategies to
reduce computation thanks to the two-branch architecture, including the ability
to have variable inference complexity in a single trained model. The weights
learned for merging branches indicate how local and global dependencies are
utilized in different layers, which benefits model design.
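The abstract describes the architecture only at a high level, so the following is a minimal PyTorch sketch of a Branchformer-style encoder layer rather than the authors' reference implementation: the hidden sizes, normalization placement, kernel size, and the scalar-softmax merge are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ConvolutionalGatingMLP(nn.Module):
    """Local branch: an MLP with convolutional gating (cgMLP-like), as described in the abstract."""

    def __init__(self, d_model: int, d_hidden: int, kernel_size: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.up = nn.Linear(d_model, d_hidden * 2)   # project up, then split into content/gate halves
        self.gate_norm = nn.LayerNorm(d_hidden)
        # depthwise convolution over time supplies the local, convolutional gate
        self.dw_conv = nn.Conv1d(d_hidden, d_hidden, kernel_size,
                                 padding=kernel_size // 2, groups=d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, d_model)
        a, b = self.up(self.norm(x)).chunk(2, dim=-1)
        g = self.dw_conv(self.gate_norm(b).transpose(1, 2)).transpose(1, 2)
        return self.down(a * g)                           # gated linear unit with a convolutional gate


class BranchformerLayer(nn.Module):
    """Parallel branches: self-attention (global) and cgMLP (local), merged by learned weights."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, d_hidden: int = 1024):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cgmlp = ConvolutionalGatingMLP(d_model, d_hidden)
        # learned merge weights: their softmax indicates how much a layer relies on
        # global vs. local context, which is the interpretability claim in the abstract
        self.merge_logits = nn.Parameter(torch.zeros(2))
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.attn_norm(x)
        x_global, _ = self.attn(h, h, h, need_weights=False)
        x_local = self.cgmlp(x)
        w = torch.softmax(self.merge_logits, dim=0)
        return self.out_norm(x + w[0] * x_global + w[1] * x_local)


x = torch.randn(2, 100, 256)           # (batch, frames, features)
print(BranchformerLayer()(x).shape)    # torch.Size([2, 100, 256])
```

Because the branches run in parallel, either one can be skipped at inference time, which is one plausible way to realize the abstract's claim of variable inference complexity within a single trained model.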
Related papers
- Multi-Convformer: Extending Conformer with Multiple Convolution Kernels [64.4442240213399]
We introduce Multi-Convformer, which uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating (a hedged sketch of this idea appears after this list).
Our model rivals existing Conformer variants such as cgMLP and E-Branchformer in performance while being more parameter efficient.
We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.
arXiv Detail & Related papers (2024-07-04T08:08:12Z)
- Conformer LLMs -- Convolution Augmented Large Language Models [2.8935588665357077]
This work combines two popular neural architecture blocks, convolutional layers and Transformers, for large language models (LLMs).
Transformer decoders effectively capture long-range dependencies over several modalities and form a core backbone of modern advancements in machine learning.
This work showcases a robust speech architecture that can be integrated and adapted in a causal setup beyond speech applications for large-scale language modeling.
arXiv Detail & Related papers (2023-07-02T03:05:41Z)
- ALOFT: A Lightweight MLP-like Architecture with Dynamic Low-frequency Transform for Domain Generalization [15.057335610188545]
Domain Generalization (DG) aims to learn a model that generalizes well to unseen target domains using multiple source domains without re-training.
Most existing DG works are based on convolutional neural networks (CNNs).
arXiv Detail & Related papers (2023-03-21T08:36:34Z)
- Generic Dependency Modeling for Multi-Party Conversation [32.25605889407403]
We present an approach to encoding the dependencies in the form of relative dependency encoding (ReDE).
We show how to implement it in Transformers by modifying the computation of self-attention.
arXiv Detail & Related papers (2023-02-21T13:58:19Z)
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
Under the studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
- ReSTR: Convolution-free Referring Image Segmentation Using Transformers [80.9672131755143]
We present the first convolution-free model for referring image segmentation using transformers, dubbed ReSTR.
Since it extracts features of both modalities through transformer encoders, ReSTR can capture long-range dependencies between entities within each modality.
Also, ReSTR fuses features of the two modalities by a self-attention encoder, which enables flexible and adaptive interactions between the two modalities in the fusion process.
arXiv Detail & Related papers (2022-03-31T02:55:39Z)
- RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality [113.1414517605892]
We propose a methodology, Locality Injection, to incorporate local priors into an FC layer.
RepMLPNet is the first MLP model that seamlessly transfers to Cityscapes semantic segmentation.
arXiv Detail & Related papers (2021-12-21T10:28:17Z)
- Multitask Pointer Network for Multi-Representational Parsing [0.34376560669160383]
We propose a transition-based approach that, by training a single model, can efficiently parse any input sentence with both constituent and dependency trees.
We develop a Pointer Network architecture with two separate task-specific decoders and a common encoder, and follow a learning strategy to jointly train them.
arXiv Detail & Related papers (2020-09-21T10:04:07Z)
- Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)
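As referenced in the Multi-Convformer entry above, here is a rough sketch of the multi-kernel gated convolution idea; the kernel sizes, the per-frame softmax gate, and the surrounding module layout are assumptions made for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class MultiKernelGatedConv(nn.Module):
    """Replace a single depthwise conv with several kernel sizes combined by a learned gate."""

    def __init__(self, d_model: int = 256, kernel_sizes=(3, 7, 15, 31)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(d_model, d_model, k, padding=k // 2, groups=d_model)
            for k in kernel_sizes
        )
        # per-frame gate over the kernel branches, computed from the input itself
        self.gate = nn.Linear(d_model, len(kernel_sizes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (batch, time, d_model)
        branches = torch.stack(
            [conv(x.transpose(1, 2)).transpose(1, 2) for conv in self.convs], dim=-1
        )                                                      # (batch, time, d_model, n_kernels)
        weights = torch.softmax(self.gate(x), dim=-1)          # (batch, time, n_kernels)
        return (branches * weights.unsqueeze(2)).sum(dim=-1)   # weighted sum over kernel branches


x = torch.randn(2, 100, 256)
print(MultiKernelGatedConv()(x).shape)  # torch.Size([2, 100, 256])
```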