InterFormer: Interactive Local and Global Features Fusion for Automatic
Speech Recognition
- URL: http://arxiv.org/abs/2305.16342v2
- Date: Mon, 29 May 2023 11:28:27 GMT
- Title: InterFormer: Interactive Local and Global Features Fusion for Automatic
Speech Recognition
- Authors: Zhi-Hao Lai, Tian-Hao Zhang, Qi Liu, Xinyuan Qian, Li-Fang Wei,
Song-Lu Chen, Feng Chen, Xu-Cheng Yin
- Abstract summary: Local and global features are essential for automatic speech recognition (ASR).
This paper proposes InterFormer for interactive local and global features fusion to learn a better representation for ASR.
- Score: 30.242747907746132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The local and global features are both essential for automatic speech
recognition (ASR). Many recent methods have verified that simply combining
local and global features can further promote ASR performance. However, these
methods pay less attention to the interaction of local and global features, and
their serial architectures are too rigid to reflect local and global relationships.
To address these issues, this paper proposes InterFormer for interactive local
and global features fusion to learn a better representation for ASR.
Specifically, we combine the convolution block with the transformer block in a
parallel design. Besides, we propose a bidirectional feature interaction module
(BFIM) and a selective fusion module (SFM) to implement the interaction and
fusion of local and global features, respectively. Extensive experiments on
public ASR datasets demonstrate the effectiveness of our proposed InterFormer
and its superior performance over the other Transformer and Conformer models.
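The parallel design described in the abstract can be illustrated with a toy sketch. The abstract does not give the internals of BFIM or SFM, so everything below is an assumption made for illustration: the function names, the fixed smoothing kernel standing in for a convolution block, the projection-free self-attention standing in for a transformer block, the constant cross-branch mixing weight, and the per-channel softmax gate are all hypothetical and are not the authors' implementation.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def local_branch(x, kernel=(0.25, 0.5, 0.25)):
    """Stand-in for a convolution block: a fixed depthwise 1-D filter
    over time with zero padding (hypothetical kernel)."""
    half = len(kernel) // 2
    T, D = len(x), len(x[0])
    out = []
    for t in range(T):
        frame = [0.0] * D
        for k, w in enumerate(kernel):
            s = t + k - half
            if 0 <= s < T:
                for d in range(D):
                    frame[d] += w * x[s][d]
        out.append(frame)
    return out

def global_branch(x):
    """Stand-in for a transformer block: single-head scaled dot-product
    self-attention with no learned projections."""
    T, D = len(x), len(x[0])
    scale = math.sqrt(D)
    out = []
    for t in range(T):
        scores = [sum(x[t][d] * x[s][d] for d in range(D)) / scale
                  for s in range(T)]
        weights = softmax(scores)
        out.append([sum(weights[s] * x[s][d] for s in range(T))
                    for d in range(D)])
    return out

def interact(local, glob, mix=0.5):
    """Assumed bidirectional interaction: each branch adds a scaled copy
    of the other branch's features (mix is a hypothetical constant)."""
    new_local = [[l + mix * g for l, g in zip(lf, gf)]
                 for lf, gf in zip(local, glob)]
    new_glob = [[g + mix * l for l, g in zip(lf, gf)]
                for lf, gf in zip(local, glob)]
    return new_local, new_glob

def selective_fusion(local, glob):
    """Assumed selective fusion: a per-channel softmax gate computed from
    time-averaged features weights the two branches."""
    T, D = len(local), len(local[0])
    mean_l = [sum(f[d] for f in local) / T for d in range(D)]
    mean_g = [sum(f[d] for f in glob) / T for d in range(D)]
    gates = [softmax([mean_l[d], mean_g[d]]) for d in range(D)]
    fused = [[gates[d][0] * local[t][d] + gates[d][1] * glob[t][d]
              for d in range(D)] for t in range(T)]
    return fused, gates

def interformer_like_block(x):
    """Run both branches in parallel, let them interact, then fuse."""
    loc, glo = interact(local_branch(x), global_branch(x))
    return selective_fusion(loc, glo)

# Four frames with two feature channels each (made-up input).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
fused, gates = interformer_like_block(frames)
print(len(fused), len(fused[0]), gates)
```

The point of the sketch is only the data flow: both branches see the same input, exchange information in both directions, and are merged by learned-style weights rather than stacked in series. A real model would use trained convolution, attention, and gating parameters.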
Related papers
- Modality-Collaborative Transformer with Hybrid Feature Reconstruction
for Robust Emotion Recognition [35.15390769958969]
We propose a unified framework, Modality-Collaborative Transformer with Hybrid Feature Reconstruction (MCT-HFR).
MCT-HFR consists of a novel attention-based encoder which concurrently extracts and dynamically balances the intra- and inter-modality relations.
During model training, LFI leverages complete features as supervisory signals to recover local missing features, while GFA is designed to reduce the global semantic gap between pairwise complete and incomplete representations.
arXiv Detail & Related papers (2023-12-26T01:59:23Z)
- A Dual-Stream Recurrence-Attention Network With Global-Local Awareness
for Emotion Recognition in Textual Dialog [41.72374101704424]
We propose a simple and effective Dual-stream Recurrence-Attention Network (DualRAN).
DualRAN eschews the complex components of current methods and focuses on combining recurrence-based methods with attention-based ones.
We show that DualRAN boosts the weighted F1 scores by 1.43% and 0.64% on the IEMOCAP and MELD datasets, respectively.
arXiv Detail & Related papers (2023-07-02T01:25:47Z)
- Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on 3 fine-grained object recognition benchmarks.
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
- Mutual Guidance and Residual Integration for Image Enhancement [43.282397174228116]
We propose a novel mutual guidance network (MGN) to perform effective bidirectional global-local information exchange.
In our design, we adopt a two-branch framework where one branch focuses more on modeling global relations while the other is committed to processing local information.
As a result, both the global and local branches can enjoy the merits of mutual information aggregation.
arXiv Detail & Related papers (2022-11-25T06:12:39Z)
- RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video
Retrieval [66.2075707179047]
We propose a novel mixture-of-expert transformer RoME that disentangles the text and the video into three levels.
We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels.
Our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
arXiv Detail & Related papers (2022-06-26T11:12:49Z)
- Global-and-Local Collaborative Learning for Co-Salient Object Detection [162.62642867056385]
The goal of co-salient object detection (CoSOD) is to discover salient objects that commonly appear in a query group containing two or more relevant images.
We propose a global-and-local collaborative learning architecture, which includes a global correspondence modeling (GCM) and a local correspondence modeling (LCM).
The proposed GLNet is evaluated on three prevailing CoSOD benchmark datasets, demonstrating that our model trained on a small dataset (about 3k images) still outperforms eleven state-of-the-art competitors trained on some large datasets (about 8k-200k images).
arXiv Detail & Related papers (2022-04-19T14:32:41Z)
- Masked Transformer for Neighbourhood-aware Click-Through Rate Prediction [74.52904110197004]
We propose Neighbor-Interaction based CTR prediction, which puts this task into a Heterogeneous Information Network (HIN) setting.
In order to enhance the representation of the local neighbourhood, we consider four types of topological interaction among the nodes.
We conduct comprehensive experiments on two real-world datasets, and the experimental results show that our proposed method significantly outperforms state-of-the-art CTR models.
arXiv Detail & Related papers (2022-01-25T12:44:23Z)
- Conformer: Local Features Coupling Global Representations for Visual
Recognition [72.9550481476101]
We propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning.
Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet.
arXiv Detail & Related papers (2021-05-09T10:00:03Z)
- DF^2AM: Dual-level Feature Fusion and Affinity Modeling for RGB-Infrared
Cross-modality Person Re-identification [18.152310122348393]
RGB-infrared person re-identification is a challenging task due to the intra-class variations and cross-modality discrepancy.
We propose a Dual-level (i.e., local and global) Feature Fusion (DF^2) module that learns attention for discriminative features in a local-to-global manner.
To further mine the relationships between global features from person images, we propose an Affinities Modeling (AM) module.
arXiv Detail & Related papers (2021-04-01T03:12:56Z)
- Global Context-Aware Progressive Aggregation Network for Salient Object
Detection [117.943116761278]
We propose a novel network named GCPANet to integrate low-level appearance features, high-level semantic features, and global context features.
We show that the proposed approach outperforms the state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2020-03-02T04:26:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.