Related papers: InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition

InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition

URL: http://arxiv.org/abs/2305.16342v2
Date: Mon, 29 May 2023 11:28:27 GMT
Title: InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition
Authors: Zhi-Hao Lai, Tian-Hao Zhang, Qi Liu, Xinyuan Qian, Li-Fang Wei, Song-Lu Chen, Feng Chen, Xu-Cheng Yin
Abstract summary: Local and global features are essential for automatic speech recognition (ASR) This paper proposes InterFormer for interactive local and global features fusion to learn a better representation for ASR.
Score: 30.242747907746132
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The local and global features are both essential for automatic speech recognition (ASR). Many recent methods have verified that simply combining local and global features can further promote ASR performance. However, these methods pay less attention to the interaction of local and global features, and their series architectures are rigid to reflect local and global relationships. To address these issues, this paper proposes InterFormer for interactive local and global features fusion to learn a better representation for ASR. Specifically, we combine the convolution block with the transformer block in a parallel design. Besides, we propose a bidirectional feature interaction module (BFIM) and a selective fusion module (SFM) to implement the interaction and fusion of local and global features, respectively. Extensive experiments on public ASR datasets demonstrate the effectiveness of our proposed InterFormer and its superior performance over the other Transformer and Conformer models.

Related papers

Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID [82.12123628480371]
Unsupervised person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design contrastive learning framework for global feature learning. We propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds up objective for specific fine-grained patterns emphasized by each modality.
arXiv Detail & Related papers (2025-04-27T13:58:12Z)
MambaMIC: An Efficient Baseline for Microscopic Image Classification with State Space Models [12.182070604073585]
We propose a vision backbone for Microscopic Image Classification (MIC) tasks, named MambaMIC. Specifically, we introduce a Local-Global dual-branch aggregation module: the MambaMIC Block. In the local branch, we use local convolutions to capture pixel similarity, mitigating local pixel forgetting and enhancing perception. In the global branch, SSM extracts global dependencies, while Locally Aware Enhanced Filter reduces channel redundancy and local pixel forgetting.
arXiv Detail & Related papers (2024-09-12T10:01:33Z)
Attention-Guided Multi-scale Interaction Network for Face Super-Resolution [46.42710591689621]
CNN and Transformer hybrid networks demonstrated excellent performance in face super-resolution (FSR) tasks. How to fuse multi-scale features and promote their complementarity is crucial for enhancing FSR. Our design allows the free flow of multi-scale features from within modules and between encoder and decoder.
arXiv Detail & Related papers (2024-09-01T02:53:24Z)
Modality-Collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition [35.15390769958969]
We propose a unified framework, Modality-Collaborative Transformer with Hybrid Feature Reconstruction (MCT-HFR) MCT-HFR consists of a novel attention-based encoder which concurrently extracts and dynamically balances the intra- and inter-modality relations. During model training, LFI leverages complete features as supervisory signals to recover local missing features, while GFA is designed to reduce the global semantic gap between pairwise complete and incomplete representations.
arXiv Detail & Related papers (2023-12-26T01:59:23Z)
Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module. Our proposed approach does not rely on additional part branches and reaches state-the-of-art performance on 3-of-the-level object recognition.
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
Mutual Guidance and Residual Integration for Image Enhancement [43.282397174228116]
We propose a novel mutual guidance network (MGN) to perform effective bidirectional global-local information exchange. In our design, we adopt a two-branch framework where one branch focuses more on modeling global relations while the other is committed to processing local information. As a result, both the global and local branches can enjoy the merits of mutual information aggregation.
arXiv Detail & Related papers (2022-11-25T06:12:39Z)
RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval [66.2075707179047]
We propose a novel mixture-of-expert transformer RoME that disentangles the text and the video into three levels. We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels. Our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
arXiv Detail & Related papers (2022-06-26T11:12:49Z)
Global-and-Local Collaborative Learning for Co-Salient Object Detection [162.62642867056385]
The goal of co-salient object detection (CoSOD) is to discover salient objects that commonly appear in a query group containing two or more relevant images. We propose a global-and-local collaborative learning architecture, which includes a global correspondence modeling (GCM) and a local correspondence modeling (LCM) The proposed GLNet is evaluated on three prevailing CoSOD benchmark datasets, demonstrating that our model trained on a small dataset (about 3k images) still outperforms eleven state-of-the-art competitors trained on some large datasets (about 8k-200k images)
arXiv Detail & Related papers (2022-04-19T14:32:41Z)
Masked Transformer for Neighhourhood-aware Click-Through Rate Prediction [74.52904110197004]
We propose Neighbor-Interaction based CTR prediction, which put this task into a Heterogeneous Information Network (HIN) setting. In order to enhance the representation of the local neighbourhood, we consider four types of topological interaction among the nodes. We conduct comprehensive experiments on two real world datasets and the experimental results show that our proposed method outperforms state-of-the-art CTR models significantly.
arXiv Detail & Related papers (2022-01-25T12:44:23Z)
Conformer: Local Features Coupling Global Representations for Visual Recognition [72.9550481476101]
We propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Experiments show that Conformer, under the comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet.
arXiv Detail & Related papers (2021-05-09T10:00:03Z)
DF^2AM: Dual-level Feature Fusion and Affinity Modeling for RGB-Infrared Cross-modality Person Re-identification [18.152310122348393]
RGB-infrared person re-identification is a challenging task due to the intra-class variations and cross-modality discrepancy. We propose a Dual-level (i.e., local and global) Feature Fusion (DF2) module by learning attention for discnative feature from local to global manner. To further mining the relationships between global features from person images, we propose an Affinities Modeling (AM) module.
arXiv Detail & Related papers (2021-04-01T03:12:56Z)
Global Context-Aware Progressive Aggregation Network for Salient Object Detection [117.943116761278]
We propose a novel network named GCPANet to integrate low-level appearance features, high-level semantic features, and global context features. We show that the proposed approach outperforms the state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2020-03-02T04:26:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.