RCMHA: Relative Convolutional Multi-Head Attention for Natural Language Modelling
- URL: http://arxiv.org/abs/2308.03429v1
- Date: Mon, 7 Aug 2023 09:24:24 GMT
- Title: RCMHA: Relative Convolutional Multi-Head Attention for Natural Language Modelling
- Authors: Herman Sugiharto, Aradea, Husni Mubarok
- Abstract summary: The proposed RCMHA attains superior accuracy, scoring 0.572 against alternative attention modules such as MHA, MDHA, and RMHA.
RCMHA is also the most frugal in memory, averaging 2.98 GB versus the 3.5 GB required by RMHA.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Attention module finds common usage in language modeling, presenting
distinct challenges within the broader scope of Natural Language Processing.
Multi-Head Attention (MHA) employs an absolute positional encoding, which
imposes limitations on token length and entails substantial memory consumption
during the processing of embedded inputs. The current remedy proposed by
researchers involves the utilization of relative positional encoding, similar
to the approach adopted in Transformer-XL or Relative Multi-Head Attention
(RMHA), albeit the employed architecture consumes considerable memory
resources. To address these challenges, this study endeavors to refine MHA,
leveraging relative positional encoding in conjunction with the Depth-Wise
Convolutional Layer architecture, which promises heightened accuracy coupled
with minimized memory usage. The proposed RCMHA framework entails the
modification of two integral components: firstly, the application of the
Depth-Wise Convolutional Layer to the input embedding, encompassing Query, Key,
and Value parameters; secondly, the incorporation of Relative Positional
Encoding into the attention scoring phase, harmoniously integrated with Scaled
Dot-Product Attention. Empirical experiments underscore the advantages of
RCMHA, wherein it exhibits superior accuracy, boasting a score of 0.572 in
comparison to alternative attention modules such as MHA, Multi-DConv-Head
Attention (MDHA), and RMHA. Concerning memory utilization, RCMHA also emerges
as the most frugal, demonstrating an average consumption of 2.98 GB, compared
with RMHA, which necessitates 3.5 GB.
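To make the described design concrete, the sketch below combines the two modifications the abstract names: a depth-wise convolutional layer applied to the projected query, key, and value embeddings, and a relative positional term added to the scaled dot-product attention scores. This is a minimal illustration, not the authors' implementation: the class name RCMHASketch, the default dimensions and kernel size, and the clipped learned relative-bias form of the positional encoding are assumptions made for the example.

```python
# Hypothetical sketch of the RCMHA idea; hyperparameters and the exact
# relative-encoding variant are assumptions, not taken from the paper's code.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class RCMHASketch(nn.Module):
    """Multi-head attention with (1) a depth-wise convolution over the
    projected query/key/value embeddings and (2) a learned relative
    positional bias added to the scaled dot-product scores."""

    def __init__(self, d_model=256, n_heads=4, kernel_size=3, max_rel_dist=128):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.max_rel_dist = max_rel_dist
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Depth-wise convolution: groups == channels, one filter per channel.
        self.dw_conv = nn.Conv1d(3 * d_model, 3 * d_model, kernel_size,
                                 padding=kernel_size // 2, groups=3 * d_model)
        # One learnable bias per (head, clipped relative distance) pair.
        self.rel_bias = nn.Parameter(torch.zeros(n_heads, 2 * max_rel_dist - 1))
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        qkv = self.qkv(x)                       # (b, t, 3*d_model)
        # Depth-wise conv mixes each channel locally along the sequence axis.
        qkv = self.dw_conv(qkv.transpose(1, 2)).transpose(1, 2)
        q, k, v = qkv.chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # Relative distances i - j, clipped to the learned range, index the bias.
        pos = torch.arange(t, device=x.device)
        rel = (pos[:, None] - pos[None, :]).clamp(
            -self.max_rel_dist + 1, self.max_rel_dist - 1)
        scores = scores + self.rel_bias[:, rel + self.max_rel_dist - 1]
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)


if __name__ == "__main__":
    layer = RCMHASketch()
    y = layer(torch.randn(2, 64, 256))
    print(y.shape)                              # torch.Size([2, 64, 256])
```

Because the depth-wise convolution uses groups equal to the number of channels, it adds only kernel_size parameters per channel, and the clipped relative bias is shared across positions, which is how such a layer can add positional and local context at little memory cost compared with a full convolution or absolute positional embeddings.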
Related papers
- DOEI: Dual Optimization of Embedding Information for Attention-Enhanced Class Activation Maps [30.53564087005569]
Weakly supervised semantic segmentation (WSSS) typically utilizes limited semantic annotations to obtain initial Class Activation Maps (CAMs).
Due to the inadequate coupling between class activation responses and semantic information in high-dimensional space, the CAM is prone to object co-occurrence or under-activation.
We propose DOEI, Dual Optimization of Embedding Information, a novel approach that reconstructs embedding representations through semantic-aware attention weight matrices.
arXiv Detail & Related papers (2025-02-21T19:06:01Z)
- LM2: Large Memory Models [11.320069795732058]
This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module.
Experimental results on the BABILong benchmark demonstrate that the LM2 model outperforms both the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks.
arXiv Detail & Related papers (2025-02-09T22:11:42Z)
- Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression.
To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM).
To further foster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z)
- Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution [49.902047563260496]
We present the first attempt to integrate the Vision State Space Model (Mamba) into remote sensing image (RSI) super-resolution.
To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR.
Our FMSR features a multi-level fusion architecture equipped with the Frequency Selection Module (FSM), Vision State Space Module (VSSM), and Hybrid Gate Module (HGM).
arXiv Detail & Related papers (2024-05-08T11:09:24Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module -- the SAM-guidEd refinEment Module (SEEM).
This lightweight plug-in module is specifically designed to leverage the attention mechanism for the generation of semantic-aware features.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv Detail & Related papers (2023-05-11T02:02:53Z)
- Information-Theoretic Hashing for Zero-Shot Cross-Modal Retrieval [19.97731329580582]
In this paper, we consider a totally different way to construct (or learn) a common Hamming space from an information-theoretic perspective.
Specifically, our AIA module takes the inspiration from the Principle of Relevant Information (PRI) to construct a common space that adaptively aggregates the intrinsic semantics of different modalities of data.
Our SPE module further generates the hashing codes of different modalities by preserving the similarity of intrinsic semantics with the element-wise Kullback-Leibler (KL) divergence.
arXiv Detail & Related papers (2022-09-26T08:05:20Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both the RGB and depth modalities to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- Coarse-to-Fine Embedded PatchMatch and Multi-Scale Dynamic Aggregation for Reference-based Super-Resolution [48.093500219958834]
We propose an Accelerated Multi-Scale Aggregation network (AMSA) for Reference-based Super-Resolution.
The proposed AMSA achieves superior performance over state-of-the-art approaches on both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2022-01-12T08:40:23Z)
- Semantically Constrained Memory Allocation (SCMA) for Embedding in Efficient Recommendation Systems [27.419109620575313]
A key challenge for deep learning models is to work with millions of categorical classes or tokens.
We propose a novel formulation of memory shared embedding, where memory is shared in proportion to the overlap in semantic information.
We demonstrate a significant reduction in the memory footprint while maintaining performance.
arXiv Detail & Related papers (2021-02-24T19:55:49Z)
- A Holistically-Guided Decoder for Deep Representation Learning with Applications to Semantic Segmentation and Object Detection [74.88284082187462]
One common strategy is to adopt dilated convolutions in the backbone networks to extract high-resolution feature maps.
We propose a novel holistically-guided decoder to obtain high-resolution, semantic-rich feature maps.
arXiv Detail & Related papers (2020-12-18T10:51:49Z)