RCMHA: Relative Convolutional Multi-Head Attention for Natural Language
Modelling
- URL: http://arxiv.org/abs/2308.03429v1
- Date: Mon, 7 Aug 2023 09:24:24 GMT
- Title: RCMHA: Relative Convolutional Multi-Head Attention for Natural Language
Modelling
- Authors: Herman Sugiharto, Aradea, Husni Mubarok
- Abstract summary: Relative Convolutional Multi-Head Attention (RCMHA) achieves the highest accuracy, with a score of 0.572, compared with alternative attention modules.
RCMHA is also the most memory-efficient, averaging 2.98 GB, compared with RMHA, which requires 3.5 GB.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Attention module finds common usage in language modeling, presenting
distinct challenges within the broader scope of Natural Language Processing.
Multi-Head Attention (MHA) employs an absolute positional encoding, which
imposes limitations on token length and entails substantial memory consumption
during the processing of embedded inputs. The current remedy proposed by
researchers involves the utilization of relative positional encoding, similar
to the approach adopted in Transformer-XL or Relative Multi-Head Attention
(RMHA), albeit the employed architecture consumes considerable memory
resources. To address these challenges, this study endeavors to refine MHA,
leveraging relative positional encoding in conjunction with the Depth-Wise
Convolutional Layer architecture, which promises heightened accuracy coupled
with minimized memory usage. The proposed RCMHA framework entails the
modification of two integral components: firstly, the application of the
Depth-Wise Convolutional Layer to the input embedding, encompassing Query, Key,
and Value parameters; secondly, the incorporation of Relative Positional
Encoding into the attention scoring phase, harmoniously integrated with Scaled
Dot-Product Attention. Empirical experiments underscore the advantages of
RCMHA, wherein it exhibits superior accuracy, boasting a score of 0.572 in
comparison to alternative attention modules such as MHA, Multi-DConv-Head
Attention (MDHA), and RMHA. Concerning memory utilization, RCMHA emerges as the
most frugal, demonstrating an average consumption of 2.98 GB, compared with RMHA,
which necessitates 3.5 GB.
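The two modifications described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. It assumes a single head, causal masking (as is usual in language modelling), no learned linear projections, and a simplified relative term in which a distance embedding is dotted with the query before scaling. The names `depthwise_conv1d`, `relative_attention`, and `rel_emb` are hypothetical and chosen for illustration only.

```python
import math


def depthwise_conv1d(x, kernels):
    """Causal depthwise 1-D convolution along the sequence axis.

    x: list of T token vectors, each of length d.
    kernels: list of d kernels (one per channel), each of length k.
    Each channel is filtered independently, which is what makes it depthwise.
    """
    T, d = len(x), len(x[0])
    k = len(kernels[0])
    out = [[0.0] * d for _ in range(T)]
    for t in range(T):
        for c in range(d):
            acc = 0.0
            for j in range(k):
                src = t - j  # look back j steps; positions before 0 act as zero padding
                if src >= 0:
                    acc += kernels[c][j] * x[src][c]
            out[t][c] = acc
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def relative_attention(q, k, v, rel_emb):
    """Single-head causal scaled dot-product attention with a relative term.

    rel_emb[dist] is an embedding for the distance dist = i - j >= 0
    (an assumed, simplified form of relative positional encoding).
    """
    T, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(T):
        # Content score plus position score, for keys at or before position i.
        scores = [
            (dot(q[i], k[j]) + dot(q[i], rel_emb[i - j])) * scale
            for j in range(i + 1)
        ]
        w = softmax(scores)
        out.append([sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(d)])
    return out


# Toy input: 3 tokens with 2 channels each (illustrative values only).
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
kernels = [[0.5, 0.5], [0.5, 0.5]]   # one length-2 kernel per channel
rel_emb = [[0.0, 0.0] for _ in x]    # one embedding per distance 0..T-1
h = depthwise_conv1d(x, kernels)     # convolved input used for Query, Key, and Value
y = relative_attention(h, h, h, rel_emb)
```

In this sketch the same convolved sequence serves as Query, Key, and Value; a fuller model would presumably apply separate convolutions and projections per head, which are omitted here to keep the example short.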
Related papers
- Explicit Multi-head Attention for Inter-head Interaction in Large Language Models [70.96854312026319]
Multi-head Explicit Attention (MEA) is a simple yet effective attention variant that explicitly models cross-head interaction. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence. This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss.
arXiv Detail & Related papers (2026-01-27T13:45:03Z) - IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention [2.3959703715401903]
This paper proposes IMSE, a systematically optimized and ultra-lightweight network. We introduce two core innovations: 1) replacing the MET module with Amplitude-Aware Linear Attention (MALA) and 2) replacing the Deformable Embedding (DE) module with Inception Depthwise Convolution (IDConv). In experiments, IMSE significantly reduces the parameter count by 16.8% (from 0.513M to 0.427M) while achieving competitive performance comparable to the state-of-the-art on the PESQ metric (3.373).
arXiv Detail & Related papers (2025-11-18T14:11:54Z) - MPCM-Net: Multi-scale network integrates partial attention convolution with Mamba for ground-based cloud image segmentation [13.137436418148896]
Ground-based cloud image segmentation is a critical research domain for photovoltaic power forecasting. We propose MPCM-Net, a Multi-scale network that integrates Partial attention Convolutions with Mamba architectures to enhance segmentation accuracy and computational efficiency. As a key contribution to the community, we also introduce and release a dataset CSRC, which is a clear-label, fine-grained segmentation benchmark designed to overcome the critical limitations of existing public datasets.
arXiv Detail & Related papers (2025-11-12T06:17:49Z) - HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference [8.057006406834462]
Large Language Models (LLMs) have driven a growing demand for efficient inference in latency-sensitive applications. We present HALO, a heterogeneous memory-centric accelerator for these challenges. We show that HALO achieves up to 18x geometric mean speedup over AttAcc, an attention-optimized mapping, and 2.5x over CENT.
arXiv Detail & Related papers (2025-10-03T02:20:17Z) - DOEI: Dual Optimization of Embedding Information for Attention-Enhanced Class Activation Maps [30.53564087005569]
Weakly supervised semantic segmentation (WSSS) typically utilizes limited semantic annotations to obtain initial Class Activation Maps (CAMs).
Due to the inadequate coupling between class activation responses and semantic information in high-dimensional space, the CAM is prone to object co-occurrence or under-activation.
We propose DOEI, Dual Optimization of Embedding Information, a novel approach that reconstructs embedding representations through semantic-aware attention weight matrices.
arXiv Detail & Related papers (2025-02-21T19:06:01Z) - LM2: Large Memory Models [11.320069795732058]
This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module.
Experimental results on the BABILong benchmark demonstrate that the LM2 model outperforms both the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks.
arXiv Detail & Related papers (2025-02-09T22:11:42Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution [49.902047563260496]
We develop the first attempt to integrate the Vision State Space Model (Mamba) for remote sensing image (RSI) super-resolution.
To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR.
Our FMSR features a multi-level fusion architecture equipped with the Frequency Selection Module (FSM), Vision State Space Module (VSSM), and Hybrid Gate Module (HGM).
arXiv Detail & Related papers (2024-05-08T11:09:24Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module -- SAM-guidEd refinEment Module (SEEM).
This light-weight plug-in module is specifically designed to leverage the attention mechanism for the generation of semantic-aware feature.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv Detail & Related papers (2023-05-11T02:02:53Z) - Information-Theoretic Hashing for Zero-Shot Cross-Modal Retrieval [19.97731329580582]
In this paper, we consider a totally different way to construct (or learn) a common hamming space from an information-theoretic perspective.
Specifically, our AIA module takes the inspiration from the Principle of Relevant Information (PRI) to construct a common space that adaptively aggregates the intrinsic semantics of different modalities of data.
Our SPE module further generates the hashing codes of different modalities by preserving the similarity of intrinsic semantics with the element-wise Kullback-Leibler (KL) divergence.
arXiv Detail & Related papers (2022-09-26T08:05:20Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - Coarse-to-Fine Embedded PatchMatch and Multi-Scale Dynamic Aggregation for Reference-based Super-Resolution [48.093500219958834]
We propose an Accelerated Multi-Scale Aggregation network (AMSA) for Reference-based Super-Resolution.
The proposed AMSA achieves superior performance over state-of-the-art approaches on both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2022-01-12T08:40:23Z) - Semantically Constrained Memory Allocation (SCMA) for Embedding in Efficient Recommendation Systems [27.419109620575313]
A key challenge for deep learning models is to work with millions of categorical classes or tokens.
We propose a novel formulation of memory shared embedding, where memory is shared in proportion to the overlap in semantic information.
We demonstrate a significant reduction in the memory footprint while maintaining performance.
arXiv Detail & Related papers (2021-02-24T19:55:49Z) - A Holistically-Guided Decoder for Deep Representation Learning with Applications to Semantic Segmentation and Object Detection [74.88284082187462]
One common strategy is to adopt dilated convolutions in the backbone networks to extract high-resolution feature maps.
We propose one novel holistically-guided decoder which is introduced to obtain the high-resolution semantic-rich feature maps.
arXiv Detail & Related papers (2020-12-18T10:51:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.