Shared DIFF Transformer
- URL: http://arxiv.org/abs/2501.17900v1
- Date: Wed, 29 Jan 2025 09:29:07 GMT
- Title: Shared DIFF Transformer
- Authors: Yueyang Cang, Yuhang Liu, Xiaoteng Zhang, Xiangju Wang
- Abstract summary: DIFF Transformer improves attention allocation by enhancing focus on relevant context while suppressing noise.
We propose Shared DIFF Transformer, which draws on the idea of a differential amplifier by introducing a shared base matrix to model global patterns.
This design significantly reduces parameter redundancy, improves efficiency, and retains strong noise suppression capabilities.
- Score: 4.289692335378565
- License:
- Abstract: DIFF Transformer improves attention allocation by enhancing focus on relevant context while suppressing noise. It introduces a differential attention mechanism that calculates the difference between two independently generated attention distributions, effectively reducing noise and promoting sparse attention patterns. However, the independent signal generation in DIFF Transformer results in parameter redundancy and suboptimal utilization of information. In this work, we propose Shared DIFF Transformer, which draws on the idea of a differential amplifier by introducing a shared base matrix to model global patterns and incorporating low-rank updates to enhance task-specific flexibility. This design significantly reduces parameter redundancy, improves efficiency, and retains strong noise suppression capabilities. Experimental results show that, compared to DIFF Transformer, our method achieves better performance in tasks such as long-sequence modeling, key information retrieval, and in-context learning. Our work provides a novel and efficient approach to optimizing differential attention mechanisms and advancing robust Transformer architectures.
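The abstract describes differential attention computed as the difference of two attention maps, with both branches sharing a base projection matrix and differing only by low-rank updates. The following is a minimal NumPy sketch of that idea; the function name, the decision to apply the low-rank update only to the query projection, and all shapes are illustrative assumptions, not the paper's actual parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_diff_attention(X, W_base_q, W_base_k, A1, B1, A2, B2, W_v, lam=0.5):
    """Illustrative differential attention with a shared base matrix.

    Both branches reuse W_base_q; only the low-rank updates A @ B differ,
    which is what cuts parameter redundancy relative to two fully
    independent projections (hypothetical simplification of the paper).
    """
    Q1 = X @ (W_base_q + A1 @ B1)   # branch 1: shared base + low-rank update
    Q2 = X @ (W_base_q + A2 @ B2)   # branch 2: same base, different update
    K = X @ W_base_k
    V = X @ W_v
    d = K.shape[-1]
    attn1 = softmax(Q1 @ K.T / np.sqrt(d))
    attn2 = softmax(Q2 @ K.T / np.sqrt(d))
    # Differential step: subtracting the second map cancels common-mode
    # attention noise, in the spirit of a differential amplifier.
    return (attn1 - lam * attn2) @ V
```

Note that when the two branches are identical and lam = 1, the differential output is exactly zero, which is the common-mode cancellation the differential-amplifier analogy refers to.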
Related papers
- RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers [11.003945673813488]
We propose the Relevance-Guided Efficient Controllable Generation framework, RelaCtrl.
We evaluate the relevance of each layer in the Diffusion Transformer to the control information.
We then tailor the positioning, parameter scale, and modeling capacity of the control layers to reduce unnecessary parameters and redundant computations.
arXiv Detail & Related papers (2025-02-20T09:10:05Z)
- DINT Transformer [5.990713912057883]
DIFF Transformer addresses the issue of irrelevant context interference by introducing a differential attention mechanism.
We propose DINT Transformer, which extends DIFF Transformer by incorporating a differential-integral mechanism.
arXiv Detail & Related papers (2025-01-29T08:53:29Z)
- Transforming Vision Transformer: Towards Efficient Multi-Task Asynchronous Learning [59.001091197106085]
Multi-Task Learning (MTL) for Vision Transformer aims at enhancing the model capability by tackling multiple tasks simultaneously.
Most recent works have predominantly focused on designing Mixture-of-Experts (MoE) structures and integrating Low-Rank Adaptation (LoRA) to efficiently perform multi-task learning.
We propose a novel approach dubbed Efficient Multi-Task Learning (EMTAL) by transforming a pre-trained Vision Transformer into an efficient multi-task learner.
arXiv Detail & Related papers (2025-01-12T17:41:23Z)
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [83.48423407316713]
We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately.
Our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail.
Our method achieves a state-of-the-art FID score of 2.01 when integrated with the recent work SiT.
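The entry above describes mediator tokens that engage with queries and keys separately. A plausible way to read this is two-stage attention routed through a small set of mediators, which reduces the quadratic cost of full attention; the sketch below is an assumption about the mechanism, not the paper's exact formulation, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mediator_attention(Q, K, V, M):
    """Attention routed through m mediator tokens (illustrative sketch).

    Queries attend to mediators, and mediators attend to keys, so the
    cost is O(n*m) per stage instead of O(n^2) when m << n.
    """
    d = Q.shape[-1]
    q_to_m = softmax(Q @ M.T / np.sqrt(d))   # (n, m): queries -> mediators
    m_to_k = softmax(M @ K.T / np.sqrt(d))   # (m, n): mediators -> keys
    return q_to_m @ (m_to_k @ V)             # (n, d_v)
```

Because both stages are row-stochastic, the composed map is still a convex combination over value rows, preserving the usual normalization property of attention.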
arXiv Detail & Related papers (2024-08-11T07:01:39Z)
- Exploring Frequency-Inspired Optimization in Transformer for Efficient Single Image Super-Resolution [32.29219284419944]
We propose the cross-refinement adaptive feature modulation transformer (CRAFT).
We introduce a frequency-guided post-training quantization (PTQ) method aimed at enhancing CRAFT's efficiency.
Our experimental findings showcase CRAFT's superiority over current state-of-the-art methods.
arXiv Detail & Related papers (2023-08-09T15:38:36Z)
- Over-the-Air Federated Multi-Task Learning via Model Sparsification and Turbo Compressed Sensing [48.19771515107681]
We propose an over-the-air FMTL framework, where multiple learning tasks deployed on edge devices share a non-orthogonal fading channel under the coordination of an edge server.
In OA-FMTL, the local updates of edge devices are sparsified, compressed, and then sent over the uplink channel in a superimposed fashion.
We analyze the performance of the proposed OA-FMTL framework together with the M-Turbo-CS algorithm.
arXiv Detail & Related papers (2022-05-08T08:03:52Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.