Related papers: DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

URL: http://arxiv.org/abs/2402.02622v2
Date: Thu, 21 Mar 2024 10:57:40 GMT
Title: DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
Authors: Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, Martin Jaggi,
Abstract summary: We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size. Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations. Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity of much deeper transformer models.
Score: 34.643717080240584
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size -- adding a few thousand parameters for large-scale models in the 100B parameters range. Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations -- we refer to this operation as Depth-Weighted-Average (DWA). The learned DWA weights exhibit coherent patterns of information flow, revealing the strong and structured reuse of activations from distant layers. Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity of much deeper transformer models, and that for the same perplexity, these new models outperform transformer baselines in terms of memory efficiency and inference time.

Related papers

Scaling Recommender Transformers to One Billion Parameters [0.0]
We present a recipe for training large transformer recommenders with up to a billion parameters.<n>We show that autoregressive learning on user histories naturally decomposes into two subtasks, feedback prediction and next-item prediction.<n>We report a successful deployment of our proposed architecture on a large-scale music platform serving millions of users.
arXiv Detail & Related papers (2025-07-21T18:30:43Z)
Re-Parameterization of Lightweight Transformer for On-Device Speech Emotion Recognition [10.302458835329539]
We introduce a new method, namely Transformer Re- parameterization, to boost the performance of lightweight Transformer models. Experimental results show that our proposed method consistently improves the performance of lightweight Transformers, even making them comparable to large models.
arXiv Detail & Related papers (2024-11-14T10:36:19Z)
Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs. We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation. With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z)
Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model. Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z)
CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation. We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales. We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
Converting Transformers to Polynomial Form for Secure Inference Over Homomorphic Encryption [45.00129952368691]
Homomorphic Encryption (HE) has emerged as one of the most promising approaches in deep learning. We introduce the first transformer, providing the first demonstration of secure inference over HE with transformers. Our models yield results comparable to traditional methods, bridging the performance gap with transformers of similar scale and underscoring the viability of HE for state-of-the-art applications.
arXiv Detail & Related papers (2023-11-15T00:23:58Z)
Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications. The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate. There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions. Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers [4.899818550820576]
We construct a Legendre Memory Unit based model that introduces a general prior for sequence processing. We show that our new architecture attains the same accuracy as transformers with 10x fewer tokens.
arXiv Detail & Related papers (2021-10-05T23:20:37Z)
PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution vision tasks with transformers, it directly translates the image feature map into the object result. Recent transformer-based image recognition model andTT show consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture. We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions. We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.