Related papers: White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

URL: http://arxiv.org/abs/2311.13110v4
Date: Fri, 6 Sep 2024 07:40:40 GMT
Title: White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
Authors: Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D. Haeffele, Yi Ma
Abstract summary: We show a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets.
Score: 27.58916930770997
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
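For orientation, the coding rate mentioned in the abstract can be sketched numerically. The snippet below is a minimal illustration, assuming the log-det coding rate R(Z; eps) = 1/2 logdet(I + d/(N eps^2) Z Z^T) used in the rate reduction literature; the abstract itself does not state the formula, and the function name `coding_rate`, the value of `eps`, and the synthetic token sets are hypothetical choices rather than the CRATE implementation.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Assumed log-det coding rate of a token matrix Z with shape (d, N)."""
    d, n = Z.shape
    # I + d / (N * eps^2) * Z Z^T is symmetric positive definite.
    gram = np.eye(d) + (d / (n * eps**2)) * (Z @ Z.T)
    _, logdet = np.linalg.slogdet(gram)
    return 0.5 * logdet

rng = np.random.default_rng(0)
d, n = 16, 256

# Tokens spread over all d dimensions.
full = rng.standard_normal((d, n))

# Tokens confined to a 2-dimensional subspace, rescaled to the same total energy.
basis = np.linalg.qr(rng.standard_normal((d, 2)))[0]
low = basis @ rng.standard_normal((2, n))
low *= np.linalg.norm(full) / np.linalg.norm(low)

rate_full = coding_rate(full)
rate_low = coding_rate(low)
print("coding rate, full-dimensional tokens:", rate_full)
print("coding rate, low-dimensional tokens: ", rate_low)
```

With equal energy, the tokens lying in a low-dimensional subspace get a smaller coding rate, which is the sense in which such a representation counts as more compressed.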

Related papers

Attention-Only Transformers via Unrolled Subspace Denoising [19.832264029213515]
We derive a fully interpretable transformer architecture with only necessary components. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture. Despite its simplicity, experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures.
arXiv Detail & Related papers (2025-06-04T09:53:14Z)
Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
Variable-size Symmetry-based Graph Fourier Transforms for image compression [65.7352685872625]
We introduce a new family of Symmetry-based Graph Fourier Transforms (SBGFTs) of variable sizes into a coding framework. Our proposed algorithm generates symmetric graphs on the grid by adding specific symmetrical connections between nodes. Experiments show that SBGFTs outperform the primary transforms integrated in the explicit Multiple Transform Selection.
arXiv Detail & Related papers (2024-11-24T13:00:44Z)
Rethinking Decoders for Transformer-based Semantic Segmentation: Compression is All You Need [3.218600495900291]
We argue that there are fundamental connections between semantic segmentation and compression. We derive a white-box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT). Experiments on the ADE20K dataset find that DEPICT consistently outperforms its black-box counterpart, Segmenter.
arXiv Detail & Related papers (2024-11-05T12:10:02Z)
High-Performance Transformers for Table Structure Recognition Need Early Convolutions [25.04573593082671]
Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder. We design a lightweight visual encoder for table structure recognition (TSR) without sacrificing expressive power. We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model.
arXiv Detail & Related papers (2023-11-09T18:20:52Z)
AICT: An Adaptive Image Compression Transformer [18.05997169440533]
We propose a more straightforward yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT). The proposed ICT can capture both global and local contexts from the latent representations. We leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract more compact latent representations.
arXiv Detail & Related papers (2023-07-12T11:32:02Z)
White-Box Transformers via Sparse Rate Reduction [25.51855431031564]
We show a family of white-box transformer-like deep network architectures which are mathematically fully interpretable. Experiments show that these networks indeed learn to optimize the designed objective.
arXiv Detail & Related papers (2023-06-01T20:28:44Z)
Compression with Bayesian Implicit Neural Representations [16.593537431810237]
We propose overfitting variational neural networks to the data and compressing an approximate posterior weight sample using relative entropy coding instead of quantizing and entropy coding it. Experiments show that our method achieves strong performance on image and audio compression while retaining simplicity.
arXiv Detail & Related papers (2023-05-30T16:29:52Z)
The Devil Is in the Details: Window-based Attention for Image Compression [58.1577742463617]
Most existing learned image compression models are based on Convolutional Neural Networks (CNNs). In this paper, we study the effects of multiple kinds of attention mechanisms for local feature learning, then introduce a more straightforward yet effective window-based local attention block. The proposed window-based attention is flexible and can work as a plug-and-play component to enhance CNN and Transformer models.
arXiv Detail & Related papers (2022-03-16T07:55:49Z)
CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that combines the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning. The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery. The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
Communication-Efficient Federated Learning via Quantized Compressed Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server. Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression. We demonstrate that the framework achieves performance almost identical to the case with no compression.
arXiv Detail & Related papers (2021-11-30T02:13:54Z)
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks [70.0243910593064]
Key to the success of vector quantization is deciding which parameter groups should be compressed together. In this paper we make the observation that the weights of two adjacent layers can be permuted while expressing the same function. We then establish a connection to rate-distortion theory and search for permutations that result in networks that are easier to compress.
arXiv Detail & Related papers (2020-10-29T15:47:26Z)
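
The observation in the last entry, that the weights of two adjacent layers can be permuted while expressing the same function, can be checked on a toy network. The sketch below is only an illustration under assumed settings: a two-layer ReLU MLP with hypothetical shapes and names (`mlp`, `W1`, `W2`, `perm`), not the method or code of that paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer MLP: 4 inputs, 8 hidden units, 3 outputs.
W1 = rng.standard_normal((8, 4))
b1 = rng.standard_normal(8)
W2 = rng.standard_normal((3, 8))
b2 = rng.standard_normal(3)

def mlp(x, W1, b1, W2, b2):
    """Two-layer MLP with an elementwise ReLU between the layers."""
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

# Reorder the hidden units: permute the rows of the first layer
# and apply the same permutation to the columns of the second layer.
perm = rng.permutation(8)
W1_p, b1_p = W1[perm], b1[perm]
W2_p = W2[:, perm]

x = rng.standard_normal(4)
out = mlp(x, W1, b1, W2, b2)
out_p = mlp(x, W1_p, b1_p, W2_p, b2)
print("outputs match:", np.allclose(out, out_p))
```

Because ReLU acts elementwise, reordering the hidden units only reorders the terms of the sum computed by the second layer, so the output is unchanged while the weight matrices take a different layout.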

This list is automatically generated from the titles and abstracts of the papers on this site.