Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
- URL: http://arxiv.org/abs/2511.00833v1
- Date: Sun, 02 Nov 2025 07:04:12 GMT
- Title: Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
- Authors: Yifan Pu, Jixuan Ying, Qixiu Li, Tianzhu Ye, Dongchen Han, Xiaochen Wang, Ziyi Wang, Xinyu Shao, Gao Huang, Xiu Li
- Abstract summary: We introduce Visual-Contrast Attention (VCA), a drop-in replacement for Multi-Head Self-Attention (MHSA). VCA injects an explicit notion of discrimination while reducing the theoretical complexity from O(N^2 C) to O(N n C) with n << N. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic.
- Score: 34.77694214755808
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from O(N^2 C) to O(N n C) with n << N. VCA first distils each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at https://github.com/LeapLabTHU/LinearDiff.
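The abstract describes the mechanism only at a high level, so the block below is a minimal, hypothetical PyTorch sketch of how a positive/negative differential over spatially pooled contrast tokens could be wired. The module name, the use of adaptive average pooling for the token distillation, the shape of the dual positional embeddings, and the plain subtraction of the two softmax streams are all assumptions rather than the authors' implementation; the official code is at https://github.com/LeapLabTHU/LinearDiff.

```python
# Hypothetical sketch of a Visual-Contrast Attention (VCA) style block,
# reconstructed only from the abstract above; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualContrastAttention(nn.Module):
    def __init__(self, dim, num_heads=3, pooled_size=7):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.pooled_size = pooled_size                       # n = pooled_size**2 << N
        self.q = nn.Linear(dim, dim)
        self.kv_pos = nn.Linear(dim, 2 * dim)                 # positive contrast stream
        self.kv_neg = nn.Linear(dim, 2 * dim)                 # negative contrast stream
        # dual positional embeddings for the two streams (assumed parameterization)
        self.pos_embed_pos = nn.Parameter(torch.zeros(1, pooled_size ** 2, dim))
        self.pos_embed_neg = nn.Parameter(torch.zeros(1, pooled_size ** 2, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) patch tokens on an H x W grid, N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # distil the dense token field into n spatially pooled contrast tokens
        grid = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = F.adaptive_avg_pool2d(grid, self.pooled_size)          # (B, C, p, p)
        pooled = pooled.flatten(2).transpose(1, 2)                      # (B, n, C)

        out = 0
        for kv, pe, sign in ((self.kv_pos, self.pos_embed_pos, 1.0),
                             (self.kv_neg, self.pos_embed_neg, -1.0)):
            k, v = kv(pooled + pe).chunk(2, dim=-1)
            k = k.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
            v = v.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
            attn = (q @ k.transpose(-2, -1)) * self.scale               # (B, h, N, n)
            # differential interaction: positive stream minus negative stream
            out = out + sign * (attn.softmax(dim=-1) @ v)               # O(N * n * C)

        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

With pooled_size=7 the contrast streams hold n = 49 tokens, so the query-key product costs O(N n C) instead of the O(N^2 C) of dense MHSA, which matches the complexity reduction claimed in the abstract even if the paper's exact stream design differs.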
Related papers
- D3R-Net: Dual-Domain Denoising Reconstruction Network for Robust Industrial Anomaly Detection [0.0]
Unsupervised anomaly detection (UAD) is a key ingredient of automated visual inspection in modern manufacturing. We introduce D3R-Net, a Dual-Domain Denoising Reconstruction framework that couples a self-supervised 'healing' task with frequency-aware regularization. In addition to the spatial mean squared error, we employ a Fast Fourier Transform (FFT) magnitude loss that encourages consistency in the frequency domain.
arXiv Detail & Related papers (2026-01-27T23:21:59Z) - EVCC: Enhanced Vision Transformer-ConvNeXt-CoAtNet Fusion for Classification [0.5394291557377919]
Hybrid vision architectures combining Transformers and CNNs have significantly advanced image classification, but they usually do so at significant computational cost. We introduce EVCC, a novel multi-branch architecture integrating the Vision Transformer, lightweight ConvNeXt, and CoAtNet. Experiments across the CIFAR-100, Tobacco3482, CelebA, and Brain Cancer datasets demonstrate EVCC's superiority over powerful models.
arXiv Detail & Related papers (2025-11-24T02:11:19Z) - Octic Vision Transformers: Quicker ViTs Through Equivariance [29.044546222577804]
We introduce Octic Vision Transformers (octic ViTs) to capture geometric symmetries. Our octic linear layers achieve 5.33x reductions in FLOPs and up to 8x reductions in memory. We train octic ViTs supervised (DeiT-III) and unsupervised (DINOv2) on ImageNet-1K.
arXiv Detail & Related papers (2025-05-21T12:22:53Z) - Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery [0.0]
Vision Transformers (ViT) have brought a new wave of research in the field of computer vision. This paper focuses on the comparison of three key factors of using (or not using) ViT for semantic segmentation of aerial images. We show that a novel combined weighted loss function significantly boosts the CNN model's performance compared to transfer learning with ViT.
arXiv Detail & Related papers (2024-11-14T00:18:04Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z) - Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with a lightweight model: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [68.55487598401788]
Recent advances in spatiotemporal representation learning have been mainly driven by 3D convolutional neural networks and vision transformers.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of 3D convolution and self-attention in a concise transformer format.
We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2.
Our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods.
arXiv Detail & Related papers (2022-01-12T20:02:32Z) - Global Filter Networks for Image Classification [90.81352483076323]
We present the Global Filter Network (GFNet), a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
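As a concrete illustration of the frequency-domain mixing described in the GFNet entry above, here is a minimal sketch of a global filter layer: a 2D FFT over the token grid, an element-wise product with a learnable complex filter, and an inverse FFT. The tensor layout and the (real, imag) filter parameterization are illustrative assumptions, not necessarily the paper's exact implementation.

```python
# Minimal sketch of a frequency-domain global filter layer in the spirit of GFNet.
# Shapes and initialization here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.fft

class GlobalFilterLayer(nn.Module):
    def __init__(self, dim, h=14, w=8):
        super().__init__()
        # learnable complex filter stored as (real, imag);
        # h must equal the token-grid height and w = grid_width // 2 + 1 for rfft2
        self.filter = nn.Parameter(torch.randn(h, w, dim, 2) * 0.02)

    def forward(self, x, H, W):
        # x: (B, N, C) patch tokens on an H x W grid, N = H * W
        B, N, C = x.shape
        x = x.view(B, H, W, C)
        x_freq = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")        # (B, H, W//2+1, C)
        x_freq = x_freq * torch.view_as_complex(self.filter)          # global spatial mixing
        x = torch.fft.irfft2(x_freq, s=(H, W), dim=(1, 2), norm="ortho")
        return x.reshape(B, N, C)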