Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training
- URL: http://arxiv.org/abs/2603.00518v1
- Date: Sat, 28 Feb 2026 07:31:43 GMT
- Title: Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training
- Authors: Quan Kong, Yanru Xiao, Yuhao Shen, Cong Wang
- Abstract summary: We introduce Test-Time Training (TTT), a linear-time sequence modeling method, into vision. Vision-TTT compresses the visual token sequence in a novel self-supervised learning manner. Experiments show that Vittt-T/S/B achieve 77.3%, 81.2%, and 82.5% Top-1 accuracy on ImageNet classification.
- Score: 12.926316141126946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning efficient and expressive visual representations has long been a pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision learners, their applications are plagued by the quadratic complexity of the self-attention mechanism. To address this challenge, we introduce Test-Time Training (TTT), a new linear-time sequence modeling method, into vision and propose Vision-TTT, which compresses the visual token sequence in a novel self-supervised learning manner. By incorporating a bidirectional scan strategy and a Conv2d module, Vision-TTT effectively extends vanilla TTT to model 2D visual correlations with global receptive fields. Extensive experiments show that \texttt{Vittt-T/S/B} achieve 77.3%, 81.2%, and 82.5% Top-1 accuracy on ImageNet classification and also greatly outperform their counterparts on downstream tasks. At 1280x1280 resolution, \texttt{Vittt-T} reduces FLOPs by 79.4% and runs 4.38x faster with 88.9% less memory than DeiT-T. These results demonstrate the expressiveness and efficiency of Vision-TTT as a strong candidate for the next-generation generic visual backbone.
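The abstract does not include code, but the core idea it describes (a linear-time, TTT-style token mixer applied bidirectionally over the patch sequence) can be illustrated with a short sketch. The snippet below is not the authors' implementation: the linear fast-weight model, the single gradient step per token, the sum-fusion of the two scan directions, and the omission of the Conv2d module are all simplifying assumptions made for clarity.

```python
import torch
import torch.nn as nn


class NaiveTTTScan(nn.Module):
    """Linear-time scan whose hidden state is the weight matrix of a tiny
    linear model, updated per token against a self-supervised
    reconstruction objective and then queried to produce the output."""

    def __init__(self, dim: int, inner_lr: float = 0.1):
        super().__init__()
        self.key = nn.Linear(dim, dim, bias=False)    # corrupted "view" fed to the inner model
        self.value = nn.Linear(dim, dim, bias=False)  # reconstruction target
        self.query = nn.Linear(dim, dim, bias=False)  # view used to read the updated state
        self.inner_lr = inner_lr

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        B, L, D = x.shape
        W = x.new_zeros(B, D, D)  # fast weights, one matrix per sequence
        outputs = []
        for t in range(L):
            k = self.key(x[:, t])      # (B, D)
            v = self.value(x[:, t])    # (B, D)
            q = self.query(x[:, t])    # (B, D)
            # Inner self-supervised loss: 0.5 * || k W - v ||^2
            err = torch.bmm(k.unsqueeze(1), W).squeeze(1) - v    # (B, D)
            grad = torch.bmm(k.unsqueeze(2), err.unsqueeze(1))   # (B, D, D), outer(k, err)
            W = W - self.inner_lr * grad                         # one gradient step per token
            outputs.append(torch.bmm(q.unsqueeze(1), W).squeeze(1))
        return torch.stack(outputs, dim=1)  # (B, L, D)


class BidirectionalTTTBlock(nn.Module):
    """Runs the scan over the flattened patch sequence in both directions and
    fuses them by summation (the fusion choice is an assumption)."""

    def __init__(self, dim: int):
        super().__init__()
        self.fwd = NaiveTTTScan(dim)
        self.bwd = NaiveTTTScan(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        mixed = self.fwd(h) + self.bwd(h.flip(1)).flip(1)
        return x + mixed  # residual connection


if __name__ == "__main__":
    patches = torch.randn(2, 196, 64)  # e.g., 14x14 patches with 64-dim embeddings
    block = BidirectionalTTTBlock(64)
    print(block(patches).shape)        # torch.Size([2, 196, 64])
```

The per-token Python loop is used purely for readability; the cost still grows linearly with sequence length, which is the property the abstract contrasts with the quadratic cost of self-attention.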
Related papers
- Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling [4.365909537198615]
We present a new technique to improve the performance of a Vision Transformer (ViT) that employs a two-stage training strategy. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves a Single Image Super-Resolution (SISR) score of 0.712 and a PSNR of 22.90 dB.
arXiv Detail & Related papers (2025-12-02T08:10:55Z) - ViT$^3$: Unlocking Test-Time Training in Vision [56.74014676094694]
Test-Time Training (TTT) has emerged as a promising direction for efficient sequence modeling. We present a systematic empirical study of TTT designs for visual sequence modeling. We conclude with the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation.
arXiv Detail & Related papers (2025-12-01T13:14:48Z) - Octic Vision Transformers: Quicker ViTs Through Equivariance [29.044546222577804]
We introduce Octic Vision Transformers (octic ViTs) to capture geometric symmetries. Our octic linear layers achieve 5.33x reductions in FLOPs and up to 8x reductions in memory. We train octic ViTs supervised (DeiT-III) and unsupervised (DINOv2) on ImageNet-1K.
arXiv Detail & Related papers (2025-05-21T12:22:53Z) - Supervised Fine-tuning in turn Improves Visual Foundation Models [74.1760864718129]
A two-stage method, ViSFT (Vision SFT), is proposed to unleash the fine-grained knowledge of vision foundation models.
A vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks.
arXiv Detail & Related papers (2024-01-18T18:58:54Z) - Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [48.233300343211205]
We propose a new generic vision backbone with bidirectional Mamba blocks (Vim).
Vim marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models.
The results demonstrate that Vim is capable of overcoming the computation and memory constraints of performing Transformer-style understanding for high-resolution images.
arXiv Detail & Related papers (2024-01-17T18:56:18Z) - ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights [61.36309876889977]
ViT-Lens enables efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning to a pre-defined space.
In zero-shot 3D classification, ViT-Lens achieves substantial improvements over previous state-of-the-art.
We will release the results of ViT-Lens on more modalities in the near future.
arXiv Detail & Related papers (2023-08-20T07:26:51Z) - TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection [61.0662744915659]
We propose an efficient vision-and-language pre-training model with Text-Relevant Image Patch Selection, namely TRIPS. TRIPS reduces the visual sequence progressively with a text-guided patch-selection layer in the visual backbone for efficient training and inference (a rough sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-05-08T05:53:30Z) - Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z) - A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performances in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z) - Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Our results show that, by combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
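As a rough illustration of the text-guided patch selection summarized in the TRIPS entry above, the following sketch keeps only the image patches that score highest against the text tokens. The cosine-similarity scoring, the max-over-text-tokens reduction, and the fixed keep ratio are assumptions for clarity, not the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F


def select_text_relevant_patches(
    patch_tokens: torch.Tensor,   # (B, N_patches, D) visual tokens
    text_tokens: torch.Tensor,    # (B, N_text, D) text tokens in the same embedding space
    keep_ratio: float = 0.5,
) -> torch.Tensor:
    """Return the keep_ratio fraction of patches that best match the text."""
    # Cosine similarity between every patch and every text token.
    p = F.normalize(patch_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)
    sim = torch.einsum("bnd,bmd->bnm", p, t)     # (B, N_patches, N_text)
    scores = sim.max(dim=-1).values              # best-matching text token per patch
    k = max(1, int(patch_tokens.shape[1] * keep_ratio))
    top_idx = scores.topk(k, dim=-1).indices     # (B, k)
    # Gather the selected patches, preserving the embedding dimension.
    idx = top_idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1])
    return patch_tokens.gather(1, idx)           # (B, k, D)


if __name__ == "__main__":
    patches = torch.randn(2, 196, 256)
    text = torch.randn(2, 12, 256)
    print(select_text_relevant_patches(patches, text).shape)  # torch.Size([2, 98, 256])
```

Applying such a selection layer progressively at several depths of the visual backbone is what shrinks the visual sequence (and hence training and inference cost) in the spirit of the paper's description.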