ViT$^3$: Unlocking Test-Time Training in Vision
- URL: http://arxiv.org/abs/2512.01643v1
- Date: Mon, 01 Dec 2025 13:14:48 GMT
- Title: ViT$^3$: Unlocking Test-Time Training in Vision
- Authors: Dongchen Han, Yining Li, Tianyu Li, Zixuan Cao, Ziming Wang, Jun Song, Yu Cheng, Bo Zheng, Gao Huang
- Abstract summary: Test-Time Training (TTT) has emerged as a promising direction for efficient sequence modeling. We present a systematic empirical study of TTT designs for visual sequence modeling. We conclude with the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation.
- Score: 56.74014676094694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates the attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code is available at https://github.com/LeapLabTHU/ViTTT.
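The abstract's core mechanism, treating attention as online learning over key-value pairs, can be illustrated with a minimal sketch. The snippet below shows a generic TTT-style inner loop: a linear inner model is updated by one gradient step per token on a squared-error loss over the current (key, value) pair, and queries are read out through the updated model. The shapes, inner loss, and learning rate are illustrative assumptions, not the exact ViT$^3$ design.

```python
# Minimal sketch of the generic TTT idea (not the exact ViT^3 architecture):
# the inner model is a linear map W trained online on (key, value) pairs,
# and query tokens are read out through the current W.
import numpy as np

def ttt_linear(q, k, v, lr=0.1):
    """q, k, v: arrays of shape (seq_len, dim). Returns outputs of shape (seq_len, dim)."""
    seq_len, dim = q.shape
    W = np.zeros((dim, dim))      # compact inner model, constructed at test time
    outputs = np.empty_like(q)
    for t in range(seq_len):
        # Inner training step: one gradient step on 0.5 * ||W k_t - v_t||^2.
        err = W @ k[t] - v[t]             # (dim,)
        grad = np.outer(err, k[t])        # gradient of the loss w.r.t. W
        W = W - lr * grad
        # Read out: apply the updated inner model to the query token.
        outputs[t] = W @ q[t]
    return outputs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
    print(ttt_linear(q, k, v).shape)  # (16, 8) -- cost grows linearly in sequence length
```

The loop is written sequentially for clarity; the parallelizable computation claimed in the paper comes from chunk-wise formulations of this update, which are not shown here.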
Related papers
- Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training [12.926316141126946]
We introduce a new linear-time sequence modeling method, Test-Time Training (TTT), into vision. Vision-TTT compresses the visual token sequence in a novel self-supervised learning manner. Experiments show that Vittt-T/S/B achieve 77.3%, 81.2%, and 82.5% Top-1 accuracy on ImageNet classification.
arXiv Detail & Related papers (2026-02-28T07:31:43Z) - Transformed Multi-view 3D Shape Features with Contrastive Learning [1.5292939414871212]
Vision Transformer (ViT)-based architectures achieve promising results in multi-view 3D analysis. ViTs' ability to understand overall shapes and contrastive learning's effectiveness overcome the need for extensive labeled data.
arXiv Detail & Related papers (2025-10-22T18:29:48Z) - Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning [109.84783476090028]
We introduce Zebra-CoT, a diverse large-scale dataset with 182,384 samples. We focus on four categories of tasks where sketching or visual reasoning is especially natural. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains.
arXiv Detail & Related papers (2025-07-22T16:35:36Z) - CTA: Cross-Task Alignment for Better Test Time Training [10.54024648915477]
Test-Time Training (TTT) has emerged as an effective method to enhance model robustness. We introduce CTA (Cross-Task Alignment), a novel approach for improving TTT. We show substantial improvements in robustness and generalization over the state-of-the-art on several benchmark datasets.
arXiv Detail & Related papers (2025-07-07T17:33:20Z) - Octic Vision Transformers: Quicker ViTs Through Equivariance [29.044546222577804]
We introduce Octic Vision Transformers (octic ViTs) to capture geometric symmetries. Our octic linear layers achieve 5.33x reductions in FLOPs and up to 8x reductions in memory. We train octic ViTs supervised (DeiT-III) and unsupervised (DINOv2) on ImageNet-1K.
arXiv Detail & Related papers (2025-05-21T12:22:53Z) - Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection [128.40330044868293]
The Vision Transformer (ViT), with its more straightforward architecture, has proven effective in multiple domains.
ViTAD achieves state-of-the-art results and efficiency on MVTec AD, VisA, and Uni-Medical datasets.
arXiv Detail & Related papers (2023-12-12T18:28:59Z) - ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights [61.36309876889977]
ViT-Lens enables efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space.
In zero-shot 3D classification, ViT-Lens achieves substantial improvements over previous state-of-the-art.
We will release the results of ViT-Lens on more modalities in the near future.
arXiv Detail & Related papers (2023-08-20T07:26:51Z) - Experts Weights Averaging: A New General Training Scheme for Vision Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference (see the illustrative sketch after this list).
arXiv Detail & Related papers (2023-08-11T12:05:12Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
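As referenced in the Experts Weights Averaging entry above, here is a hedged sketch of the experts-averaging step that entry describes: a toy mixture of expert MLPs is collapsed into a single FFN by parameter averaging. The two-layer expert layout, names, and shapes are assumptions for illustration, not that paper's exact modules.

```python
# Hedged sketch of "convert each MoE into an FFN by averaging the experts".
# The Expert layout below is an assumed toy two-layer MLP parameterization.
import numpy as np

def make_expert(d_model=8, d_hidden=16, rng=None):
    """Create one toy expert MLP as a dict of weight matrices."""
    if rng is None:
        rng = np.random.default_rng()
    return {"w1": rng.standard_normal((d_model, d_hidden)),
            "w2": rng.standard_normal((d_hidden, d_model))}

def average_experts(experts):
    """Collapse a list of expert MLPs into a single FFN by averaging each parameter."""
    return {name: np.mean([e[name] for e in experts], axis=0)
            for name in experts[0]}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    experts = [make_expert(rng=rng) for _ in range(4)]
    ffn = average_experts(experts)           # same shapes as a single expert
    print(ffn["w1"].shape, ffn["w2"].shape)  # (8, 16) (16, 8)
```

The averaged FFN has exactly the shape of one expert, which is what lets the model be served as a plain ViT at inference time with no extra cost.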
This list is automatically generated from the titles and abstracts of the papers on this site.