DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Foundation Models
- URL: http://arxiv.org/abs/2506.10390v3
- Date: Mon, 29 Sep 2025 09:14:20 GMT
- Title: DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Foundation Models
- Authors: Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin
- Abstract summary: We introduce DART, a fully differentiable Dynamic Adaptive Region Tokenizer. DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes. The impact of this approach is profound: a DART-equipped DeiT-Small (22M parameters) matches the performance of a DeiT-Base (86M) with nearly double the inference speed.
- Score: 45.12546316524245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The content-agnostic, fixed-grid tokenizers used by standard large-scale vision models like Vision Transformer (ViT) and Vision Mamba (Vim) represent a fundamental performance bottleneck, creating a trade-off between capturing fine-grained detail and suffering from redundant computation. To resolve this dilemma, we introduce DART, a fully differentiable Dynamic Adaptive Region Tokenizer. DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes, intelligently allocating a higher token density to information-rich regions. The impact of this approach is profound: it unlocks a more intelligent scaling paradigm, where a DART-equipped DeiT-Small (22M parameters) matches the performance of a DeiT-Base (86M) with nearly double the inference speed by efficiently capturing high-resolution details in key regions. Furthermore, the principle of adaptive tokenization proves its generality with clear benefits in dense prediction and spatiotemporal video tasks. We argue that by resolving the tokenizer bottleneck at its source, adaptive tokenization is a key component for building the next generation of more efficient and capable foundation models for multimodal AI, robotics, and content generation. Code is available at https://github.com/HCPLab-SYSU/DART.
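The abstract describes the core mechanism (learnable region scores combined with quantile-based partitioning) but not its exact implementation. The sketch below is a minimal, illustrative reading of that idea under stated assumptions: a separable row/column partition in which patch boundaries are placed at equal quantiles of cumulative saliency mass, and each variable-size cell is resampled to a fixed token resolution. The helper names (quantile_boundaries, adaptive_tokenize), the per-axis treatment, and the 16x16 resampling are hypothetical details, not taken from the paper; DART itself keeps the partition differentiable, whereas this sketch uses hard boundary indices for readability.

```python
# Minimal, illustrative sketch of quantile-based adaptive region tokenization.
# Assumptions not taken from the paper: rows and columns are partitioned
# independently, the saliency map is given, and every cell is resized to 16x16.
import torch
import torch.nn.functional as F


def quantile_boundaries(scores_1d: torch.Tensor, num_regions: int) -> list:
    """Place boundaries so each of `num_regions` spans holds ~equal score mass."""
    scores_1d = scores_1d.clamp_min(1e-6)
    cdf = torch.cumsum(scores_1d, dim=0)
    cdf = cdf / cdf[-1]                                    # normalized CDF in [0, 1]
    targets = torch.linspace(0.0, 1.0, num_regions + 1, device=scores_1d.device)
    idx = torch.searchsorted(cdf, targets.clamp(max=1.0 - 1e-6)).tolist()
    idx[0], idx[-1] = 0, scores_1d.numel()                 # pin the outer boundaries
    for k in range(1, num_regions):                        # keep every span >= 1 px
        idx[k] = max(idx[k], idx[k - 1] + 1)
    return idx


def adaptive_tokenize(img: torch.Tensor, score_map: torch.Tensor,
                      grid: int = 14, token_px: int = 16) -> torch.Tensor:
    """img: (C, H, W); score_map: (H, W) non-negative saliency (in DART this would
    come from a learnable scoring network). Returns (grid*grid, C, token_px, token_px)."""
    row_b = quantile_boundaries(score_map.sum(dim=1), grid)   # boundaries along H
    col_b = quantile_boundaries(score_map.sum(dim=0), grid)   # boundaries along W
    tokens = []
    for i in range(grid):
        for j in range(grid):
            cell = img[:, row_b[i]:row_b[i + 1], col_b[j]:col_b[j + 1]]
            # Variable-size cell -> fixed-size token: information-rich regions end up
            # covered by many small cells, i.e. a higher effective token density.
            tokens.append(F.interpolate(cell[None], size=(token_px, token_px),
                                        mode="bilinear", align_corners=False)[0])
    return torch.stack(tokens)
```

With a 14x14 grid this yields the same 196 tokens as a standard ViT-B/16 tokenizer on 224x224 inputs, so a scheme of this kind could in principle sit in front of an unmodified backbone; what changes is where the spatial resolution is spent.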
Related papers
- DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning [6.3691159627915015]
We introduce DART, a lightweight, training-free method that performs on-the-fly context-based pruning. DART monitors shifts in distributions to infer context changes, dynamically updating neuron-level masks to retain salient parameters. It achieves accuracy gains of up to 14.5% on LLAMA-3.1-8B at 70% FFN sparsity and 3x better ROUGE-L scores relative to static-masked pruning.
arXiv Detail & Related papers (2026-01-30T06:48:16Z) - Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting [15.751224470424786]
Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. We propose the Granularity-driven Vision Transformer (Grc-ViT), a dynamic coarse-to-fine framework that adaptively adjusts visual granularity based on image complexity. Two learnable parameters, alpha and beta, are optimized end-to-end to balance global reasoning and local perception.
arXiv Detail & Related papers (2025-11-24T11:55:22Z) - DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image. Vision-Language Pre-training models offer a strong open-vocabulary foundation but struggle with fine-grained localization under weak supervision. We propose the Dual Adaptive Refinement Transfer (DART) framework to overcome these limitations.
arXiv Detail & Related papers (2025-08-07T17:22:33Z) - DSFormer: A Dual-Scale Cross-Learning Transformer for Visual Place Recognition [16.386674597850778]
We propose a novel framework that integrates Dual-Scale-Former (DSFormer), a Transformer-based cross-learning module, with an innovative block clustering strategy. Our approach achieves state-of-the-art performance across most benchmark datasets.
arXiv Detail & Related papers (2025-07-24T14:29:30Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - Transformer-Based Dual-Optical Attention Fusion Crowd Head Point Counting and Localization Network [9.214772627896156]
The model designs a dual-optical attention fusion module (DAFP) by introducing complementary information from infrared images. The proposed method outperforms existing techniques, especially in challenging dense low-light scenes.
arXiv Detail & Related papers (2025-05-11T10:55:14Z) - DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs). Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence. (A generic token-merging sketch in this spirit is given after this list.)
arXiv Detail & Related papers (2025-04-23T18:38:18Z) - ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z) - V2M: Visual 2-Dimensional Mamba for Image Representation Learning [68.51380287151927]
Mamba has garnered widespread attention due to its flexible design and efficient hardware performance when processing 1D sequences.
Recent studies have attempted to apply Mamba to the visual domain by flattening 2D images into patches and then regarding them as a 1D sequence.
We propose a Visual 2-Dimensional Mamba model as a complete solution, which directly processes image tokens in the 2D space.
arXiv Detail & Related papers (2024-10-14T11:11:06Z) - Nested-TNT: Hierarchical Vision Transformers with Multi-Scale Feature Processing [7.202931445597172]
The Transformer has been applied in the field of computer vision due to its excellent performance in natural language processing.
In this paper, we introduce the nested algorithm and apply the Nested-TNT to image classification tasks.
The experiments confirm that the proposed model achieves better classification performance than ViT and TNT, exceeding them by 2.25% and 1.1% on CIFAR10 and by 2.78% and 0.25% on FLOWERS102, respectively.
arXiv Detail & Related papers (2024-04-20T17:56:14Z) - Leveraging Swin Transformer for Local-to-Global Weakly Supervised
Semantic Segmentation [12.103012959947055]
This work explores the use of Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs.
SWTformer-V1 achieves 0.98% higher localization accuracy (mAP), outperforming state-of-the-art models.
SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information.
arXiv Detail & Related papers (2024-01-31T13:41:17Z) - DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present the Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z) - Laplacian-Former: Overcoming the Limitations of Vision Transformers in
Local Texture Detection [3.784298636620067]
Vision Transformer (ViT) models have demonstrated a breakthrough in a wide range of computer vision tasks.
These models struggle to capture high-frequency components of images, which can limit their ability to detect local textures and edge information.
We propose a new technique, Laplacian-Former, that enhances the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid.
arXiv Detail & Related papers (2023-08-31T19:56:14Z) - Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - TVConv: Efficient Translation Variant Convolution for Layout-aware
Visual Processing [10.996162201540695]
We develop efficient translation variant convolution (TVConv) for layout-aware visual processing.
TVConv significantly improves the efficiency of the convolution and can be readily plugged into various network architectures.
arXiv Detail & Related papers (2022-03-20T08:29:06Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% of the FLOPs of DeiT-B while simultaneously gaining 0.6% top-1 accuracy.
arXiv Detail & Related papers (2021-11-23T11:35:54Z) - RAMS-Trans: Recurrent Attention Multi-scale Transformer forFine-grained
Image Recognition [26.090419694326823]
Localization and amplification of region attention is an important factor, which has been explored extensively by convolutional neural network (CNN)-based approaches.
We propose the recurrent attention multi-scale transformer (RAMS-Trans) which uses the transformer's self-attention to learn discriminative region attention.
arXiv Detail & Related papers (2021-07-17T06:22:20Z) - Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply an attention mechanism to the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z) - Temporal Distinct Representation Learning for Action Recognition [139.93983070642412]
A Two-Dimensional Convolutional Neural Network (2D CNN) is used to characterize videos.
Different frames of a video share the same 2D CNN kernels, which may result in repeated and redundant information utilization.
We propose a sequential channel filtering mechanism to excite the discriminative channels of features from different frames step by step, and thus avoid repeated information extraction.
Our method is evaluated on the benchmark temporal reasoning datasets Something-Something V1 and V2, achieving visible improvements of 2.4% and 1.3% over the best competitor, respectively.
arXiv Detail & Related papers (2020-07-15T11:30:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.