BATR-FST: Bi-Level Adaptive Token Refinement for Few-Shot Transformers
- URL: http://arxiv.org/abs/2509.12768v1
- Date: Tue, 16 Sep 2025 07:33:21 GMT
- Title: BATR-FST: Bi-Level Adaptive Token Refinement for Few-Shot Transformers
- Authors: Mohammed Al-Habib, Zuping Zhang, Abdulrahman Noman
- Abstract summary: We propose Bi-Level Adaptive Token Refinement for Few-Shot Transformers (BATR-FST). BATR-FST progressively improves token representations and maintains a robust inductive bias for few-shot classification. It achieves superior results in both 1-shot and 5-shot scenarios, improving transformer-based few-shot classification.
- Score: 2.5680214354539803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have shown significant promise in computer vision applications. However, their few-shot learning performance is limited by difficulty in refining token-level interactions, coping with scarce training data, and developing a strong inductive bias. Existing methods often depend on inflexible token matching or basic similarity measures, which limit the effective incorporation of global context and localized feature refinement. To address these challenges, we propose Bi-Level Adaptive Token Refinement for Few-Shot Transformers (BATR-FST), a two-stage approach that progressively improves token representations while maintaining a robust inductive bias for few-shot classification. During the pre-training phase, Masked Image Modeling (MIM) equips ViTs with transferable patch-level representations by reconstructing masked image regions, establishing a robust basis for subsequent adaptation. In the meta-fine-tuning phase, BATR-FST incorporates a Bi-Level Adaptive Token Refinement module that utilizes Token Clustering to capture localized interactions, Uncertainty-Aware Token Weighting to prioritize dependable features, and a Bi-Level Attention mechanism to balance intra-cluster and inter-cluster relationships, thereby facilitating thorough token refinement. Furthermore, Graph Token Propagation enforces semantic consistency between support and query instances, while a Class Separation Penalty preserves distinct class boundaries, enhancing discriminative capability. Extensive experiments on three benchmark few-shot datasets demonstrate that BATR-FST achieves superior results in both 1-shot and 5-shot scenarios, advancing transformer-based few-shot classification.
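As a rough illustration of the meta-fine-tuning stage described above, the PyTorch sketch below combines the three named ingredients: Token Clustering via a soft centroid assignment, Uncertainty-Aware Token Weighting via a learned per-token reliability score, and Bi-Level Attention that first refines tokens with a same-cluster attention bias and then mixes pooled cluster summaries. The module name, tensor shapes, clustering scheme, and uncertainty proxy are all assumptions made for illustration; the paper's exact formulation is not reproduced here.

```python
# Minimal sketch of a Bi-Level Adaptive Token Refinement block (assumed
# interface): input is ViT patch tokens of shape (batch, tokens, dim).
import torch
import torch.nn as nn


class BiLevelTokenRefinement(nn.Module):  # hypothetical name
    def __init__(self, dim: int, num_clusters: int = 8, num_heads: int = 4):
        super().__init__()
        self.num_clusters = num_clusters
        # Intra-cluster refinement and inter-cluster mixing as two attentions.
        self.intra_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.inter_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Per-token reliability score in (0, 1) as an uncertainty proxy.
        self.reliability = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = tokens.shape
        # --- Token Clustering: soft assignment to centroids seeded from
        # evenly spaced tokens (a one-step stand-in for full clustering). ---
        step = max(n // self.num_clusters, 1)
        centroids = tokens[:, ::step, :][:, : self.num_clusters]
        assign = (torch.einsum("bnd,bkd->bnk", tokens, centroids) / d**0.5).softmax(-1)

        # --- Uncertainty-Aware Token Weighting: damp unreliable tokens. ---
        weighted = tokens * self.reliability(tokens)

        # --- Bi-Level Attention, level 1 (intra-cluster): additive bias so
        # same-cluster token pairs attend more strongly to each other. ---
        bias = torch.einsum("bnk,bmk->bnm", assign, assign).clamp(min=1e-6).log()
        bias = bias.repeat_interleave(self.intra_attn.num_heads, dim=0)
        refined, _ = self.intra_attn(weighted, weighted, weighted, attn_mask=bias)

        # --- Level 2 (inter-cluster): pool tokens into cluster summaries,
        # let summaries interact, then broadcast back through assignments. ---
        summaries = torch.einsum("bnk,bnd->bkd", assign, weighted)
        summaries = summaries / assign.sum(1).unsqueeze(-1).clamp(min=1e-6)
        mixed, _ = self.inter_attn(summaries, summaries, summaries)
        broadcast = torch.einsum("bnk,bkd->bnd", assign, mixed)

        return tokens + refined + broadcast  # residual token refinement


# Dummy usage: 2 images, 196 patches, 384-dim embeddings (DeiT-S-like).
out = BiLevelTokenRefinement(dim=384)(torch.randn(2, 196, 384))
print(out.shape)  # torch.Size([2, 196, 384])
```

In a full pipeline, a block like this would sit on top of MIM-pre-trained patch tokens, with Graph Token Propagation and the Class Separation Penalty applied afterward to the refined support and query embeddings.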
Related papers
- DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image. Vision-Language Pre-training models offer a strong open-vocabulary foundation, but struggle with fine-grained localization under weak supervision. We propose the Dual Adaptive Refinement Transfer (DART) framework to overcome these limitations.
arXiv Detail & Related papers (2025-08-07T17:22:33Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - Is Attention Required for Transformer Inference? Explore Function-preserving Attention Replacement [13.38679135071682]
We propose a Function-preserving Attention Replacement (FAR) framework that replaces all attention blocks in pretrained transformers with learnable sequence-to-sequence modules. We validate FAR on the DeiT vision transformer family and demonstrate that it matches the accuracy of the original models on ImageNet and multiple downstream tasks with reduced parameters and latency.
arXiv Detail & Related papers (2025-05-24T02:23:46Z) - Prior2Former -- Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation [74.55677741919035]
We propose Prior2Former (P2F), the first approach for segmentation vision transformers rooted in evidential learning. P2F extends the mask vision transformer architecture by incorporating a Beta prior for computing model uncertainty in pixel-wise binary mask assignments. Unlike most segmentation models addressing unknown classes, P2F operates without access to OOD data samples or contrastive training on void (i.e., unlabeled) classes.
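For readers unfamiliar with the evidential setup this summary refers to, the snippet below sketches Beta-prior uncertainty for a binary mask in the standard subjective-logic formulation (uncertainty = K/S with K = 2 classes). The function name and the evidence parameterization are assumptions for illustration; P2F's exact design may differ.

```python
# Illustrative Beta-prior evidential uncertainty for a binary mask.
# `evidence_pos` / `evidence_neg` are assumed non-negative per-pixel outputs.
import torch

def beta_mask_uncertainty(evidence_pos: torch.Tensor, evidence_neg: torch.Tensor):
    alpha = evidence_pos + 1.0          # Beta concentration for "mask"
    beta = evidence_neg + 1.0           # Beta concentration for "not mask"
    prob = alpha / (alpha + beta)       # expected foreground probability
    # Low total concentration S = alpha + beta means high uncertainty.
    uncertainty = 2.0 / (alpha + beta)  # subjective-logic K/S with K = 2
    return prob, uncertainty

p, u = beta_mask_uncertainty(torch.rand(1, 64, 64) * 5, torch.rand(1, 64, 64) * 5)
```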
arXiv Detail & Related papers (2025-04-07T08:53:14Z) - BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNNs). We propose BHViT, a binarization-friendly hybrid ViT architecture, and its fully binarized model, guided by three key observations. Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z) - Enhancing cross-domain detection: adaptive class-aware contrastive transformer [15.666766743738531]
Insufficient labels in the target domain exacerbate issues of class imbalance and model performance degradation.
We propose a class-aware cross-domain detection transformer based on adversarial learning and the mean-teacher framework.
arXiv Detail & Related papers (2024-01-24T07:11:05Z) - Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer [54.32283739486781]
We present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm.
FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation.
arXiv Detail & Related papers (2023-09-20T06:51:11Z) - Fourier Test-time Adaptation with Multi-level Consistency for Robust Classification [10.291631977766672]
We propose a novel approach called Fourier Test-time Adaptation (FTTA) to integrate input and model tuning.
FTTA builds a reliable multi-level consistency measurement of paired inputs to achieve self-supervision of its predictions.
It was extensively validated on three large classification datasets with different modalities and organs.
arXiv Detail & Related papers (2023-06-05T02:29:38Z) - ViT-Calibrator: Decision Stream Calibration for Vision Transformer [49.60474757318486]
We propose a new paradigm dubbed Decision Stream that boosts the performance of general Vision Transformers.
We shed light on the information propagation mechanism in the learning procedure by exploring the correlation between different tokens and the relevance coefficient of multiple dimensions.
arXiv Detail & Related papers (2023-04-10T02:40:24Z) - FER-former: Multi-modal Transformer for Facial Expression Recognition [14.219492977523682]
This paper proposes FER-former, a multifarious supervision-steering Transformer for Facial Expression Recognition.
Our approach features multi-granularity embedding integration, hybrid self-attention scheme, and heterogeneous domain-steering supervision.
Experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over existing state-of-the-art methods.
arXiv Detail & Related papers (2023-03-23T02:29:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.