Related papers: Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet

URL: http://arxiv.org/abs/2506.02671v1
Date: Tue, 03 Jun 2025 09:16:51 GMT
Title: Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet
Authors: Xiao Chen, Jiazhen Huang, Qinting Jiang, Fanding Huang, Xianghua Fu, Jingyan Jiang, Zhi Wang
Abstract summary: Test-time adaptation (TTA) has emerged as a critical technique for enhancing the generalization capability of vision-language models (VLMs) during inference. We introduce SAIL, a novel adapter-based TTA framework that leverages a lightweight, learnable AdaptNet to enable efficient and scalable model adaptation.
Score: 5.977269026037707
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Test-time adaptation (TTA) has emerged as a critical technique for enhancing the generalization capability of vision-language models (VLMs) during inference. However, existing approaches often incur substantial computational costs and exhibit poor scalability, primarily due to sample-wise adaptation granularity and reliance on costly auxiliary designs such as data augmentation. To address these limitations, we introduce SAIL (Small Aid, Big Leap), a novel adapter-based TTA framework that leverages a lightweight, learnable AdaptNet to enable efficient and scalable model adaptation. As SAIL's core, a frozen pre-trained VLM collaborates with AdaptNet through a confidence-based interpolation weight, generating robust predictions during inference. These predictions serve as self-supervised targets to align AdaptNet's outputs through efficient batch-wise processing, dramatically reducing computational costs without modifying the VLM or requiring memory caches. To mitigate catastrophic forgetting during continual adaptation, we propose a gradient-aware reset strategy driven by a gradient drift indicator (GDI), which dynamically detects domain transitions and strategically resets AdaptNet for stable adaptation. Extensive experiments across diverse benchmarks on two scenarios demonstrate that SAIL achieves state-of-the-art performance while maintaining low computational costs. These results highlight SAIL's effectiveness, efficiency and scalability for real-world deployment. The code will be released upon acceptance.
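
The abstract describes the frozen VLM and AdaptNet being combined through a confidence-based interpolation weight, with the combined prediction then used as a self-supervised target for AdaptNet over a whole batch. The abstract does not give the formulas, so the following is only a minimal NumPy sketch under stated assumptions: the weight is taken to be the VLM's share of the two models' max-softmax confidences, and the alignment objective is taken to be a batch-mean cross-entropy. The function names (confidence_weight, combine, alignment_loss) are hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_weight(vlm_probs, adapt_probs):
    # ASSUMPTION: confidence = max class probability; the weight is the
    # VLM's share of the total confidence. The paper's rule may differ.
    c_vlm = vlm_probs.max(axis=-1, keepdims=True)
    c_adapt = adapt_probs.max(axis=-1, keepdims=True)
    return c_vlm / (c_vlm + c_adapt)

def combine(vlm_logits, adapt_logits):
    # Per-sample interpolation of the two predictive distributions.
    p_vlm = softmax(vlm_logits)
    p_adapt = softmax(adapt_logits)
    w = confidence_weight(p_vlm, p_adapt)
    return w * p_vlm + (1.0 - w) * p_adapt

def alignment_loss(target_probs, adapt_logits):
    # ASSUMPTION: batch-mean cross-entropy between the combined targets
    # (treated as constants) and the adapter's own prediction.
    log_p = np.log(softmax(adapt_logits) + 1e-12)
    return float(-(target_probs * log_p).sum(axis=-1).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vlm_logits = rng.normal(size=(8, 10))    # stand-in for frozen VLM outputs
    adapt_logits = rng.normal(size=(8, 10))  # stand-in for adapter outputs
    targets = combine(vlm_logits, adapt_logits)
    print(targets.shape, alignment_loss(targets, adapt_logits))
```

In this sketch the whole batch is processed in one pass and only the adapter's logits would receive gradients, which mirrors the batch-wise, VLM-frozen setup the abstract describes. The gradient-aware reset driven by the GDI is not sketched here because the abstract does not say how the indicator is computed.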

Related papers

LQA: A Lightweight Quantized-Adaptive Framework for Vision-Language Models on the Edge [12.772499009055194]
We propose a lightweight, quantized-adaptive framework for Vision-Language Models (VLMs). We introduce Selective Hybrid Quantization (SHQ) and a quantized, gradient-free adaptation mechanism to enable robust and efficient VLM deployment on resource-constrained hardware. Experiments show that LQA improves overall adaptation performance by 4.5%, uses less memory, and significantly outperforms gradient-based TTA methods.
arXiv Detail & Related papers (2026-02-08T07:37:37Z)
AdaptFly: Prompt-Guided Adaptation of Foundation Models for Low-Altitude UAV Networks [10.80018338292861]
Low-altitude Unmanned Aerial Vehicle (UAV) networks rely on robust semantic segmentation as a foundational enabler for distributed sensing-communication-control co-design. We propose AdaptFly, a prompt-guided test-time adaptation framework that adjusts segmentation models without weight updates. Experiments on UAVid and VDD benchmarks, along with real-world UAV deployments under diverse weather conditions, demonstrate that AdaptFly significantly improves segmentation accuracy and robustness.
arXiv Detail & Related papers (2025-11-13T00:20:37Z)
Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization [61.55616421408666]
Low-Altitude Economy Networks (LAENets) have enabled a variety of applications, including aerial surveillance, environmental sensing, and semantic data collection. Onboard vision-language models (VLMs) offer real-time inference but are limited by onboard resources and dynamic network conditions. We propose a UAV-enabled LAENet system that improves communication efficiency under dynamic LAENet conditions.
arXiv Detail & Related papers (2025-10-11T05:11:21Z)
Predictive Coding-based Deep Neural Network Fine-tuning for Computationally Efficient Domain Adaptation [5.013248430919224]
We propose a hybrid training methodology that enables efficient on-device domain adaptation. The method begins with a deep neural network trained offline using Backpropagation to achieve high initial performance. Predictive Coding is employed for online adaptation, allowing the model to recover accuracy lost due to shifts in the input data distribution.
arXiv Detail & Related papers (2025-09-24T16:03:27Z)
Optimal Batch-Size Control for Low-Latency Federated Learning with Device Heterogeneity [24.47280082248569]
Federated learning (FL) has emerged as a popular approach for collaborative machine learning in sixth-generation (6G) networks. The deployment of FL algorithms is expected to empower a wide range of Internet-of-Things (IoT) applications, e.g., autonomous driving, augmented reality, and healthcare. We propose a novel C$^2$-aware framework for optimal batch-size control that minimizes end-to-end (E2E) learning latency while ensuring convergence.
arXiv Detail & Related papers (2025-07-21T13:24:38Z)
Memory Efficient Transformer Adapter for Dense Predictions [42.413108132475855]
We propose META, a memory-efficient ViT adapter that can improve the model's memory efficiency and decrease memory time consumption. Within the proposed block, the cross-shaped self-attention is employed to reduce the model's frequent reshaping operations. META substantially enhances the predicted quality, while achieving a new state-of-the-art accuracy-efficiency trade-off.
arXiv Detail & Related papers (2025-02-04T03:19:33Z)
Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves [123.07450481623124]
We propose Skip Tuning as a novel paradigm for adapting vision-language models to downstream tasks. Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules.
arXiv Detail & Related papers (2024-12-16T07:33:23Z)
Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge. Existing methods struggle to balance high model performance with low resource consumption. We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
EUDA: An Efficient Unsupervised Domain Adaptation via Self-Supervised Vision Transformer [21.59850502993888]
Unsupervised domain adaptation (UDA) aims to mitigate the domain shift issue, where the distribution of training (source) data differs from that of testing (target) data. Many models have been developed to tackle this problem, and recently vision transformers (ViTs) have shown promising results. This paper introduces an efficient model that reduces trainable parameters and allows for adjustable complexity.
arXiv Detail & Related papers (2024-07-31T03:29:28Z)
Test-Time Model Adaptation with Only Forward Passes [68.11784295706995]
Test-time adaptation has proven effective in adapting a given trained model to unseen test samples with potential distribution shifts. We propose a test-time Forward-Optimization Adaptation (FOA) method. FOA runs on quantized 8-bit ViT, outperforms gradient-based TENT on full-precision 32-bit ViT, and achieves an up to 24-fold memory reduction on ImageNet-C.
arXiv Detail & Related papers (2024-04-02T05:34:33Z)
Efficient Adaptation of Large Vision Transformer via Adapter Re-Composing [8.88477151877883]
High-capacity pre-trained models have revolutionized problem-solving in computer vision. We propose a novel Adapter Re-Composing (ARC) strategy that addresses efficient pre-trained model adaptation. Our approach considers the reusability of adaptation parameters and introduces a parameter-sharing scheme.
arXiv Detail & Related papers (2023-10-10T01:04:15Z)
Visual Prompt Tuning for Test-time Domain Adaptation [48.16620171809511]
We propose a simple recipe called data-efficient prompt tuning (DePT) with two key ingredients. We find such parameter-efficient finetuning can efficiently adapt the model representation to the target domain without overfitting to the noise in the learning objective. With much fewer parameters, DePT demonstrates not only state-of-the-art performance on major adaptation benchmarks, but also superior data efficiency.
arXiv Detail & Related papers (2022-10-10T16:45:13Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
Deep Adaptive Inference Networks for Single Image Super-Resolution [72.7304455761067]
Single image super-resolution (SISR) has witnessed tremendous progress in recent years owing to the deployment of deep convolutional neural networks (CNNs). In this paper, we take a step forward to address this issue by leveraging the adaptive inference networks for deep SISR (AdaDSR). Our AdaDSR involves an SISR model as backbone and a lightweight adapter module which takes image features and resource constraint as input and predicts a map of local network depth.
arXiv Detail & Related papers (2020-04-08T10:08:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.