Training-Free Test-Time Adaptation with Brownian Distance Covariance in Vision-Language Models
- URL: http://arxiv.org/abs/2601.23253v1
- Date: Fri, 30 Jan 2026 18:21:45 GMT
- Title: Training-Free Test-Time Adaptation with Brownian Distance Covariance in Vision-Language Models
- Authors: Yi Zhang, Chun-Wun Cheng, Angelica I. Aviles-Rivero, Zhihai He, Liang-Jie Zhang,
- Abstract summary: The paper proposes Training-free Test-Time Adaptation with Brownian Distance Covariance (TaTa). TaTa leverages Brownian Distance Covariance to dynamically adapt vision-language models to new domains without training or back-propagation. Experiments across diverse datasets show that TaTa significantly reduces computational cost while achieving state-of-the-art performance in domain and cross-dataset generalization.
- Score: 16.03043781097689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) suffer performance degradation under domain shift, limiting their real-world applicability. Existing test-time adaptation methods are computationally intensive, rely on back-propagation, and often focus on a single modality. To address these issues, we propose Training-free Test-Time Adaptation with Brownian Distance Covariance (TaTa). TaTa leverages Brownian Distance Covariance, a powerful statistical measure that captures both linear and nonlinear dependencies via pairwise distances, to dynamically adapt VLMs to new domains without training or back-propagation. This not only improves efficiency but also enhances stability by avoiding disruptive weight updates. TaTa further integrates attribute-enhanced prompting to improve vision-language inference with descriptive visual cues. Combined with dynamic clustering and pseudo-label refinement, it effectively recalibrates the model for novel visual contexts. Experiments across diverse datasets show that TaTa significantly reduces computational cost while achieving state-of-the-art performance in domain and cross-dataset generalization.
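The abstract's phrase "captures both linear and nonlinear dependencies via pairwise distances" can be illustrated with the standard sample estimator of distance covariance (Brownian distance covariance coincides with distance covariance). Double-center each variable's pairwise distance matrix, then average the elementwise product of the two centered matrices. This is a generic, minimal sketch, not TaTa's implementation. The function names are hypothetical, and how TaTa applies the statistic to image and text features is not described in this listing.

```python
import math
from typing import Sequence

Sample = Sequence[float]  # one observation (a feature vector)


def _double_centered_distances(points: Sequence[Sample]) -> list[list[float]]:
    """Pairwise Euclidean distance matrix with row, column and grand means removed."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # The matrix is symmetric, so column means equal row means.
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_covariance_sq(x: Sequence[Sample], y: Sequence[Sample]) -> float:
    """Squared sample distance covariance dCov_n^2(X, Y) (V-statistic form)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must hold the same number (>= 2) of paired samples")
    n = len(x)
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x: Sequence[Sample], y: Sequence[Sample]) -> float:
    """Sample distance correlation, a value in [0, 1]."""
    dxy = distance_covariance_sq(x, y)
    dxx = distance_covariance_sq(x, x)
    dyy = distance_covariance_sq(y, y)
    if dxx <= 0.0 or dyy <= 0.0:
        return 0.0  # a constant sample carries no dependence information
    # max() guards against tiny negative values from floating-point round-off.
    return math.sqrt(max(dxy, 0.0) / math.sqrt(dxx * dyy))


if __name__ == "__main__":
    xs = [[float(v)] for v in range(-3, 4)]
    squares = [[p[0] ** 2] for p in xs]        # nonlinear; Pearson r is 0 for these symmetric x
    affine = [[2.0 * p[0] + 1.0] for p in xs]  # purely linear relation
    print(distance_correlation(xs, squares))
    print(distance_correlation(xs, affine))
```

Run as a script, the x-versus-x² pair has zero Pearson correlation but a distance correlation of about 0.5. The affine pair scores 1. This contrast is the "nonlinear dependence" property the abstract refers to. The pure-Python O(n²) loops keep the example dependency-free. A real feature pipeline would vectorize them.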
Related papers
- Unifying Model-Free Efficiency and Model-Based Representations via Latent Dynamics [6.208369829942616]
We present Unified Latent Dynamics (ULD), a novel reinforcement learning algorithm.
ULD unifies the efficiency of model-free methods with the representational strengths of model-based approaches.
It is evaluated on 80 environments spanning Gym locomotion, DeepMind Control (proprioceptive and visual), and Atari.
Date: 2026-02-13T06:06:56Z
- Gated Differentiable Working Memory for Long-Context Language Modeling [80.27483324685434]
We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process.
Experiments on ZeroSCROLLS and LongBench v2 demonstrate that Gdwm achieves comparable or superior performance with 4x fewer gradient steps than uniform baselines.
Date: 2026-01-19T10:00:33Z
- Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data [89.96277093034547]
We introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization.
We show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training.
Date: 2025-12-29T12:35:51Z
- Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition [55.189113121465816]
We propose a novel correlation adaptation prompt network (CAPNET) for long-tailed multi-label visual recognition.
CAPNET explicitly models correlations from CLIP's textual encoder.
It improves generalization through test-time ensembling and realigns visual-textual modalities.
Date: 2025-11-25T18:57:28Z
- Efficient Test-Time Scaling for Small Vision-Language Models [14.654047034885288]
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models.
Existing test-time scaling methods are typically computationally demanding, contradicting the resource-efficient design goals of small models.
We propose two novel and efficient test-time scaling strategies that leverage model-internal features rather than external supervision.
Date: 2025-10-03T23:49:06Z
- ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection [28.75333303894706]
ToReMi is a novel framework that adjusts training sample weights according to their topical associations and observed learning patterns.
Our experiments reveal that ToReMi variants consistently achieve superior performance over conventional pre-training approaches.
Date: 2025-04-01T12:06:42Z
- Enhanced Online Test-time Adaptation with Feature-Weight Cosine Alignment [7.991720491452191]
Online Test-Time Adaptation (OTTA) has emerged as an effective strategy to handle distributional shifts.
This paper introduces a novel cosine alignment optimization approach with a dual-objective loss function.
Our method outperforms state-of-the-art techniques and sets a new benchmark on multiple datasets.
Date: 2024-05-12T05:57:37Z
- VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for finetuning pretrained vision models (PVMs).
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF needs up to 3.3x fewer training batches than full finetuning to reach the target performance.
Date: 2024-01-15T17:28:37Z
- BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning [26.75156572762166]
We introduce Brownian Distance Covariance (BDC) to the field of vision-language reasoning.
BDC can model all possible relations, providing a robust metric for measuring feature dependence.
We present a novel method called BDC-Adapter, which integrates BDC prototype similarity reasoning and multi-modal reasoning network prediction.
Date: 2023-09-03T19:45:02Z
- Revisiting Consistency Regularization for Semi-Supervised Learning [80.28461584135967]
We propose an improved consistency regularization framework built on a simple yet effective technique, FeatDistLoss.
Experimental results show that our model sets a new state of the art across various datasets and settings.
Date: 2021-12-10T20:46:13Z
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
Date: 2021-09-24T07:20:13Z