Related papers: Subspace Alignment for Vision-Language Model Test-time Adaptation

Subspace Alignment for Vision-Language Model Test-time Adaptation

URL: http://arxiv.org/abs/2601.08139v1
Date: Tue, 13 Jan 2026 02:02:41 GMT
Title: Subspace Alignment for Vision-Language Model Test-time Adaptation
Authors: Zhichen Zeng, Wenxuan Bao, Xiao Lin, Ruizhong Qiu, Tianxin Wei, Xuying Ning, Yuchen Yan, Chen Luo, Monica Xiao Cheng, Jingrui He, Hanghang Tong,
Abstract summary: Vision-language models (VLMs) are vulnerable to distribution shifts.<n>Existing test-time adaptation methods rely on zero-shot predictions as pseudo-labels for self-training.<n>We propose SubTTA, which aligns the semantic subspaces of both modalities to enhance zero-shot predictions.
Score: 82.83192844597593
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language models (VLMs), despite their extraordinary zero-shot capabilities, are vulnerable to distribution shifts. Test-time adaptation (TTA) emerges as a predominant strategy to adapt VLMs to unlabeled test data on the fly. However, existing TTA methods heavily rely on zero-shot predictions as pseudo-labels for self-training, which can be unreliable under distribution shifts and misguide adaptation due to two fundamental limitations. First (Modality Gap), distribution shifts induce gaps between visual and textual modalities, making cross-modal relations inaccurate. Second (Visual Nuisance), visual embeddings encode rich but task-irrelevant noise that often overwhelms task-specific semantics under distribution shifts. To address these limitations, we propose SubTTA, which aligns the semantic subspaces of both modalities to enhance zero-shot predictions to better guide the TTA process. To bridge the modality gap, SubTTA extracts the principal subspaces of both modalities and aligns the visual manifold to the textual semantic anchor by minimizing their chordal distance. To eliminate visual nuisance, SubTTA projects the aligned visual features onto the task-specific textual subspace, which filters out task-irrelevant noise by constraining visual embeddings within the valid semantic span, and standard TTA is further performed on the purified space to refine the decision boundaries. Extensive experiments on various benchmarks and VLM architectures demonstrate the effectiveness of SubTTA, yielding an average improvement of 2.24% over state-of-the-art TTA methods.

Related papers

Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose textbfTACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks.<n>Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it incurs significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z)
Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks [43.473390101413166]
Self-Supervised Learning for Vision Transformers (ViTs) has recently demonstrated considerable potential as a pre-training strategy for a variety of computer vision tasks.<n>This study aims to bridge the gap by systematically evaluating the use of unmodified features across image classification and segmentation tasks.
arXiv Detail & Related papers (2025-09-18T11:46:07Z)
SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR [58.31068047426522]
Test-Time Adaptation (TTA) aims to mitigate by adjusting models during inference.<n>Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction.<n>We propose SUTA-LM, a simple yet effective extension of SUTA, with language model rescoring.<n> Experiments on 18 diverse ASR datasets show that SUTA-LM achieves robust results across a wide range of domains.
arXiv Detail & Related papers (2025-06-10T02:50:20Z)
Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model [27.56988000960972]
We introduce a new framework based on a dual context of both domain-shared and class-specific contexts. Such dual prompt methods enhance the model's feature representation by joining implicit and explicit factors encoded in Large Language Models. We also formulate the Unbalanced Optimal Transport (UOT) theory to quantify the relationships between constructed prompts and visual tokens.
arXiv Detail & Related papers (2024-07-05T13:15:29Z)
ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for representation learning of increasing modalities. We show that ViT-Lens-2 can learn representations for 3D point cloud, depth, audio, tactile and EEG. By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z)
Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples. We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric. Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-10-04T07:54:40Z)
HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning [74.76431541169342]
Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones. We propose a novel hierarchical semantic-visual adaptation (HSVA) framework to align semantic and visual domains. Experiments on four benchmark datasets demonstrate HSVA achieves superior performance on both conventional and generalized ZSL.
arXiv Detail & Related papers (2021-09-30T14:27:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.