Related papers: Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models

Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models

URL: http://arxiv.org/abs/2503.18334v2
Date: Mon, 31 Mar 2025 10:28:04 GMT
Title: Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models
Authors: Haotian Zhai, Xinyu Chen, Can Zhang, Tianming Sha, Ruirui Li,
Abstract summary: Test-time adaptation (TTA) of visual language models has attracted significant attention as a solution to the performance degradation caused by distribution shifts in downstream tasks.<n>We introduce a comprehensive and reliable caching mechanism and propose a novel zero-shot TTA method called "Cache, Residual, Gaussian" (CRG)<n> Experimental results on 13 benchmarks demonstrate that CRG outperforms state-of-the-art TTA methods, showcasing exceptional robustness and adaptability.
Score: 13.157596316463621
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Test-time adaptation (TTA) of visual language models has recently attracted significant attention as a solution to the performance degradation caused by distribution shifts in downstream tasks. However, existing cache-based TTA methods have certain limitations. They mainly rely on the accuracy of cached feature labels, and the presence of noisy pseudo-labels can cause these features to deviate from their true distribution. This makes cache retrieval methods based on similarity matching highly sensitive to outliers or extreme samples. Moreover, current methods lack effective mechanisms to model class distributions, which limits their ability to fully exploit the potential of cached information. To address these challenges, we introduce a comprehensive and reliable caching mechanism and propose a novel zero-shot TTA method called "Cache, Residual, Gaussian" (CRG). This method not only employs learnable residual parameters to better align positive and negative visual prototypes with text prototypes, thereby optimizing the quality of cached features, but also incorporates Gaussian Discriminant Analysis (GDA) to dynamically model intra-class feature distributions, further mitigating the impact of noisy features. Experimental results on 13 benchmarks demonstrate that CRG outperforms state-of-the-art TTA methods, showcasing exceptional robustness and adaptability.

Related papers

Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models [41.11005178050448]
ProfilingDiT is a novel adaptive caching strategy that explicitly disentangles foreground and background-focused blocks. Our framework achieves significant acceleration while maintaining visual fidelity across comprehensive quality metrics.
arXiv Detail & Related papers (2025-04-04T03:30:15Z)
Debiased Prompt Tuning in Vision-Language Model without Annotations [14.811475313694041]
Vision-Language Models (VLMs) may suffer from the problem of spurious correlations.<n>By leveraging pseudo-spurious attribute annotations, we propose a method to automatically adjust the training weights of different groups.<n>Our approach efficiently improves the worst-group accuracy on CelebA, Waterbirds, and MetaShift datasets.
arXiv Detail & Related papers (2025-03-11T12:24:54Z)
Robust Confinement State Classification with Uncertainty Quantification through Ensembled Data-Driven Methods [39.27649013012046]
We develop methods for confinement state classification with uncertainty quantification and model robustness.<n>We focus on off-line analysis for TCV discharges, distinguishing L-mode, H-mode, and an in-between dithering phase (D)<n>A dataset of 302 TCV discharges is fully labeled, and will be publicly released.
arXiv Detail & Related papers (2025-02-24T18:25:22Z)
Noisy Test-Time Adaptation in Vision-Language Models [73.14136220844156]
Test-time adaptation (TTA) aims to address distribution shifts between source and target data by relying solely on target data during testing. This paper introduces Zero-Shot Noisy TTA (ZS-NTTA), focusing on adapting the model to target data with noisy samples during test-time in a zero-shot manner. We introduce the Adaptive Noise Detector (AdaND), which utilizes the frozen model's outputs as pseudo-labels to train a noise detector.
arXiv Detail & Related papers (2025-02-20T14:37:53Z)
Breaking Determinism: Fuzzy Modeling of Sequential Recommendation Using Discrete State Space Diffusion Model [66.91323540178739]
Sequential recommendation (SR) aims to predict items that users may be interested in based on their historical behavior. We revisit SR from a novel information-theoretic perspective and find that sequential modeling methods fail to adequately capture randomness and unpredictability of user behavior. Inspired by fuzzy information processing theory, this paper introduces the fuzzy sets of interaction sequences to overcome the limitations and better capture the evolution of users' real interests.
arXiv Detail & Related papers (2024-10-31T14:52:01Z)
Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model [86.9619638550683]
Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired data.<n>However, these models display significant limitations when applied to downstream tasks, such as fine-grained image classification, as a result of decision shortcuts''
arXiv Detail & Related papers (2024-03-01T09:01:53Z)
ChiroDiff: Modelling chirographic data with Diffusion Models [132.5223191478268]
We introduce a powerful model-class namely "Denoising Diffusion Probabilistic Models" or DDPMs for chirographic data. Our model named "ChiroDiff", being non-autoregressive, learns to capture holistic concepts and therefore remains resilient to higher temporal sampling rate.
arXiv Detail & Related papers (2023-04-07T15:17:48Z)
Post-Processing Temporal Action Detection [134.26292288193298]
Temporal Action Detection (TAD) methods typically take a pre-processing step in converting an input varying-length video into a fixed-length snippet representation sequence. This pre-processing step would temporally downsample the video, reducing the inference resolution and hampering the detection performance in the original temporal resolution. We introduce a novel model-agnostic post-processing method without model redesign and retraining.
arXiv Detail & Related papers (2022-11-27T19:50:37Z)
A Fair Loss Function for Network Pruning [70.35230425589592]
We introduce the performance weighted loss function, a simple modified cross-entropy loss function that can be used to limit the introduction of biases during pruning. Experiments using the CelebA, Fitzpatrick17k and CIFAR-10 datasets demonstrate that the proposed method is a simple and effective tool.
arXiv Detail & Related papers (2022-11-18T15:17:28Z)
Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods [24.190587751595455]
Weak supervision is a popular method for building machine learning models without relying on ground truth annotations. Existing approaches use latent variable estimation to model the noisy sources. We show that for a class of latent variable models highly applicable to weak supervision, we can find a closed-form solution to model parameters. We use this insight to build FlyingSquid, a weak supervision framework that runs orders of magnitude faster than previous weak supervision approaches.
arXiv Detail & Related papers (2020-02-27T07:51:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.