Related papers: Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models

URL: http://arxiv.org/abs/2405.01468v1
Date: Thu, 2 May 2024 16:59:05 GMT
Title: Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models
Authors: Yifei Ming, Yixuan Li
Abstract summary: We present a systematic study to understand the roles of key components in retrieval-augmented adaptation. We unveil new insights on uni-modal and cross-modal retrieval and highlight the critical role of logit ensemble for effective adaptation.
Score: 29.75562085178755
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pre-trained contrastive vision-language models have demonstrated remarkable performance across a wide range of tasks. However, they often struggle on fine-grained datasets with categories not adequately represented during pre-training, which makes adaptation necessary. Recent works have shown promising results by utilizing samples from web-scale databases for retrieval-augmented adaptation, especially in low-data regimes. Despite the empirical success, understanding how retrieval impacts the adaptation of vision-language models remains an open research question. In this work, we adopt a reflective perspective by presenting a systematic study to understand the roles of key components in retrieval-augmented adaptation. We unveil new insights on uni-modal and cross-modal retrieval and highlight the critical role of logit ensemble for effective adaptation. We further present theoretical underpinnings that directly support our empirical observations.
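
Illustrative sketch (not from the paper): the abstract names "logit ensemble" but does not spell out a formula, so the snippet below shows one common cache-style formulation as an assumption. Zero-shot logits (image-to-text cosine similarity) are added to logits built from retrieved, labeled feature vectors. The function names, the weights `alpha` and `gamma`, the sharpness `beta`, and the toy 2-D embeddings are all hypothetical; a real pipeline would take embeddings from a model such as CLIP.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def zero_shot_logits(image_feat, class_text_feats):
    # One logit per class: cosine similarity to that class's text embedding.
    return [dot(image_feat, t) for t in class_text_feats]

def retrieval_logits(image_feat, cache_feats, cache_labels, num_classes, beta=5.0):
    # Each retrieved sample votes for its label, weighted by
    # exp(-beta * (1 - cosine similarity)). This weighting is an assumption.
    logits = [0.0] * num_classes
    for feat, label in zip(cache_feats, cache_labels):
        logits[label] += math.exp(-beta * (1.0 - dot(image_feat, feat)))
    return logits

def ensemble_logits(zs, ret, alpha=1.0, gamma=1.0):
    # Weighted sum of the two logit vectors (the "ensemble" step).
    return [alpha * z + gamma * r for z, r in zip(zs, ret)]

def predict(logits):
    return max(range(len(logits)), key=lambda k: logits[k])

# Toy data: two classes, 2-D unit-norm embeddings (made up for illustration).
image = normalize([0.6, 0.8])
text_feats = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
cache_feats = [normalize([1.0, 0.1]), normalize([0.5, 0.9]), normalize([0.7, 0.7])]
cache_labels = [0, 1, 1]

zs = zero_shot_logits(image, text_feats)
ret = retrieval_logits(image, cache_feats, cache_labels, num_classes=2)
ens = ensemble_logits(zs, ret)
prediction = predict(ens)
```

Setting `gamma=0.0` recovers the zero-shot prediction alone, and `alpha=0.0` uses only the retrieved samples, which makes it easy to compare the three variants on the same inputs.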

Related papers

Understanding the Transfer Limits of Vision Foundation Models [38.99867932557529]
Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating generalization in a wide range of language tasks. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision-and-imaging tasks. Pretraining strategies like masked image reconstruction or contrastive learning shape representations for tasks such as recovery of generic visual patterns or global semantic structures. Our findings indicate that better alignment between pretraining and downstream tasks, measured by simple divergence metrics such as maximum-mean-discrepancy (MMD) between the same features before and after fine-tuning, correlates with greater performance improvements and
arXiv Detail & Related papers (2026-01-22T12:07:56Z)
Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency [3.5634988336513587]
We investigate how varying levels of query-context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but above a critical threshold it substantially improves test-time perplexity and accelerates model learning.
arXiv Detail & Related papers (2025-05-20T12:58:07Z)
Mechanistic Interpretability of GPT-like Models on Summarization Tasks [2.4022340214033915]
This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. By identifying specific layers and attention heads that undergo significant transformation, we locate the "summarization circuit" within the model architecture.
arXiv Detail & Related papers (2025-05-20T02:15:11Z)
Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification [34.37262622415682]
We propose a new adaptation framework called Data Adaptive Traceback. Specifically, we utilize a zero-shot-based method to extract the most downstream task-related subset of the pre-training data. We adopt a pseudo-label-based semi-supervised technique to reuse the pre-training images and a vision-language contrastive learning method to address the confirmation bias issue in semi-supervised learning.
arXiv Detail & Related papers (2024-07-11T18:01:58Z)
A Lost Opportunity for Vision-Language Models: A Comparative Study of Online Test-Time Adaptation for Vision-Language Models [3.0495235326282186]
In deep learning, maintaining robustness against distribution shifts is critical. This work explores a broad range of possibilities to adapt vision-language foundation models at test-time.
arXiv Detail & Related papers (2024-05-23T18:27:07Z)
Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing. Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Models (LLMs) rather than by the input image. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
Improving Reinforcement Learning Efficiency with Auxiliary Tasks in Non-Visual Environments: A Comparison [0.0]
This study compares common auxiliary tasks based on, to the best of our knowledge, the only decoupled representation learning method for low-dimensional non-visual observations. Our findings show that representation learning with auxiliary tasks only provides performance gains in sufficiently complex environments.
arXiv Detail & Related papers (2023-10-06T13:22:26Z)
ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP). ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective. We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats. We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes. We propose a novel search strategy based on greedy search to identify a near-optimal prompt for improving the performance of in-context learning.
arXiv Detail & Related papers (2023-03-23T12:28:25Z)
Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems. Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored. We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z)
Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge. We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target. The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well. We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS) that learns to generalize to unseen environments via encouraging test-time visual consistency.
arXiv Detail & Related papers (2022-09-10T19:04:40Z)
Which Mutual-Information Representation Learning Objectives are Sufficient for Control? [80.2534918595143]
Mutual information provides an appealing formalism for learning representations of data. This paper formalizes the sufficiency of a state representation for learning and representing the optimal policy. Surprisingly, we find that two of these objectives can yield insufficient representations given mild and common assumptions on the structure of the MDP.
arXiv Detail & Related papers (2021-06-14T10:12:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.