Active Learning via Vision-Language Model Adaptation with Open Data
- URL: http://arxiv.org/abs/2506.01724v1
- Date: Mon, 02 Jun 2025 14:30:04 GMT
- Title: Active Learning via Vision-Language Model Adaptation with Open Data
- Authors: Tong Wang, Jiaqi Wang, Shu Kong
- Abstract summary: Active Learning (AL) aims to reduce data labeling expense by strategically selecting the most informative data for labeling and model training. Recent AL methods have explored VLMs but have not leveraged publicly available open data, such as the VLM's pretraining data. In this work, we leverage such data by retrieving task-relevant examples to augment the task-specific labeled examples.
- Score: 33.33210375336842
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretrained on web-scale open data, VLMs offer powerful capabilities for solving downstream tasks after being adapted to task-specific labeled data. Yet, data labeling can be expensive and may demand domain expertise. Active Learning (AL) aims to reduce this expense by strategically selecting the most informative data for labeling and model training. Recent AL methods have explored VLMs but have not leveraged publicly available open data, such as the VLM's pretraining data. In this work, we leverage such data by retrieving task-relevant examples to augment the task-specific examples. As expected, incorporating them significantly improves AL. Given that our method exploits an open-source VLM and open data, we refer to it as Active Learning with Open Resources (ALOR). Additionally, most VLM-based AL methods use prompt tuning (PT) for model adaptation, likely due to its ability to directly utilize pretrained parameters and the assumption that doing so reduces the risk of overfitting to limited labeled data. We rigorously compare popular adaptation approaches, including linear probing (LP), finetuning (FT), and contrastive tuning (CT). We reveal two key findings: (1) all adaptation approaches benefit from incorporating retrieved data, and (2) CT resoundingly outperforms the other approaches across AL methods. Further analysis of the retrieved data reveals a naturally imbalanced distribution of task-relevant classes, exposing inherent biases within the VLM. This motivates our novel Tail First Sampling (TFS) strategy for AL, an embarrassingly simple yet effective method that prioritizes sampling data from underrepresented classes for labeling. Extensive experiments demonstrate that our final method, contrastively finetuning the VLM on both retrieved and TFS-selected labeled data, significantly outperforms existing methods.
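The abstract describes TFS only at a high level. A minimal sketch of the idea, assuming per-class counts come from the retrieved open data and each unlabeled candidate has a predicted class from the adapted VLM (all names here are illustrative, not from the paper):

```python
import numpy as np

def tail_first_sampling(pred_classes, class_counts, budget, rng=None):
    """Spend the labeling budget on the rarest classes first: each pick goes
    to the class with the lowest running count that still has candidates.

    pred_classes: predicted class id per unlabeled candidate (a proxy,
                  since true labels are unknown before annotation)
    class_counts: running per-class counts, e.g., from the retrieved data
    budget:       number of examples to select for labeling
    """
    rng = rng or np.random.default_rng()
    counts = np.asarray(class_counts, dtype=float).copy()
    preds = np.asarray(pred_classes)
    available = np.ones(len(preds), dtype=bool)
    selected = []
    for _ in range(budget):
        for c in np.argsort(counts):              # rarest class first
            pool = np.flatnonzero(available & (preds == c))
            if len(pool) > 0:
                i = int(rng.choice(pool))         # any tie-break rule works
                selected.append(i)
                available[i] = False
                counts[c] += 1                    # count it once labeled
                break
    return selected
```

Picking rarest-class-first directly counteracts the class imbalance the authors observe in the retrieved data.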
Related papers
- SLearnLLM: A Self-Learning Framework for Efficient Domain-Specific Adaptation of Large Language Models [7.44035983292392]
We propose a self-learning framework for large language models (LLMs) inspired by human learning patterns. The framework takes a supervised fine-tuning (SFT) dataset in a specific domain as input. We show that our method substantially reduces training time while achieving improvements comparable to those attained with full-dataset fine-tuning.
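The abstract does not spell out the filtering step; one plausible minimal sketch, assuming "self-learning" means fine-tuning only on pairs the model currently gets wrong (`model_answer` and `grader` are hypothetical interfaces):

```python
def filter_unknown(sft_dataset, model_answer, grader):
    """Keep only the SFT pairs the model currently answers incorrectly, so
    fine-tuning spends its budget on knowledge the model actually lacks.

    sft_dataset:  list of (question, gold_answer) pairs
    model_answer: fn(question) -> str, the LLM's current answer (assumed)
    grader:       fn(pred, gold) -> bool, True if they match (assumed)
    """
    return [(q, a) for q, a in sft_dataset
            if not grader(model_answer(q), a)]
```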
arXiv Detail & Related papers (2025-05-23T04:50:54Z) - Transferable text data distillation by trajectory matching [27.826518926355295]
The data distillation method aims to synthesize a small number of data samples that achieve the training effect of the full dataset. In this work, we propose a method that learns pseudo prompt data based on trajectory matching. Evaluations on two benchmarks, including the ARC-Easy and MMLU instruction tuning datasets, establish the superiority of our distillation approach over the SOTA data selection method LESS.
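For readers unfamiliar with trajectory matching, here is a generic PyTorch sketch of the objective; the paper applies it to pseudo prompt data for LLMs, which this sketch abstracts into an assumed functional forward pass `student_forward`, with expert checkpoints recorded from training on the full dataset:

```python
import torch
import torch.nn.functional as F

def trajectory_matching_loss(student_forward, syn_x, syn_y,
                             theta_start, theta_target,
                             inner_steps=5, lr_inner=0.01):
    """Train briefly on the synthetic data from an expert checkpoint, then
    measure how far the resulting parameters land from a later expert
    checkpoint, normalized by the length of the expert segment. Minimizing
    this w.r.t. syn_x makes the distilled data reproduce the expert's
    training trajectory."""
    theta = [p.detach().clone().requires_grad_(True) for p in theta_start]
    for _ in range(inner_steps):
        loss = F.cross_entropy(student_forward(theta, syn_x), syn_y)
        grads = torch.autograd.grad(loss, theta, create_graph=True)
        theta = [p - lr_inner * g for p, g in zip(theta, grads)]
    num = sum(((p - q) ** 2).sum() for p, q in zip(theta, theta_target))
    den = sum(((p - q) ** 2).sum() for p, q in zip(theta_start, theta_target))
    return num / den
```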
arXiv Detail & Related papers (2025-04-14T02:39:26Z) - Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration [54.8229698058649]
Unlabeled offline trajectory data can be leveraged to learn efficient exploration strategies. Our method SUPE consistently outperforms prior strategies across a suite of 42 long-horizon, sparse-reward tasks.
arXiv Detail & Related papers (2024-10-23T17:58:45Z) - Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification [34.37262622415682]
We propose a new adaptation framework called Data Adaptive Traceback.
Specifically, we utilize a zero-shot-based method to extract the most downstream task-related subset of the pre-training data.
We adopt a pseudo-label-based semi-supervised technique to reuse the pre-training images and a vision-language contrastive learning method to address the confirmation bias issue in semi-supervised learning.
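A sketch of the zero-shot extraction step, assuming CLIP-style image and class-prompt embeddings are already computed (function and variable names are illustrative, not the paper's):

```python
import numpy as np

def traceback_subset(image_embs, class_text_embs, top_k):
    """Zero-shot extraction of the most task-relevant pretraining images:
    score each image by its best cosine similarity to any class-prompt
    embedding, then keep the top_k highest-scoring images."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    scores = (img @ txt.T).max(axis=1)   # relevance = best class match
    return np.argsort(-scores)[:top_k]   # indices of images to trace back
```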
arXiv Detail & Related papers (2024-07-11T18:01:58Z) - Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs [61.04246774006429]
We introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements. Our findings show that instruction-tuned models can expose pre-training data as much as their base models, if not more, and that using instructions proposed by other LLMs can open a new avenue for automated attacks.
arXiv Detail & Related papers (2024-03-05T19:32:01Z) - Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection [59.11430077029321]
We introduce a novel dataset selection method, Self-Filter, for vision-language models (VLMs).
In the first stage, we devise a scoring network to evaluate the difficulty of training instructions, which is co-trained with the VLM.
In the second stage, we use the trained score net to measure the difficulty of each instruction, select the most challenging samples, and penalize similar samples to encourage diversity.
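The abstract leaves the diversity penalty unspecified; a minimal greedy sketch of the second stage, assuming a cosine-similarity penalty against already-selected samples:

```python
import numpy as np

def select_hard_diverse(difficulty, embeddings, k, sim_penalty=0.5):
    """Greedy second-stage selection: repeatedly take the sample with the
    highest (penalized) difficulty score, then down-weight remaining
    samples similar to it so the selected set stays diverse."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    score = np.asarray(difficulty, dtype=float).copy()
    chosen = []
    for _ in range(k):
        i = int(np.argmax(score))
        chosen.append(i)
        score[i] = -np.inf                            # never re-pick
        score -= sim_penalty * np.maximum(emb @ emb[i], 0.0)
    return chosen
```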
arXiv Detail & Related papers (2024-02-19T20:08:48Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
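A sketch of the Ask-LLM idea as described: a proxy LLM is asked whether an example is worth training on, and its probability of answering "yes" becomes the quality score (`llm_yes_prob` is a hypothetical callable standing in for any API that exposes token probabilities; the prompt wording is an assumption):

```python
def ask_llm_score(example_text, llm_yes_prob):
    """Ask-LLM-style quality score: prompt a proxy LLM to judge whether an
    example is worth training on, and use its probability of answering
    'yes' as the score."""
    prompt = (
        "Does the following text contain informative, well-written content "
        "that would be useful for training a language model? "
        "Answer yes or no.\n\n" + example_text
    )
    return llm_yes_prob(prompt)  # assumed: returns P('yes') from the API

def select_top_k(examples, llm_yes_prob, k):
    """Keep the k highest-scoring examples."""
    return sorted(examples, key=lambda t: ask_llm_score(t, llm_yes_prob),
                  reverse=True)[:k]
```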
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - LLMaAA: Making Large Language Models as Active Annotators [32.57011151031332]
We propose LLMaAA, which takes large language models as annotators and puts them into an active learning loop to determine what to annotate efficiently.
We conduct experiments and analysis on two classic NLP tasks: named entity recognition and relation extraction.
With LLMaAA, task-specific models trained on LLM-generated labels can outperform the teacher with only hundreds of annotated examples.
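A schematic of the loop the abstract describes, with every callable (`llm_annotate`, `train`, `acquire`) as an assumed interface rather than the paper's actual components:

```python
import random

def llmaaa_loop(unlabeled, llm_annotate, train, acquire, rounds=5, batch=50):
    """Active-annotation loop in the spirit of LLMaAA. Each round: pick the
    most informative unlabeled items, label them with the LLM annotator,
    and retrain the small task-specific model on everything labeled so far.
    """
    labeled, model = [], None
    for _ in range(rounds):
        if model is None:  # no model yet: fall back to a random batch
            picks = random.sample(unlabeled, min(batch, len(unlabeled)))
        else:              # e.g., highest predictive entropy under `model`
            picks = acquire(model, unlabeled, batch)
        labeled += [(x, llm_annotate(x)) for x in picks]
        unlabeled = [x for x in unlabeled if x not in picks]
        model = train(labeled)
    return model
```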
arXiv Detail & Related papers (2023-10-30T14:54:15Z) - DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification.
We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training.
Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
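A sketch of the zero-shot pseudo-labeling step for novel classes, assuming proposal and class-prompt embeddings from a CLIP-style model and a hand-picked threshold (the paper's actual selection rule may differ):

```python
import numpy as np

def pseudo_label_novel(proposal_embs, novel_text_embs, tau=0.3):
    """Assign a region proposal to a novel class when its best cosine
    similarity to any novel-class prompt embedding exceeds a threshold,
    turning confident zero-shot matches into pseudo ground truth."""
    p = proposal_embs / np.linalg.norm(proposal_embs, axis=1, keepdims=True)
    t = novel_text_embs / np.linalg.norm(novel_text_embs, axis=1, keepdims=True)
    sims = p @ t.T                        # (num_proposals, num_novel_classes)
    keep = sims.max(axis=1) > tau         # confident matches only
    return np.flatnonzero(keep), sims.argmax(axis=1)[keep]
```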
arXiv Detail & Related papers (2023-10-02T17:52:24Z) - Task Residual for Tuning Vision-Language Models [69.22958802711017]
We propose a new efficient tuning approach for vision-language models (VLMs) named Task Residual Tuning (TaskRes).
TaskRes explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task.
The proposed TaskRes is simple yet effective, significantly outperforming previous methods on 11 benchmark datasets.
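The decoupling is concrete enough to sketch: the classifier is the frozen text embedding of each class name plus a small learned residual. A minimal PyTorch version (the scale `alpha` and zero initialization are assumptions):

```python
import torch
import torch.nn as nn

class TaskResidualHead(nn.Module):
    """Classifier head in the spirit of TaskRes: the frozen text embedding
    of each class name (the pretrained prior) plus a small learned residual
    (the task-specific new knowledge)."""
    def __init__(self, text_embs, alpha=0.5):
        super().__init__()
        self.register_buffer("base", text_embs)               # frozen prior
        self.res = nn.Parameter(torch.zeros_like(text_embs))  # tuned per task
        self.alpha = alpha                                    # residual scale

    def forward(self, image_feats):
        w = self.base + self.alpha * self.res
        w = w / w.norm(dim=-1, keepdim=True)
        x = image_feats / image_feats.norm(dim=-1, keepdim=True)
        return x @ w.t()  # cosine-similarity logits over classes
```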
arXiv Detail & Related papers (2022-11-18T15:09:03Z) - Transfer Learning or Self-supervised Learning? A Tale of Two Pretraining Paradigms [36.04356511882304]
Self-supervised learning (SSL) has demonstrated promising results on a wide range of applications.
There has not been a clear understanding of what properties of data and tasks make one approach outperform the other.
arXiv Detail & Related papers (2020-06-19T05:21:00Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
Because training on the enlarged dataset is costly, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)