UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with
Vision-Language Models
- URL: http://arxiv.org/abs/2307.11227v1
- Date: Thu, 20 Jul 2023 20:45:13 GMT
- Title: UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with
Vision-Language Models
- Authors: Xin Li, Sima Behpour, Thang Doan, Wenbin He, Liang Gou, Liu Ren
- Abstract summary: We introduce UP-DP, a simple yet effective unsupervised prompt learning approach that adapts vision-language models for data pre-selection.
Specifically, with the BLIP-2 parameters frozen, we train text prompts to extract the joint features with improved representation.
We extensively compare our method with the state-of-the-art using seven benchmark datasets in different settings, achieving a performance gain of up to 20%.
- Score: 24.50445616970387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we investigate the task of data pre-selection, which aims to
select instances for labeling from an unlabeled dataset through a single pass,
thereby optimizing performance for undefined downstream tasks with a limited
annotation budget. Previous approaches to data pre-selection relied solely on
visual features extracted from foundation models, such as CLIP and BLIP-2, but
largely ignored the power of text features. In this work, we argue that,
with proper design, the joint feature space of both vision and text can yield a
better representation for data pre-selection. To this end, we introduce UP-DP,
a simple yet effective unsupervised prompt learning approach that adapts
vision-language models, like BLIP-2, for data pre-selection. Specifically, with
the BLIP-2 parameters frozen, we train text prompts to extract the joint
features with improved representation, ensuring a diverse cluster structure
that covers the entire dataset. We extensively compare our method with the
state-of-the-art using seven benchmark datasets in different settings,
achieving a performance gain of up to 20%. Interestingly, the prompts learned
from one dataset demonstrate significant generalizability and can be applied
directly to enhance the feature extraction of BLIP-2 from other datasets. To
the best of our knowledge, UP-DP is the first work to incorporate unsupervised
prompt learning in a vision-language model for data pre-selection.
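For a concrete picture of the pipeline the abstract describes, here is a minimal, hypothetical sketch: a learnable text prompt is optimized against a frozen vision-language backbone so that the joint features spread over a diverse set of clusters, and pre-selection then labels the instance closest to each cluster. The backbone stub, diversity loss, and selection rule are illustrative assumptions, not the authors' released implementation of UP-DP or the real BLIP-2 interface.

```python
# Illustrative sketch only: a stand-in frozen backbone replaces BLIP-2, and the
# diversity objective is a generic clustering-friendly loss, not the paper's exact one.
import torch
import torch.nn.functional as F

class FrozenBackboneStub(torch.nn.Module):
    """Placeholder for a frozen vision-language backbone (e.g. BLIP-2's Q-Former)."""
    def __init__(self, img_dim=512, prompt_dim=64, feat_dim=128):
        super().__init__()
        self.proj = torch.nn.Linear(img_dim + prompt_dim, feat_dim)
        for p in self.parameters():               # backbone parameters stay frozen
            p.requires_grad_(False)

    def forward(self, image_feats, prompt):
        prompt = prompt.expand(image_feats.size(0), -1)
        joint = torch.cat([image_feats, prompt], dim=-1)
        return F.normalize(self.proj(joint), dim=-1)

def diversity_loss(feats, centroids):
    """Negative entropy of average cluster usage; minimizing it spreads features over clusters."""
    probs = (feats @ centroids.t()).softmax(dim=-1)
    marginal = probs.mean(dim=0)                  # average cluster usage over the batch
    return (marginal * marginal.clamp_min(1e-8).log()).sum()

torch.manual_seed(0)
backbone = FrozenBackboneStub()
prompt = torch.nn.Parameter(0.02 * torch.randn(1, 64))   # learnable "text prompt"
centroids = torch.nn.Parameter(torch.randn(10, 128))
optim = torch.optim.Adam([prompt, centroids], lr=1e-3)

images = torch.randn(256, 512)        # stand-in visual features for the unlabeled pool
for _ in range(100):                  # prompt learning with the backbone frozen
    feats = backbone(images, prompt)
    loss = diversity_loss(feats, F.normalize(centroids, dim=-1))
    optim.zero_grad()
    loss.backward()
    optim.step()

# Pre-selection: label the instance closest to each cluster centroid
# (budget = 10 here; duplicates are possible in this toy version).
with torch.no_grad():
    feats = backbone(images, prompt)
    selected = (feats @ F.normalize(centroids, dim=-1).t()).argmax(dim=0)
print("indices chosen for labeling:", selected.tolist())
```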
Related papers
- The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph [45.51085356985464]
We introduce GraphFilter, a novel method that represents the dataset as a bipartite graph, linking sentences to their constituent n-grams.
This representation effectively captures the relationships between sentences and linguistic patterns, facilitating the selection of sentences that enhance n-gram diversity.
GraphFilter iteratively selects high-priority sentences, updates the bipartite graph by removing covered n-grams, and re-calculates priorities to reflect the evolving data landscape.
arXiv Detail & Related papers (2024-10-16T11:16:34Z)
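As a rough illustration of the greedy loop described in the GraphFilter summary above, the sketch below builds a sentence-to-n-gram bipartite graph, scores each sentence by the number of still-uncovered n-grams it touches, and rescores after every pick. The bigram choice and the purely coverage-based priority are simplifying assumptions; the paper additionally factors in sentence quality.

```python
# Illustrative sketch of greedy selection over a sentence/n-gram bipartite graph.
# The priority here is just the count of uncovered bigrams; the actual GraphFilter
# method also folds in a per-sentence quality score, which is omitted.
from typing import List, Set, Tuple

def ngrams(sentence: str, n: int = 2) -> Set[Tuple[str, ...]]:
    toks = sentence.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def graph_filter(sentences: List[str], budget: int) -> List[str]:
    sent_grams = [ngrams(s) for s in sentences]        # sentence -> n-gram edges
    uncovered = set().union(*sent_grams) if sent_grams else set()
    remaining = set(range(len(sentences)))
    chosen: List[str] = []
    while remaining and uncovered and len(chosen) < budget:
        # Priority = how many still-uncovered n-grams the sentence would add.
        best = max(remaining, key=lambda i: len(sent_grams[i] & uncovered))
        chosen.append(sentences[best])
        uncovered -= sent_grams[best]                   # remove covered n-grams
        remaining.discard(best)                         # rest are re-scored next round
    return chosen

pool = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "a dog barked at the mailman",
]
print(graph_filter(pool, budget=2))
```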
- CASA: Class-Agnostic Shared Attributes in Vision-Language Models for Efficient Incremental Object Detection [30.46562066023117]
We propose a novel method utilizing attributes in vision-language foundation models for incremental object detection.
Our method constructs a Class-Agnostic Shared Attribute base (CASA) to capture common semantic information among incremental classes.
Through parameter-efficient fine-tuning, our method adds only 0.7% to parameter storage while significantly enhancing its scalability and adaptability.
arXiv Detail & Related papers (2024-10-08T08:36:12Z)
- Concept-skill Transferability-based Data Selection for Large Vision-Language Models [56.0725292404808]
We introduce COINCIDE, an effective and scalable data selection technique for training vision-language models.
We cluster the training data using internal activations from a small model, which identifies concept-skill compositions needed by a target LVLM.
Experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines.
arXiv Detail & Related papers (2024-06-16T16:15:20Z)
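The clustering step in the COINCIDE summary above can be pictured with the toy sketch below: examples are grouped by a small reference model's internal activations and the selection budget is spread across clusters. The random stand-in activations, k-means settings, and equal per-cluster allocation are placeholder assumptions rather than the paper's transferability- and density-aware procedure.

```python
# Toy sketch: cluster examples by a small model's activations, then spread the
# training budget across clusters. Random vectors stand in for real activations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 256))   # stand-in for small-model activations

k = 20
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(activations)

budget = 100
per_cluster = budget // k                    # simple equal allocation per cluster
selected = []
for c in range(k):
    members = np.flatnonzero(labels == c)
    take = min(per_cluster, members.size)
    selected.extend(rng.choice(members, size=take, replace=False).tolist())
print(len(selected), "examples selected across", k, "clusters")
```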
- Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose TIVE, a high-value data selection approach that eliminates redundancy within the visual instruction data and reduces the training cost.
Our approach, using only about 15% of the data, achieves average performance comparable to the full-data fine-tuned model across eight benchmarks.
arXiv Detail & Related papers (2024-03-14T16:47:25Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
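The gradient-similarity idea in the LESS summary above can be sketched as follows: each candidate example gets a low-dimensional gradient feature (here a random projection of a stand-in gradient), and candidates whose gradients align with those of a few target-task examples are kept. LESS itself works with LoRA gradients and Adam-aware influence estimates; everything below is a simplified, hypothetical stand-in.

```python
# Toy sketch of gradient-similarity-based selection. Per-example gradients are
# random-projected to a low dimension and compared (cosine) with target-task
# gradients; this stands in for LESS's LoRA/Adam-based influence estimate.
import torch

torch.manual_seed(0)
dim, low = 10_000, 256                        # full gradient dim, projected dim
proj = torch.randn(dim, low) / low ** 0.5     # shared random projection

def grad_feature(grad: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.normalize(grad @ proj, dim=-1)

candidate_grads = torch.randn(500, dim)       # stand-in per-example training gradients
target_grads = torch.randn(5, dim)            # gradients from a few target-task examples

cand = grad_feature(candidate_grads)
targ = grad_feature(target_grads)
scores = (cand @ targ.t()).max(dim=1).values  # best alignment with any target example

k = 25                                        # e.g. keep ~5% of the pool
selected = scores.topk(k).indices
print("top-influence candidate indices:", selected[:10].tolist())
```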
- One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem.
We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
arXiv Detail & Related papers (2022-05-19T16:25:50Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To tackle the scale of the constructed dataset, we propose to apply a dataset distillation strategy to compress it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.