Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection
- URL: http://arxiv.org/abs/2506.14473v2
- Date: Fri, 27 Jun 2025 04:48:09 GMT
- Title: Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection
- Authors: Zhijing Wan, Zhixiang Wang, Zheng Wang, Xin Xu, Shin'ichi Satoh
- Abstract summary: One-shot subset selection serves as an effective tool to reduce deep learning training costs. Can FM-based subset selection outperform traditional IE-based methods across diverse datasets? Our approach achieves state-of-the-art performance on fine-grained datasets, including Oxford-IIIT Pet, Food-101, and Caltech-UCSD Birds-200-2011.
- Score: 22.659400981807487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One-shot subset selection serves as an effective tool to reduce deep learning training costs by identifying an informative data subset based on the information extracted by an information extractor (IE). Traditional IEs, typically pre-trained on the target dataset, are inherently dataset-dependent. Foundation models (FMs) offer a promising alternative, potentially mitigating this limitation. This work investigates two key questions: (1) Can FM-based subset selection outperform traditional IE-based methods across diverse datasets? (2) Do all FMs perform equally well as IEs for subset selection? Extensive experiments uncovered surprising insights: FMs consistently outperform traditional IEs on fine-grained datasets, whereas their advantage diminishes on coarse-grained datasets with noisy labels. Motivated by these findings, we propose RAM-APL (RAnking Mean-Accuracy of Pseudo-class Labels), a method tailored for fine-grained image datasets. RAM-APL leverages multiple FMs to enhance subset selection by exploiting their complementary strengths. Our approach achieves state-of-the-art performance on fine-grained datasets, including Oxford-IIIT Pet, Food-101, and Caltech-UCSD Birds-200-2011.
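The abstract names RAM-APL but does not spell out its computation, so the following is only a minimal sketch of one plausible reading, assuming pseudo-class labels come from clustering each foundation model's embeddings and that per-model informativeness rankings are fused with agreement-derived weights. Every function name, the hardness criterion, and the fusion rule here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def ram_apl_sketch(fm_embeddings, n_pseudo_classes, budget, seed=0):
    """Hypothetical RAM-APL-style selection over several foundation models.

    fm_embeddings: list of (n_samples, dim_i) arrays, one per foundation model.
    Returns indices of the selected subset.
    """
    pseudo_labels, hardness = [], []
    for emb in fm_embeddings:
        km = KMeans(n_clusters=n_pseudo_classes, n_init=10, random_state=seed).fit(emb)
        pseudo_labels.append(km.labels_)
        # Distance to own pseudo-class centroid: small = prototypical, large = hard.
        hardness.append(np.linalg.norm(emb - km.cluster_centers_[km.labels_], axis=1))

    # Weight each model by how much its pseudo-classes agree with the others'
    # (a stand-in for the paper's mean-accuracy-of-pseudo-labels signal).
    m = len(pseudo_labels)
    weights = np.array([
        np.mean([adjusted_rand_score(pseudo_labels[i], pseudo_labels[j])
                 for j in range(m) if j != i])
        for i in range(m)
    ])
    weights = np.clip(weights, 0.0, None) + 1e-8
    weights /= weights.sum()

    # Fuse per-model hardness rankings into one score and keep the top `budget`.
    ranks = np.stack([np.argsort(np.argsort(h)) for h in hardness]).astype(float)
    fused = (weights[:, None] * ranks).sum(axis=0)
    return np.argsort(-fused)[:budget]
```

Selecting the hardest samples is only one possible criterion; a prototypes-first or mixed strategy would slot into the same scaffold.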
Related papers
- COLLAGE: Adaptive Fusion-based Retrieval for Augmented Policy Learning [19.173177969412656]
We present COLLAGE, a method for COLLective data AGgrEgation in few-shot imitation learning. COLLAGE uses an adaptive late fusion mechanism to guide the selection of relevant demonstrations based on a task-specific combination of multiple cues. COLLAGE outperforms state-of-the-art retrieval and multi-task learning approaches by 5.1% in simulation across 10 tasks, and by 16.6% in the real world across 6 tasks.
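The abstract gives only the high-level mechanism, so here is a minimal late-fusion sketch, assuming per-cue similarity scores have already been computed and the task-specific weights are given; the function and its signature are hypothetical, not the COLLAGE API.

```python
import numpy as np

def late_fusion_retrieve(cue_scores, cue_weights, k):
    """Fuse several retrieval cues and return the top-k demonstrations.

    cue_scores:  dict cue_name -> (n_demos,) similarity of each candidate
                 demonstration to the target task under that cue.
    cue_weights: dict cue_name -> task-specific weight (summing to 1).
    """
    fused = sum(w * np.asarray(cue_scores[name]) for name, w in cue_weights.items())
    return np.argsort(-fused)[:k]

# Example: a visual cue and a semantic cue, weighted for this task.
demos = late_fusion_retrieve(
    {"visual": [0.2, 0.9, 0.4], "semantic": [0.8, 0.3, 0.5]},
    {"visual": 0.6, "semantic": 0.4}, k=2)
```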
arXiv Detail & Related papers (2025-08-02T01:23:09Z)
- Core-Set Selection for Data-efficient Land Cover Segmentation [16.89537279044251]
We propose six novel core-set selection methods for selecting important subsets of samples from remote sensing image segmentation datasets. We benchmark these approaches against a random-selection baseline on three commonly used land cover classification datasets. The results show the importance and potential of data-centric learning for the remote sensing domain.
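The abstract does not detail the six methods, so as a stand-in here is k-center greedy, a standard core-set baseline that such benchmarks commonly include, operating on per-image feature vectors.

```python
import numpy as np

def k_center_greedy(features, budget, seed=0):
    """Select `budget` samples so every sample is near some selected center."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(features)))]
    # Distance from each sample to its nearest selected center so far.
    dists = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(budget - 1):
        nxt = int(np.argmax(dists))  # the farthest point joins the core-set
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(features - features[nxt], axis=1))
    return np.array(selected)
```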
arXiv Detail & Related papers (2025-05-02T12:22:08Z)
- Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations. They generate only a limited range of perturbations for a single Information Extraction (IE) task. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench. We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z)
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
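The abstract does not state the framework's scoring rule; a common CLIP-based heuristic, shown below as an assumption rather than the paper's method, ranks each sample by the alignment between its image embedding and the text embedding of its own label prompt, then drops the least-aligned tail, which tends to concentrate noisy or mislabeled data.

```python
import numpy as np

def clip_alignment_filter(img_emb, txt_emb, labels, keep_ratio=0.8):
    """img_emb: (n, d) L2-normalized CLIP image embeddings.
    txt_emb: (c, d) L2-normalized CLIP text embeddings of c class prompts.
    labels:  (n,) integer class ids. Returns indices of kept samples."""
    align = np.einsum("nd,nd->n", img_emb, txt_emb[labels])  # cosine similarity
    keep = int(len(align) * keep_ratio)
    return np.argsort(-align)[:keep]
```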
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for lifelong instruction tuning. We construct pseudo-skill clusters by grouping gradient-based sample vectors. We select the best-performing data selector for each skill cluster from a pool of selector experts. This data selector samples a subset of the most important samples from each skill cluster for training.
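Read literally, the pipeline is: cluster gradient-based sample vectors into pseudo-skills, pick the best selector expert per cluster, and let it sample within that cluster. The sketch below follows that reading; the selector interface and the best-selector criterion are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def per_cluster_select(grad_vectors, selectors, budget_per_cluster,
                       n_clusters=8, seed=0):
    """grad_vectors: (n, d) gradient-based feature per sample.
    selectors: callables (indices, vectors) -> per-sample importance scores."""
    assign = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(grad_vectors)
    chosen = []
    for c in range(n_clusters):
        idx = np.where(assign == c)[0]
        # Pick the selector that rates this pseudo-skill cluster highest ...
        best = max(selectors, key=lambda s: s(idx, grad_vectors[idx]).mean())
        # ... then keep that selector's top-scoring samples from the cluster.
        scores = best(idx, grad_vectors[idx])
        chosen.extend(idx[np.argsort(-scores)[:budget_per_cluster]])
    return np.array(chosen)
```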
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
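A simplified version of the LESS scoring step, assuming plain per-sample gradients (the actual method works with Adam-preconditioned LoRA gradients): project gradients to a low dimension with a random matrix, then score each training point by cosine similarity to the target task's validation gradients. Training on the top ~5% by such a score is the regime the summary reports.

```python
import numpy as np

def gradient_similarity_scores(train_grads, val_grads, proj_dim=8192, seed=0):
    """train_grads: (n, p) per-sample training gradients.
    val_grads: (m, p) gradients of the target task's validation examples."""
    rng = np.random.default_rng(seed)
    p = train_grads.shape[1]
    # Random projection roughly preserves inner products (Johnson-Lindenstrauss).
    proj = rng.standard_normal((p, proj_dim)) / np.sqrt(proj_dim)
    t = train_grads @ proj
    v = val_grads @ proj
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    # Influence of a training point = best cosine match to any validation gradient.
    return (t @ v.T).max(axis=1)
```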
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets [34.780870585656395]
We propose Dataset-Aware Mixture-of-Experts, DAMEX. We train the experts to become an 'expert' of a dataset by learning to route each dataset's tokens to its mapped expert. Experiments on the Universal Object-Detection Benchmark show that we outperform the existing state-of-the-art.
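A minimal PyTorch sketch of the routing idea, assuming every training batch comes from a single known dataset; DAMEX additionally learns a router so that dataset ids are not needed at test time, which is omitted here.

```python
import torch
import torch.nn as nn

class DatasetAwareMoE(nn.Module):
    """Each expert is an FFN dedicated to one dataset of the mixture;
    tokens are routed by the id of the dataset they came from."""

    def __init__(self, dim, n_datasets):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                           nn.Linear(4 * dim, dim))
             for _ in range(n_datasets)])

    def forward(self, tokens, dataset_id):
        # tokens: (batch, seq, dim); dataset_id: source dataset of this batch.
        return self.experts[dataset_id](tokens)
```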
arXiv Detail & Related papers (2023-11-08T18:55:24Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
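One way to picture the infoVerse space: stack the model-driven meta-information (e.g. confidence, training variability, input length) into a per-sample feature matrix and choose points that spread over it. The farthest-point heuristic below is an illustrative stand-in for the paper's actual sampling strategy on that space.

```python
import numpy as np

def spread_over_meta_space(meta_features, budget):
    """meta_features: (n, k) matrix, one column per meta-information signal."""
    z = (meta_features - meta_features.mean(0)) / (meta_features.std(0) + 1e-8)
    selected = [int(np.argmax(np.linalg.norm(z, axis=1)))]  # start at an extreme point
    dists = np.linalg.norm(z - z[selected[0]], axis=1)
    for _ in range(budget - 1):
        nxt = int(np.argmax(dists))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(z - z[nxt], axis=1))
    return np.array(selected)
```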
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets [27.562256973255728]
Natural language processing models often exploit spurious correlations between task-independent features and labels in datasets to perform well only within the distributions they are trained on.
We propose to tackle this problem by generating a debiased version of a dataset, which can then be used to train a debiased, off-the-shelf model.
Our approach consists of 1) a method for training data generators to generate high-quality, label-consistent data samples; and 2) a filtering mechanism for removing data points that contribute to spurious correlations.
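Component (2) can be illustrated with a toy hypothesis-only filter, under my assumption (not the paper's exact mechanism) that an NLI example whose label is predictable from the hypothesis alone is carrying a spurious cue.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def filter_hypothesis_only(hypotheses, labels, keep_below=0.8):
    """hypotheses: list of str; labels: (n,) ints in 0..c-1.
    Returns indices of examples to KEEP (those a hypothesis-only
    classifier is not confidently right about)."""
    feats = TfidfVectorizer().fit_transform(hypotheses)
    proba = cross_val_predict(LogisticRegression(max_iter=1000), feats, labels,
                              cv=5, method="predict_proba")
    confidence_in_gold = proba[np.arange(len(labels)), labels]
    return np.where(confidence_in_gold < keep_below)[0]
```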
arXiv Detail & Related papers (2022-03-24T09:08:05Z)
- Single-dataset Experts for Multi-dataset Question Answering [6.092171111087768]
We train a network on multiple datasets to generalize and transfer better to new datasets.
Our approach is to model multi-dataset question answering with a collection of single-dataset experts.
Simple methods based on parameter-averaging lead to better zero-shot generalization and few-shot transfer performance.
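The parameter-averaging route is easy to sketch: given experts fine-tuned from a shared initialization and sharing one architecture, average their state dicts into a single model for zero-shot use. A minimal sketch, not the paper's full recipe:

```python
import copy
import torch

def average_experts(expert_state_dicts):
    """Average the weights of single-dataset experts sharing one architecture."""
    avg = copy.deepcopy(expert_state_dicts[0])
    for key in avg:
        avg[key] = torch.stack(
            [sd[key].float() for sd in expert_state_dicts]).mean(dim=0)
    return avg  # apply with model.load_state_dict(avg)
```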
arXiv Detail & Related papers (2021-09-28T17:08:22Z)
- Balancing Constraints and Submodularity in Data Subset Selection [43.03720397062461]
We show that one can achieve accuracy similar to traditional deep-learning models while using less training data. We propose a novel diversity-driven objective function, and balancing constraints on class labels and decision boundaries using matroids.
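The general recipe, greedy maximization of a submodular objective subject to a matroid constraint, can be sketched with a facility-location objective and a per-class partition matroid; the paper's exact objective and constraints differ in detail.

```python
import numpy as np

def facility_location_greedy(sim, labels, per_class_budget):
    """sim: (n, n) pairwise similarities; labels: (n,) class ids.
    Greedily picks items with at most `per_class_budget` per class."""
    n = sim.shape[0]
    coverage = np.zeros(n)  # best similarity of each item to the selected set
    counts, selected = {}, []
    total = per_class_budget * np.unique(labels).size
    while len(selected) < total:
        # Marginal gain of adding item i: how much it lifts overall coverage.
        gains = np.maximum(sim - coverage, 0.0).sum(axis=1)
        gains[selected] = -1.0  # already chosen
        full = [i for i in range(n)
                if counts.get(int(labels[i]), 0) >= per_class_budget]
        gains[full] = -1.0      # class budget exhausted (partition matroid)
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break
        selected.append(best)
        counts[int(labels[best])] = counts.get(int(labels[best]), 0) + 1
        coverage = np.maximum(coverage, sim[best])
    return np.array(selected)
```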
arXiv Detail & Related papers (2021-04-26T19:22:27Z)