Maximizing V-information for Pre-training Superior Foundation Models
- URL: http://arxiv.org/abs/2408.07107v2
- Date: Fri, 16 Aug 2024 12:19:44 GMT
- Title: Maximizing V-information for Pre-training Superior Foundation Models
- Authors: Wenxuan Yang, Weimin Tan, Hanyu Zhang, Bo Yan
- Abstract summary: Foundation models pre-trained on large-scale datasets demonstrate exceptional performance.
Recent research questions whether an increase in pre-training data always leads to enhanced model performance.
We develop an optimal data-effective learning method to maximize V-information.
- Score: 14.78688545049181
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models pre-trained on large-scale datasets demonstrate exceptional performance. However, recent research questions this traditional notion, exploring whether an increase in pre-training data always leads to enhanced model performance. To address this issue, data-effective learning approaches have been introduced. However, current methods in this area lack a clear standard for sample selection. Our experiments reveal that by maximizing V-information, sample selection can be framed as an optimization problem, enabling effective improvement in model performance even with fewer samples. Under this guidance, we develop an optimal data-effective learning method (OptiDEL) to maximize V-information. The OptiDEL method generates hard samples to achieve or even exceed the performance of models trained on the full dataset while using substantially less data. We compare OptiDEL with state-of-the-art approaches, finding that it consistently outperforms existing approaches across different datasets, with foundation models trained on only 5% of the pre-training data surpassing the performance of those trained on the full dataset.
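For context, V-information measures the label information that a restricted model family V can actually extract from the inputs, I_V(X -> Y) = H_V(Y) - H_V(Y | X). The sketch below is a minimal, assumed illustration of framing sample selection as optimizing this quantity: it ranks candidates by pointwise V-information (PVI) computed from two hypothetical proxy models and treats low-PVI samples as the "hard" ones. It is not the paper's OptiDEL procedure; all function names and values are illustrative.

```python
# Illustrative sketch (NOT the OptiDEL algorithm): rank samples by pointwise
# V-information (PVI), where low PVI marks "hard" samples that a proxy model
# in the family V cannot yet explain.
import numpy as np

def pvi_scores(log_p_y_given_x, log_p_y_marginal):
    """Per-sample PVI = log g[x](y) - log g0(y); its mean estimates I_V(X -> Y)."""
    return np.asarray(log_p_y_given_x) - np.asarray(log_p_y_marginal)

def pick_hard_samples(log_p_y_given_x, log_p_y_marginal, k):
    """Return indices of the k lowest-PVI (hardest) samples."""
    return np.argsort(pvi_scores(log_p_y_given_x, log_p_y_marginal))[:k]

# Toy usage with hypothetical log-probabilities from two small proxy models.
cond = np.log([0.90, 0.20, 0.60, 0.05])  # p(y | x) under the conditional proxy g
marg = np.log([0.50, 0.50, 0.50, 0.50])  # p(y) under the label-only predictor g0
print(pick_hard_samples(cond, marg, k=2))  # -> indices of the two hardest samples
```

In this framing both predictors would come from the same model family V, so the score reflects only information that family can actually use.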
Related papers
- Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective [4.548047308860141]
This study investigates the impact of different types of preference data on model performance.
It aims to reduce LLMs' dependency on extensive amounts of preference data, which is expensive to collect.
arXiv Detail & Related papers (2024-10-22T00:11:41Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis [7.234618871984921]
An emerging area of research aims to learn deep generative models with limited training data.
We propose RS-IMLE, a novel approach that changes the prior distribution used for training.
This leads to substantially higher quality image generation compared to existing GAN and IMLE-based methods.
arXiv Detail & Related papers (2024-09-26T00:19:42Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [16.654859430784825]
We introduce model-aware data selection with data influence models (MATES).
We fine-tune a small data influence model to approximate oracle data preference signals collected by locally probing the pretraining model.
Experiments on Pythia and the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks.
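A toy sketch of this two-stage idea, assumed purely for illustration and not the MATES implementation (the ridge regressor and random features are hypothetical stand-ins for the small data influence model and its oracle training signal): fit a cheap model to oracle influence scores measured on a probe subset, then use it to rank the full corpus.

```python
# Hedged sketch of the two-stage idea: fit a small "data influence model"
# (here a ridge regressor on toy features) to oracle influence scores measured
# on a probe subset, then use it to rank the full corpus.
import numpy as np

def fit_influence_model(probe_features, probe_scores, l2=1.0):
    """Closed-form ridge regression: w = (X^T X + l2*I)^-1 X^T y."""
    X, y = np.asarray(probe_features), np.asarray(probe_scores)
    return np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ y)

def select_top_fraction(corpus_features, w, frac=0.1):
    """Score every example with the influence model and keep the top fraction."""
    scores = np.asarray(corpus_features) @ w
    k = max(1, int(frac * len(scores)))
    return np.argsort(-scores)[:k]

# Toy usage: 32 probed examples with measured influence, 1000-example corpus.
rng = np.random.default_rng(0)
probe_X, probe_y = rng.standard_normal((32, 8)), rng.standard_normal(32)
w = fit_influence_model(probe_X, probe_y)
print(select_top_fraction(rng.standard_normal((1000, 8)), w, frac=0.05)[:10])
```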
arXiv Detail & Related papers (2024-06-10T06:27:42Z)
- Rethinking Overlooked Aspects in Vision-Language Models [32.525916879333145]
Recent advancements in large vision-language models (LVLMs) have been substantial.
Recent works mainly focus on introducing more pre-training and instruction-tuning data to improve model performance.
This paper delves into the often-neglected aspects of data efficiency during pre-training and the selection process for instruction tuning datasets.
arXiv Detail & Related papers (2024-05-20T07:53:41Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
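As a rough, assumed sketch of the low-rank gradient-similarity idea (not the LESS implementation; the random projection, feature construction, and function names below are hypothetical), one could rank candidates by the cosine similarity between their projected gradients and the average projected gradient of a small target set.

```python
# Hedged sketch in the spirit of gradient-similarity data selection: project
# per-example gradients to a low dimension, then keep the candidates whose
# projected gradients best align with the average target-task gradient.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_by_gradient_similarity(train_grads, target_grads, frac=0.05, proj_dim=64, seed=0):
    """train_grads: (n, d) per-example gradients; target_grads: (m, d) target-task gradients."""
    rng = np.random.default_rng(seed)
    d = train_grads.shape[1]
    proj = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)  # low-rank random projection
    train_feat = train_grads @ proj
    target_feat = (target_grads @ proj).mean(axis=0)               # average target-task direction
    scores = np.array([cosine(g, target_feat) for g in train_feat])
    k = max(1, int(frac * len(scores)))
    return np.argsort(-scores)[:k]                                  # indices of the selected subset

# Toy usage with random stand-ins for per-example gradients.
rng = np.random.default_rng(1)
picked = select_by_gradient_similarity(rng.standard_normal((200, 512)),
                                        rng.standard_normal((8, 512)))
print(len(picked), picked[:5])
```

The 5% selection budget mirrors the figure quoted in the summary; everything else here is a simplification.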
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Functional Graphical Models: Structure Enables Offline Data-Driven Optimization [111.28605744661638]
We show how structure can enable sample-efficient data-driven optimization.
We also present a data-driven optimization algorithm that infers the functional graphical model (FGM) structure itself.
arXiv Detail & Related papers (2024-01-08T22:33:14Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can achieve final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.