Threshold Filtering Packing for Supervised Fine-Tuning: Training Related Samples within Packs
- URL: http://arxiv.org/abs/2408.09327v2
- Date: Wed, 26 Feb 2025 05:16:14 GMT
- Title: Threshold Filtering Packing for Supervised Fine-Tuning: Training Related Samples within Packs
- Authors: Jiancheng Dong, Lei Jiang, Wei Jin, Lu Cheng
- Abstract summary: We introduce Threshold Filtering Packing (TFP), a method that selects samples with related context while maintaining sufficient diversity within the same pack. Our experiments show that TFP significantly enhances SFT performance, with observed improvements of up to 7% on GSM8K and 4% on HumanEval. Results from bias benchmark datasets highlight TFP's promising performance in improving fairness while also boosting prediction accuracy by 15%.
- Score: 10.85875548925946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Packing for Supervised Fine-Tuning (SFT) in autoregressive models involves concatenating data points of varying lengths until reaching the designed maximum length to facilitate GPU processing. However, randomly concatenating data points can lead to cross-contamination of sequences due to the significant difference in their subject matter. The mainstream approaches in SFT ensure that each token in the attention calculation phase only focuses on tokens within its own short sequence, without providing additional learning signals for the preceding context. To address these challenges, we introduce Threshold Filtering Packing (TFP), a method that selects samples with related context while maintaining sufficient diversity within the same pack. Our experiments show that TFP offers a simple-to-implement and scalable approach that significantly enhances SFT performance, with observed improvements of up to 7% on GSM8K and 4% on HumanEval. Furthermore, results from bias benchmark datasets highlight TFP's promising performance in improving fairness while also boosting prediction accuracy by 15%.
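The abstract describes TFP only at a high level, so the snippet below is a minimal Python sketch of one way threshold-filtered packing could work, assuming unit-normalised sample embeddings, cosine similarity, and illustrative lower/upper thresholds. The function name, parameter values, and greedy selection rule are assumptions for illustration, not the paper's exact algorithm.

```python
# Hypothetical sketch of threshold-filtered packing (not the authors' exact algorithm).
# Assumes each sample has a token length and a unit-normalised embedding vector.
import numpy as np

def threshold_filtering_packing(embeddings, lengths, max_len, lower=0.3, upper=0.9):
    """Greedily build packs of related-but-diverse samples.

    embeddings : (N, d) array of unit-normalised sample embeddings
    lengths    : sequence of token counts per sample
    lower/upper: similarity thresholds (illustrative values)
    """
    remaining = set(range(len(lengths)))
    packs = []
    while remaining:
        seed = remaining.pop()              # start a new pack from an arbitrary sample
        pack, used = [seed], lengths[seed]
        while True:
            # candidates that still fit in the pack's token budget
            cands = [i for i in remaining if used + lengths[i] <= max_len]
            if not cands:
                break
            sims = embeddings[cands] @ embeddings[seed]
            # keep candidates that are related (>= lower) but not near-duplicates (< upper)
            ok = [(i, s) for i, s in zip(cands, sims) if lower <= s < upper]
            if not ok:
                break
            best, _ = max(ok, key=lambda t: t[1])   # most related admissible sample
            pack.append(best)
            used += lengths[best]
            remaining.remove(best)
        packs.append(pack)
    return packs

# toy usage: 6 samples, 8-dim random embeddings, packs of at most 1024 tokens
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(threshold_filtering_packing(emb, [300, 500, 200, 400, 350, 250], 1024))
```

In this sketch the lower threshold keeps each added sample related to the pack's seed, while the upper threshold filters near-duplicates, reflecting the abstract's goal of related context with sufficient diversity within a pack.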
Related papers
- Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process [26.196705232699884]
We introduce Intuitive Fine-Tuning (IFT) to integrate SFT and Preference Optimization into a single process.
IFT performs comparably to, or even better than, sequential recipes of SFT and typical Preference Optimization methods.
An explainable Frozen Lake game further validates the effectiveness of IFT in obtaining a competitive policy.
arXiv Detail & Related papers (2024-05-20T08:23:28Z) - Multiple Instance Learning with random sampling for Whole Slide Image Classification [0.0]
Random sampling of patches during training is computationally efficient and serves as a regularization strategy.
We find an optimal performance gain of 1.7% when using 30% of the patches on the CAMELYON16 dataset, and 3.7% with only eight samples on the TUPAC16 dataset.
We also find that interpretability effects are strongly dataset-dependent: interpretability is impacted on CAMELYON16 but remains unaffected on TUPAC16.
arXiv Detail & Related papers (2024-03-08T14:31:40Z) - ALF: Adaptive Label Finetuning for Scene Graph Generation [116.59868289196157]
Scene Graph Generation (SGG) endeavors to predict the relationships between subjects and objects in a given image.
The long-tail distribution of relations often leads to biased predictions on coarse labels, presenting a substantial hurdle in SGG.
We introduce a one-stage data transfer pipeline for SGG, termed Adaptive Label Finetuning (ALF), which eliminates the need for extra retraining sessions.
ALF achieves a 16% improvement in mR@100 compared to the typical SGG method Motif, with only a 6% increase in computation cost compared to the state-of-the-art method IETrans.
arXiv Detail & Related papers (2023-12-29T01:37:27Z) - Enhancing Trade-offs in Privacy, Utility, and Computational Efficiency through MUltistage Sampling Technique (MUST) [3.0939420223851446]
We propose a class of subsampling methods for privacy amplification (PA).
We conduct comprehensive analyses of the PA effects and utility for several 2-stage MUST procedures.
We provide the privacy loss composition analysis over repeated applications of MUST.
arXiv Detail & Related papers (2023-12-20T19:38:29Z) - Embarrassingly Simple Dataset Distillation [0.0]
We tackle dataset distillation at its core by treating it directly as a bilevel optimization problem.
A deeper dive into the nature of distilled data unveils pronounced intercorrelation.
We devise a boosting mechanism that generates distilled datasets that contain subsets with near optimal performance across different data budgets.
arXiv Detail & Related papers (2023-11-13T02:14:54Z) - Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning [73.75282761503581]
We propose DiffTPT, which leverages pre-trained diffusion models to generate diverse and informative new data.
Our experiments on test datasets with distribution shifts and unseen categories demonstrate that DiffTPT improves the zero-shot accuracy by an average of 5.13%.
arXiv Detail & Related papers (2023-08-11T09:36:31Z) - SLPT: Selective Labeling Meets Prompt Tuning on Label-Limited Lesion Segmentation [57.37875162629063]
We propose a framework that combines selective labeling with prompt tuning to boost performance under limited labels.
We evaluate our method on liver tumor segmentation and achieve state-of-the-art performance, outperforming traditional fine-tuning with only 6% of tunable parameters.
arXiv Detail & Related papers (2023-08-09T12:22:49Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - Hint-Aug: Drawing Hints from Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning [22.0296008705388]
We propose a framework called Hint-based Data Augmentation (Hint-Aug).
It aims to boost foundation vision transformers (FViTs) in few-shot tuning by augmenting the over-fitted parts of tuning samples with the learned features of pretrained FViTs.
Extensive experiments and ablation studies on five datasets and three parameter-efficient tuning techniques consistently validate Hint-Aug's effectiveness.
arXiv Detail & Related papers (2023-04-25T02:22:01Z) - GFlowNet-EM for learning compositional latent variable models [115.96660869630227]
A key tradeoff in modeling the posteriors over latents is between expressivity and tractable optimization.
We propose the use of GFlowNets, algorithms for sampling from an unnormalized density.
By training GFlowNets to sample from the posterior over latents, we take advantage of their strengths as amortized variational algorithms.
arXiv Detail & Related papers (2023-02-13T18:24:21Z) - Dense FixMatch: a simple semi-supervised learning method for pixel-wise prediction tasks [68.36996813591425]
We propose Dense FixMatch, a simple method for online semi-supervised learning of dense and structured prediction tasks.
We enable the application of FixMatch in semi-supervised learning problems beyond image classification by adding a matching operation on the pseudo-labels.
Dense FixMatch significantly improves results compared to supervised learning using only labeled data, approaching its performance with 1/4 of the labeled samples.
arXiv Detail & Related papers (2022-10-18T15:02:51Z) - Transformers Can Do Bayesian Inference [56.99390658880008]
We present Prior-Data Fitted Networks (PFNs).
PFNs leverage in-context learning in large-scale machine learning techniques to approximate a large set of posteriors.
We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems.
arXiv Detail & Related papers (2021-12-20T13:07:39Z) - Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "dataechoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
arXiv Detail & Related papers (2020-10-26T14:55:31Z)
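The data echoing idea from the last entry can be made concrete with a short sketch: each fetched minibatch is reused for several optimizer steps so the optimizer is not stalled by a slow input pipeline. The echo factor, plain SGD update, and function names below are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of data echoing: reuse each fetched minibatch for several
# optimizer steps so a slow data pipeline does not stall training.
import numpy as np

def sgd_with_data_echoing(batches, grad_fn, w0, lr=0.1, echo_factor=2):
    """Run SGD, repeating each fetched minibatch `echo_factor` times.

    batches     : iterable of minibatches (fetching may be slow)
    grad_fn     : grad_fn(w, batch) -> gradient of the loss on that batch
    echo_factor : number of optimizer steps taken per fetched batch
    """
    w = np.asarray(w0, dtype=float)
    for batch in batches:                # one (possibly slow) fetch per batch
        for _ in range(echo_factor):     # several cheap steps on the same batch
            w = w - lr * grad_fn(w, batch)
    return w

# toy usage: least-squares on synthetic data, echoing each batch twice
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
batches = [(X[i:i + 10], y[i:i + 10]) for i in range(0, 100, 10)]
grad = lambda w, b: 2 * b[0].T @ (b[0] @ w - b[1]) / len(b[1])
w_star = sgd_with_data_echoing(batches, grad, np.zeros(3), lr=0.05, echo_factor=2)
```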