Improving Continual Pre-training Through Seamless Data Packing
- URL: http://arxiv.org/abs/2505.22018v2
- Date: Thu, 29 May 2025 07:20:02 GMT
- Title: Improving Continual Pre-training Through Seamless Data Packing
- Authors: Ruicheng Yin, Xuan Gao, Changze Lv, Xiaohua Wang, Xiaoqing Zheng, Xuanjing Huang
- Abstract summary: We propose a novel data packing strategy aimed at preserving contextual information more effectively and enhancing model performance. Our approach employs a sliding window technique in the first stage that synchronizes overlapping tokens across consecutive sequences. In the second stage, we adopt a First-Fit-Decreasing algorithm to pack shorter texts into bins slightly larger than the target sequence length, thereby minimizing padding and truncation.
- Score: 34.13195340154738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input texts and splitting them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinuity, which can hinder model performance. To address these issues, we explore the potential of data engineering to enhance continual pre-training, particularly its impact on model performance and efficiency. We propose Seamless Packing (SP), a novel data packing strategy aimed at preserving contextual information more effectively and enhancing model performance. Our approach employs a sliding window technique in the first stage that synchronizes overlapping tokens across consecutive sequences, ensuring better continuity and contextual coherence. In the second stage, we adopt a First-Fit-Decreasing algorithm to pack shorter texts into bins slightly larger than the target sequence length, thereby minimizing padding and truncation. Empirical evaluations across various model architectures and corpus domains demonstrate the effectiveness of our method, which outperforms the baseline method in 99% of all settings. Code is available at https://github.com/Infernus-WIND/Seamless-Packing.
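Based on the abstract's description, the two stages of Seamless Packing can be sketched roughly as follows. The function names, parameters, and packing details here are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Illustrative sketch of the two-stage packing idea from the abstract:
# stage 1 slides a window with overlap over long token streams; stage 2
# packs short texts into bins slightly larger than the target length
# using First-Fit-Decreasing. All names/parameters are hypothetical.

def sliding_window_pack(tokens, seq_len, overlap):
    """Stage 1: split a long token stream into fixed-length sequences
    that share `overlap` tokens with their predecessor, so context is
    not cut abruptly at sequence boundaries."""
    stride = seq_len - overlap
    sequences = []
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + seq_len]
        if len(chunk) < seq_len and sequences:
            break  # leftover shorter than a full window; handled elsewhere
        sequences.append(chunk)
    return sequences

def first_fit_decreasing_pack(texts, bin_capacity):
    """Stage 2: First-Fit-Decreasing bin packing of short texts into bins
    of `bin_capacity` (slightly larger than the target sequence length),
    minimizing padding and truncation."""
    bins = []  # each bin is [remaining_capacity, [texts]]
    for text in sorted(texts, key=len, reverse=True):
        for b in bins:
            if b[0] >= len(text):  # first bin with enough room
                b[0] -= len(text)
                b[1].append(text)
                break
        else:
            bins.append([bin_capacity - len(text), [text]])
    return [contents for _, contents in bins]
```

For example, with `seq_len=4` and `overlap=2`, consecutive sequences share two tokens; with `bin_capacity=5`, texts of lengths 4, 3, 2, and 1 pack into two bins with no wasted capacity.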
Related papers
- SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency [10.555957282859]
This paper introduces a novel dynamic bootstrapping dataset pruning method.
It involves pruning data preparation followed by dataset mutation operations, both of which undergo iterative and dynamic updates.
We individually pre-train seven CLIP models on two large-scale image-text pair datasets, and two MoCo models on the ImageNet dataset, resulting in a total of 16 pre-trained models.
arXiv Detail & Related papers (2024-11-14T01:53:17Z)
- Task-Oriented Pre-Training for Drivable Area Detection [5.57325257338134]
We propose a task-oriented pre-training method that begins with generating redundant segmentation proposals.
We then introduce a Specific Category Enhancement Fine-tuning (SCEF) strategy for fine-tuning the Contrastive Language-Image Pre-training (CLIP) model.
This approach can generate a large amount of coarse training data for pre-training models, which are further fine-tuned using manually annotated data.
arXiv Detail & Related papers (2024-09-30T10:25:47Z)
- Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation [69.60321475454843]
We propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation.
In the pre-training stage, we propose a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales.
Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module.
arXiv Detail & Related papers (2024-08-21T06:48:38Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in the zero-shot generalization of VLMs; the overall method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- Beyond Fixed Length: Bucket Pre-training is All You Need [27.273944625005377]
We propose a novel multi-bucket data composition method that transcends the fixed-length paradigm. Our approach adaptively organizes training data to achieve optimal composition quality as measured by the proposed metrics.
arXiv Detail & Related papers (2024-07-10T09:27:23Z)
- Repeated Padding+: Simple yet Effective Data Augmentation Plugin for Sequential Recommendation [9.913317029557588]
We propose a simple yet effective padding method called Repeated Padding+ (RepPad+). Our method contains no trainable parameters or hyperparameters and is a plug-and-play data augmentation operation. The average recommendation performance improvement is up to 84.11% on GRU4Rec and 35.34% on SASRec.
arXiv Detail & Related papers (2024-03-11T01:50:41Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- APS: Active Pretraining with Successor Features [96.24533716878055]
We show that by reinterpreting and combining successor features with nonparametric entropy maximization, the intractable mutual information can be efficiently optimized.
The proposed method, Active Pretraining with Successor Features (APS), explores the environment via nonparametric entropy maximization, and the explored data can be efficiently leveraged to learn behavior.
arXiv Detail & Related papers (2021-08-31T16:30:35Z)
- TSO: Curriculum Generation using continuous optimization [0.0]
We present a simple and efficient technique based on continuous optimization.
An encoder network maps/embeds a training sequence into a continuous space.
A predictor network takes the continuous representation of a strategy as input and predicts the accuracy for a fixed network architecture.
arXiv Detail & Related papers (2021-06-16T06:32:21Z)
- Improving Semantic Segmentation via Self-Training [75.07114899941095]
We show that we can obtain state-of-the-art results using a semi-supervised approach, specifically a self-training paradigm.
We first train a teacher model on labeled data, and then generate pseudo labels on a large set of unlabeled data.
Our robust training framework can digest human-annotated and pseudo labels jointly and achieve top performances on Cityscapes, CamVid and KITTI datasets.
arXiv Detail & Related papers (2020-04-30T17:09:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.