Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining
- URL: http://arxiv.org/abs/2509.24356v1
- Date: Mon, 29 Sep 2025 06:54:59 GMT
- Title: Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining
- Authors: Matthew Theodore Roque, Dan John Velasco
- Abstract summary: We study curriculum learning in pretraining, focusing on text-complexity ordering and data augmentation via simplification. We test four data schedules: repeated exposure, low-to-high complexity, high-to-low, and interleaved. Our findings show that adding simplified data improves fine-tuning and zero-shot performance over a repeated-exposure baseline.
- Score: 0.19258299315493077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most studies on language model pretraining focus on large datasets, leaving open questions about optimization in data-constrained settings. In such settings, the effects of training data order and of including alternative versions of the same text remain underexplored. We address this by studying curriculum learning in pretraining, focusing on text-complexity ordering and data augmentation via simplification. We ask: (1) Does simplifying texts enhance representation quality more than reusing the original data? and (2) Does ordering data by text complexity yield better representations? To answer, we build on a pair of parallel corpora where human-written paragraphs are aligned with LLM-simplified variants, and test four data schedules: repeated exposure, low-to-high complexity, high-to-low, and interleaved. We analyze models' representation quality from a sample efficiency perspective via fine-tuning, as well as their zero-shot performance on linguistic knowledge, entity tracking, world knowledge, and commonsense reasoning. Our findings show that adding simplified data improves fine-tuning and zero-shot performance over a repeated-exposure baseline: smaller models benefit from low-to-high complexity, while larger models perform better with interleaved ordering.
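The four schedules amount to different orderings of the same paired data. A minimal sketch of how they might be constructed, assuming aligned lists `original[i]` / `simplified[i]` of human-written and LLM-simplified paragraphs (function and variable names are illustrative, not the authors' code):

```python
import random

def build_schedule(original, simplified, schedule, seed=0):
    """Order paragraphs for pretraining under one of the four data schedules.

    original, simplified: aligned lists, where simplified[i] is the
    LLM-simplified (lower-complexity) version of original[i].
    """
    idx = list(range(len(original)))
    if schedule == "repeated":
        # Baseline: reuse the original data twice instead of adding simplified text.
        return [original[i] for i in idx] * 2
    if schedule == "low-to-high":
        # All simplified (easier) paragraphs first, then the originals.
        return [simplified[i] for i in idx] + [original[i] for i in idx]
    if schedule == "high-to-low":
        # Originals first, simplified versions last.
        return [original[i] for i in idx] + [simplified[i] for i in idx]
    if schedule == "interleaved":
        # Alternate simplified and original versions throughout training.
        random.Random(seed).shuffle(idx)
        out = []
        for i in idx:
            out.extend([simplified[i], original[i]])
        return out
    raise ValueError(f"unknown schedule: {schedule}")
```

The point of the comparison is that every schedule draws on the same source paragraphs; what changes is whether simplified variants are added and in what order they are seen.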
Related papers
- Text Simplification with Sentence Embeddings [4.484170173286332]
We learn a transformation between sentence embeddings representing high-complexity and low-complexity texts. We conclude that learning transformations in sentence embedding space is a promising direction for future research.
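A minimal sketch of that idea, fitting a map from complex-sentence embeddings to simplified-sentence embeddings, here as a plain least-squares fit on hypothetical paired embeddings (the paper's encoder and transformation may differ):

```python
import numpy as np

# Hypothetical paired embeddings: row i of X embeds a high-complexity sentence,
# row i of Y embeds its low-complexity (simplified) counterpart.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 384))   # complex-sentence embeddings
Y = rng.normal(size=(1000, 384))   # aligned simple-sentence embeddings

# Fit a linear map W that sends complex embeddings toward simple ones,
# i.e. minimise ||XW - Y||^2 (plain least squares for illustration).
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def simplify_embedding(x):
    """Project a complex-sentence embedding into the 'simple' region of the space."""
    return x @ W
```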
arXiv Detail & Related papers (2025-10-28T12:41:10Z) - Rethinking the Role of Text Complexity in Language Model Pretraining [0.19258299315493077]
Text complexity refers to how hard a text is to read. We simplify human-written texts using a large language model, then pretrain causal models from scratch on both original and simplified data. We find that perplexity is sensitive to the interaction between model capacity and text complexity.
arXiv Detail & Related papers (2025-09-20T06:33:01Z) - SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
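The summary leaves the selection criterion unspecified; the sketch below is a generic self-paced heuristic, not necessarily SPaRFT's, in which the pool of RL training prompts tracks the model's measured capability:

```python
def select_prompts(prompts, success_rate, mastered_above=0.9, batch_size=64):
    """Self-paced selection sketch: train on prompts the model has not yet mastered,
    easiest first, so difficulty rises as capability improves.

    success_rate: dict mapping prompt -> running reward estimate in [0, 1]
    """
    candidates = [p for p in prompts if success_rate.get(p, 0.0) < mastered_above]
    candidates.sort(key=lambda p: success_rate.get(p, 0.0), reverse=True)
    return candidates[:batch_size]
```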
arXiv Detail & Related papers (2025-08-07T03:50:48Z) - Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models [92.85086256871027]
We propose REWIRE, REcycling the Web with guIded REwrite, to enrich low-quality documents so that they could become useful for training. We demonstrate that about 82% of the mixed-in texts come from transforming lower-quality documents that would otherwise be discarded.
arXiv Detail & Related papers (2025-06-05T07:12:12Z) - A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
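A minimal sketch of those two steps, with hypothetical feedback sources and a placeholder quality filter (the paper's exact formats and selection criteria are not specified here):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedExample:
    prompt: str
    chosen: str
    rejected: Optional[str] = None   # absent when only a single rated response exists

def unify(numeric_rated, pairwise_prefs):
    """Convert two hypothetical feedback sources into one preference format.

    numeric_rated: list of (prompt, response, score in [0, 1])
    pairwise_prefs: list of (prompt, better_response, worse_response)
    """
    unified = []
    for prompt, response, score in numeric_rated:
        if score >= 0.5:                      # keep highly rated responses as "chosen"
            unified.append(UnifiedExample(prompt, response))
    for prompt, better, worse in pairwise_prefs:
        unified.append(UnifiedExample(prompt, better, worse))
    return unified

def high_quality_subset(unified, keep_fraction=0.25):
    """Second step sketch: keep a smaller, higher-quality slice for fine-tuning."""
    k = max(1, int(len(unified) * keep_fraction))
    # Placeholder ranking; the paper's quality/diversity criteria are not given here.
    return unified[:k]
```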
arXiv Detail & Related papers (2024-08-05T23:20:32Z) - SCAR: Data Selection via Style Consistency-Aware Response Ranking for Efficient Instruction-Tuning of Large Language Models [56.93151679231602]
This research identifies two key stylistic elements in responses: linguistic form and instructional surprisal. Inspired by this, we introduce Style Consistency-Aware Response Ranking (SCAR). SCAR prioritizes instruction-response pairs in the training set based on the stylistic consistency of their responses.
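As a rough sketch, the ranking step can be read as scoring each response for stylistic consistency and keeping the top-ranked pairs; the `style_score` callable below is a stand-in, not SCAR's actual measure:

```python
def scar_style_ranking(pairs, style_score, keep_fraction=0.5):
    """Rank instruction-response pairs by response style consistency (sketch).

    pairs: list of (instruction, response)
    style_score: hypothetical callable mapping a response to a consistency score
    """
    ranked = sorted(pairs, key=lambda p: style_score(p[1]), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```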
arXiv Detail & Related papers (2024-06-16T10:10:37Z) - Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution. It undergoes finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z) - Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and insufficient annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - Faithful Low-Resource Data-to-Text Generation through Cycle Training [14.375070014155817]
Methods to generate text from structured data have advanced significantly in recent years.
Cycle training uses two models which are inverses of each other.
We show that cycle training achieves nearly the same performance as fully supervised approaches.
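A minimal sketch of the cycle idea, assuming duck-typed `data_to_text` and `text_to_data` models with `generate` and `supervised_loss` methods (these interfaces are illustrative, not the paper's code):

```python
def cycle_training_step(data_to_text, text_to_data, records, texts):
    """One conceptual cycle-training step on unpaired records and texts (sketch).

    Each model's output serves as a pseudo-input for training the other model,
    so a record -> text -> record round trip should reconstruct the record, and
    a text -> record -> text round trip should reconstruct the text.
    """
    # Cycle 1: records -> synthetic text; train text_to_data to recover the records.
    synthetic_text = data_to_text.generate(records)
    loss_record_cycle = text_to_data.supervised_loss(inputs=synthetic_text, targets=records)

    # Cycle 2: texts -> synthetic records; train data_to_text to recover the texts.
    synthetic_records = text_to_data.generate(texts)
    loss_text_cycle = data_to_text.supervised_loss(inputs=synthetic_records, targets=texts)

    return loss_record_cycle + loss_text_cycle
```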
arXiv Detail & Related papers (2023-05-24T06:44:42Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)