Related papers: Rethinking the Role of Text Complexity in Language Model Pretraining

Rethinking the Role of Text Complexity in Language Model Pretraining

URL: http://arxiv.org/abs/2509.16551v1
Date: Sat, 20 Sep 2025 06:33:01 GMT
Title: Rethinking the Role of Text Complexity in Language Model Pretraining
Authors: Dan John Velasco, Matthew Theodore Roque,
Abstract summary: Text complexity refers to how hard a text is to read.<n>We simplify human-written texts using a large language model, then pretrain causal models from scratch on both original and simplified data.<n>We find that perplexity is sensitive to the interaction between model capacity and text complexity.
Score: 0.19258299315493077
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Improving pretraining data quality and size is known to boost downstream performance, but the role of text complexity is less explored. Text complexity refers to how hard a text is to read, and is typically estimated from surface cues such as sentence length, word choice, and sentence structure. We reduce surface-level complexity--shorter sentences, simpler words, simpler structure--while keeping core text content close to constant, and ask: (1) How does complexity affect language modeling across model sizes? (2) Can useful representations be learned from simpler text alone? (3) How does pretraining text complexity influence downstream language understanding? To answer these questions, we simplify human-written texts using a large language model, then pretrain causal models (28M-500M) from scratch on both original and simplified data, and evaluate them in finetuning and zero-shot setups. We find that perplexity is sensitive to the interaction between model capacity and text complexity--smaller models degrade far less on simpler texts--while text complexity has little impact on finetuning evaluations, with zero-shot evaluations indicating that simpler texts benefit performance on linguistic knowledge tasks, whereas more complex texts favor tasks requiring world knowledge and entity tracking.

Related papers

Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining [0.19258299315493077]
We study curriculum learning in pretraining, focusing on text-complexity ordering and data augmentation via simplification.<n>We test four data schedules: repeated exposure, low-to-high complexity, high-to-low, and interleaved.<n>Our findings show that adding simplified data improves fine-tuning and zero-shot performance over a repeated-exposure baseline.
arXiv Detail & Related papers (2025-09-29T06:54:59Z)
Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings. An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts) This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution.<n>It undergoes finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
LC-Score: Reference-less estimation of Text Comprehension Difficulty [0.0]
We present textscLC-Score, a simple approach for training text comprehension metric for any French text without reference. Our objective is to quantitatively capture the extend to which a text suits to the textitLangage Clair (LC, textitClear Language) guidelines. We explore two approaches: (i) using linguistically motivated indicators used to train statistical models, and (ii) neural learning directly from text leveraging pre-trained language models.
arXiv Detail & Related papers (2023-10-04T11:49:37Z)
Not Enough Labeled Data? Just Add Semantics: A Data-Efficient Method for Inferring Online Health Texts [0.0]
We employ Abstract Representation (AMR) graphs as a means to model low-resource Health NLP tasks. AMRs are well suited to model online health texts as they represent multi-sentence inputs, abstract away from complex terminology, and model long-distance relationships. Our experiments show that we can improve performance on 6 low-resource health NLP tasks by augmenting text embeddings with semantic graph embeddings.
arXiv Detail & Related papers (2023-09-18T15:37:30Z)
TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture. TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling. It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
Syntactic Complexity Identification, Measurement, and Reduction Through Controlled Syntactic Simplification [0.0]
We present a classical syntactic dependency-based approach to split and rephrase a compound and complex sentence into a set of simplified sentences. The paper also introduces an algorithm to identify and measure a sentence's syntactic complexity. This work is accepted and presented in International workshop on Learning with Knowledge Graphs (IWLKG) at WSDM-2023 Conference.
arXiv Detail & Related papers (2023-04-16T13:13:58Z)
Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts. This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
Lexical Complexity Controlled Sentence Generation [6.298911438929862]
We introduce a novel task of lexical complexity controlled sentence generation. It has enormous potential in domains such as grade reading, language teaching and acquisition. We propose a simple but effective approach for this task based on complexity embedding.
arXiv Detail & Related papers (2022-11-26T11:03:56Z)
One Size Does Not Fit All: The Case for Personalised Word Complexity Models [4.035753155957698]
Complex Word Identification (CWI) aims to detect words within a text that a reader may find difficult to understand. In this paper, we show that personal models are best when predicting word complexity for individual readers.
arXiv Detail & Related papers (2022-05-05T10:53:31Z)
Uniform Complexity for Text Generation [4.867923281108005]
We introduce Uniform Complexity for Text Generation (UCTG), a new benchmark test which raises the challenge of making generative models observe uniform linguistic properties with respect to prompts. We find that models such as GPT-2 struggle to preserve the complexity of input prompts used in its generations, even if finetuned with professionally written texts.
arXiv Detail & Related papers (2022-04-11T15:19:47Z)
CORE-Text: Improving Scene Text Detection with Contrastive Relational Reasoning [65.57338873921168]
Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision. In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, COntrastive RElation (CORE) module. We integrate the CORE module into a two-stage text detector of Mask R-CNN and devise our text detector CORE-Text.
arXiv Detail & Related papers (2021-12-14T16:22:25Z)
How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
Toward the Understanding of Deep Text Matching Models for Information Retrieval [72.72380690535766]
This paper aims at testing whether existing deep text matching methods satisfy some fundamental gradients in information retrieval. Specifically, four attributions are used in our study, i.e., term frequency constraint, term discrimination constraint, length normalization constraints, and TF-length constraint. Experimental results on LETOR 4.0 and MS Marco show that all the investigated deep text matching methods satisfy the above constraints with high probabilities in statistics.
arXiv Detail & Related papers (2021-08-16T13:33:15Z)
Syntax-Enhanced Pre-trained Model [49.1659635460369]
We study the problem of leveraging the syntactic structure of text to enhance pre-trained models such as BERT and RoBERTa. Existing methods utilize syntax of text either in the pre-training stage or in the fine-tuning stage, so that they suffer from discrepancy between the two stages. We present a model that utilizes the syntax of text in both pre-training and fine-tuning stages.
arXiv Detail & Related papers (2020-12-28T06:48:04Z)
Explainable Prediction of Text Complexity: The Missing Preliminaries for Text Simplification [13.447565774887215]
Text simplification reduces the language complexity of professional content for accessibility purposes. End-to-end neural network models have been widely adopted to directly generate the simplified version of input text. We show that text simplification can be decomposed into a compact pipeline of tasks to ensure the transparency and explainability of the process.
arXiv Detail & Related papers (2020-07-31T03:33:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.