Related papers: Content vs. Form: What Drives the Writing Score Gap Across Socioeconomic Backgrounds? A Generated Panel Approach

Content vs. Form: What Drives the Writing Score Gap Across Socioeconomic Backgrounds? A Generated Panel Approach

URL: http://arxiv.org/abs/2601.03469v1
Date: Tue, 06 Jan 2026 23:45:18 GMT
Title: Content vs. Form: What Drives the Writing Score Gap Across Socioeconomic Backgrounds? A Generated Panel Approach
Authors: Nadav Kunievsky, Pedro Pertusi,
Abstract summary: A central question is how much of the socioeconomic-status gap in scores is driven by differences in what students say versus how they say it.<n>We study this question using a large corpus of persuasive essays written by U.S. middle- and high-school students.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Students from different socioeconomic backgrounds exhibit persistent gaps in test scores, gaps that can translate into unequal educational and labor-market outcomes later in life. In many assessments, performance reflects not only what students know, but also how effectively they can communicate that knowledge. This distinction is especially salient in writing assessments, where scores jointly reward the substance of students' ideas and the way those ideas are expressed. As a result, observed score gaps may conflate differences in underlying content with differences in expressive skill. A central question, therefore, is how much of the socioeconomic-status (SES) gap in scores is driven by differences in what students say versus how they say it. We study this question using a large corpus of persuasive essays written by U.S. middle- and high-school students. We introduce a new measurement strategy that separates content from style by leveraging large language models to generate multiple stylistic variants of each essay. These rewrites preserve the underlying arguments while systematically altering surface expression, creating a "generated panel" that introduces controlled within-essay variation in style. This approach allows us to decompose SES gaps in writing scores into contributions from content and style. We find an SES gap of 0.67 points on a 1-6 scale. Approximately 69% of the gap is attributable to differences in essay content quality, Style differences account for 26% of the gap, and differences in evaluation standards across SES groups account for the remaining 5%. These patterns seems stable across demographic subgroups and writing tasks. More broadly, our approach shows how large language models can be used to generate controlled variation in observational data, enabling researchers to isolate and quantify the contributions of otherwise entangled factors.

Related papers

Large Language Models Often Say One Thing and Do Another [49.22262396351797]
We develop a novel evaluation benchmark called the Words and Deeds Consistency Test (WDCT)<n>The benchmark establishes a strict correspondence between word-based and deed-based questions across different domains.<n>The evaluation results reveal a widespread inconsistency between words and deeds across different LLMs and domains.
arXiv Detail & Related papers (2025-03-10T07:34:54Z)
The Shrinking Landscape of Linguistic Diversity in the Age of Large Language Models [7.811355338367627]
We show that the widespread adoption of large language models (LLMs) as writing assistants is linked to notable declines in linguistic diversity.<n>We show that while the core content of texts is retained when LLMs polish and rewrite texts, not only do they homogenize writing styles, but they also alter stylistic elements in a way that selectively amplifies certain dominant characteristics or biases while suppressing others.
arXiv Detail & Related papers (2025-02-16T20:51:07Z)
Graded Relevance Scoring of Written Essays with Dense Retrieval [4.021352247826289]
We propose a novel approach for graded relevance scoring of written essays that employs dense retrieval encoders. We leverage Contriever, which is pre-trained with contrastive learning and demonstrated comparable performance to supervised dense retrieval models. Our method establishes a new state-of-the-art performance in the task-specific scenario, while its extension for the cross-task scenario exhibited a performance that is on par with the state-of-the-art model for that scenario.
arXiv Detail & Related papers (2024-05-08T16:37:58Z)
Word Importance Explains How Prompts Affect Language Model Outputs [0.7223681457195862]
This study presents a method to improve the explainability of large language models by varying individual words in prompts. Unlike classical attention, word importance measures the impact of prompt words on arbitrarily-defined text scores. Results show that word importance scores are closely related to the expected suffix importances for multiple scoring functions.
arXiv Detail & Related papers (2024-03-05T15:04:18Z)
Comprehending Lexical and Affective Ontologies in the Demographically Diverse Spatial Social Media Discourse [0.0]
This study aims to comprehend linguistic and socio-demographic features, encompassing English language styles, conveyed sentiments, and lexical diversity within social media data. Our analysis entails the extraction and examination of various statistical, grammatical, and sentimental features from two groups. Our investigation unveils substantial disparities in certain linguistic attributes between the two groups, yielding a macro F1 score of approximately 0.85.
arXiv Detail & Related papers (2023-11-12T04:23:33Z)
Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
Diversify Question Generation with Retrieval-Augmented Style Transfer [68.00794669873196]
We propose RAST, a framework for Retrieval-Augmented Style Transfer. The objective is to utilize the style of diverse templates for question generation. We develop a novel Reinforcement Learning (RL) based approach that maximizes a weighted combination of diversity reward and consistency reward.
arXiv Detail & Related papers (2023-10-23T02:27:31Z)
Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting [64.80538055623842]
sociodemographic prompting is a technique that steers the output of prompt-based models towards answers that humans with specific sociodemographic profiles would give. We show that sociodemographic information affects model predictions and can be beneficial for improving zero-shot learning in subjective NLP tasks.
arXiv Detail & Related papers (2023-09-13T15:42:06Z)
Does Writing with Language Models Reduce Content Diversity? [16.22006159795341]
Large language models (LLMs) have led to a surge in collaborative writing with model assistance. As different users incorporate suggestions from the same model, there is a risk of decreased diversity in the produced content. We develop a set of diversity metrics and find that writing with InstructGPT (but not the GPT3) results in a statistically significant reduction in diversity.
arXiv Detail & Related papers (2023-09-11T02:16:47Z)
AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models. Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models. We propose detection-based protection models that can detect oversensitivity and overstability causing samples with high accuracies.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are prone to even common-sense adversarial samples. We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms. We also find that since the models are not semantically grounded with world-knowledge and common sense, adding false facts such as the world is flat'' actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.