Automated Alignment of Math Items to Content Standards in Large-Scale Assessments Using Language Models
- URL: http://arxiv.org/abs/2510.05129v2
- Date: Sat, 11 Oct 2025 14:17:12 GMT
- Title: Automated Alignment of Math Items to Content Standards in Large-Scale Assessments Using Language Models
- Authors: Qingshu Xu, Hong Jiao, Tianyi Zhou, Ming Li, Nan Zhang, Sydney Peters, Yanbin Fu,
- Abstract summary: This study evaluates three automated paradigms for aligning items with four domain and nineteen skill labels. We fine-tuned eight models (BERT and its variants) for both domain and skill alignment. Ensemble models did not surpass the best-performing language models.
- Score: 17.18401180371013
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Accurate alignment of items to content standards is critical for valid score interpretation in large-scale assessments. This study evaluates three automated paradigms for aligning items with four domain and nineteen skill labels. First, we extracted embeddings and trained multiple classical supervised machine learning models, and further investigated the impact of dimensionality reduction on model performance. Second, we fine-tuned eight models (BERT and its variants) for both domain and skill alignment. Third, we explored ensemble learning with majority voting and stacking with multiple meta-models. DeBERTa-v3-base achieved the highest weighted-average F1 score of 0.950 for domain alignment, while RoBERTa-large yielded the highest F1 score of 0.869 for skill alignment. Ensemble models did not surpass the best-performing language models. Dimensionality reduction improved linear classifiers based on embeddings, but these classifiers still did not outperform the language models. This study demonstrated different methods for automated item alignment to content standards.
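The abstract names majority voting and the weighted-average F1 score without spelling them out. The following is a minimal sketch of both in plain Python, not the authors' implementation: the function names, the toy domain labels, and the three model prediction lists are hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per item by plurality."""
    voted = []
    for item_votes in zip(*predictions_per_model):
        # On a tie, the label seen first (earliest-listed model) wins.
        # This tie-break is an arbitrary choice made for this sketch.
        voted.append(Counter(item_votes).most_common(1)[0][0])
    return voted


def weighted_f1(y_true, y_pred):
    """Per-label F1 averaged with weights equal to each label's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical toy data: four items, three made-up domain labels.
y_true = ["algebra", "geometry", "algebra", "number"]
model_a = ["algebra", "geometry", "number", "number"]
model_b = ["algebra", "algebra", "algebra", "number"]
model_c = ["geometry", "geometry", "algebra", "number"]

ensemble = majority_vote([model_a, model_b, model_c])
print(weighted_f1(y_true, model_a), weighted_f1(y_true, ensemble))
```

In this toy case the vote corrects each model's individual errors, which is the usual motivation for ensembling. The abstract reports that, on the real data, ensembles did not surpass the best single language model.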
Related papers
- Empirical Comparison of Encoder-Based Language Models and Feature-Based Supervised Machine Learning Approaches to Automated Scoring of Long Essays [8.899249868081956]
Long context may impose challenges for encoder-only language models in text processing. This study trained several commonly used encoder-based language models for automated scoring of long essays.
arXiv Detail & Related papers (2026-01-06T02:17:45Z) - Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests [16.474453687125948]
This study investigated the performance of fine-tuned small language models (SLMs) for automated item alignment. The impact of types and sizes of the input data for training was investigated. The study results showed that fine-tuned SLMs consistently outperformed the embedding-based supervised machine learning models.
arXiv Detail & Related papers (2025-09-30T15:53:22Z) - Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification. We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z) - KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model [27.25688303240741]
KaLM-Embedding is a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance.
arXiv Detail & Related papers (2025-01-02T03:17:51Z) - Enabling Small Models for Zero-Shot Selection and Reuse through Model Label Learning [50.68074833512999]
We introduce a novel paradigm, Model Label Learning (MLL), which bridges the gap between models and their functionalities. Experiments on seven real-world datasets validate the effectiveness and efficiency of MLL.
arXiv Detail & Related papers (2024-08-21T09:08:26Z) - Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Just several anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
arXiv Detail & Related papers (2023-09-14T17:45:51Z) - Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting.
The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z) - MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z) - DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
Vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z) - Gestalt: a Stacking Ensemble for SQuAD2.0 [0.0]
We propose a deep-learning system that finds, or indicates the lack of, a correct answer to a question in a context paragraph.
Our goal is to learn an ensemble of heterogeneous SQuAD2.0 models that outperforms the best model in the ensemble per se.
arXiv Detail & Related papers (2020-04-02T08:09:22Z)