Evaluating 21st-Century Competencies in Postsecondary Curricula with Large Language Models: Performance Benchmarking and Reasoning-Based Prompting Strategies
- URL: http://arxiv.org/abs/2601.10983v1
- Date: Fri, 16 Jan 2026 04:07:23 GMT
- Title: Evaluating 21st-Century Competencies in Postsecondary Curricula with Large Language Models: Performance Benchmarking and Reasoning-Based Prompting Strategies
- Authors: Zhen Xu, Xin Guan, Chenxi Shi, Qinhao Chen, Renzhe Yu
- Abstract summary: We extend prior curricular analytics research by examining a broader range of curriculum documents, competency frameworks, and models. Using 7,600 manually annotated curriculum-competency alignment scores, we assess the informativeness of different curriculum sources. We introduce a reasoning-based prompting strategy, Curricular CoT, to strengthen LLMs' pedagogical reasoning.
- Score: 6.934935343001595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing emphasis on 21st-century competencies in postsecondary education, intensified by the transformative impact of generative AI, underscores the need to evaluate how these competencies are embedded in curricula and how effectively academic programs align with evolving workforce and societal demands. Curricular Analytics, particularly recent generative AI-powered approaches, offer a promising data-driven pathway. However, analyzing 21st-century competencies requires pedagogical reasoning beyond surface-level information retrieval, and the capabilities of large language models in this context remain underexplored. In this study, we extend prior curricular analytics research by examining a broader range of curriculum documents, competency frameworks, and models. Using 7,600 manually annotated curriculum-competency alignment scores, we assess the informativeness of different curriculum sources, benchmark general-purpose LLMs for curriculum-to-competency mapping, and analyze error patterns. We further introduce a reasoning-based prompting strategy, Curricular CoT, to strengthen LLMs' pedagogical reasoning. Our results show that detailed instructional activity descriptions are the most informative type of curriculum document for competency analytics. Open-weight LLMs achieve accuracy comparable to proprietary models on coarse-grained tasks, demonstrating their scalability and cost-effectiveness for institutional use. However, no model reaches human-level precision in fine-grained pedagogical reasoning. Our proposed Curricular CoT yields modest improvements by reducing bias in instructional keyword inference and improving the detection of nuanced pedagogical evidence in long text. Together, these findings highlight the untapped potential of institutional curriculum documents and provide an empirical foundation for advancing AI-driven curricular analytics.
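As a concrete illustration, here is a minimal sketch of what a reasoning-based curriculum-to-competency scoring prompt in the spirit of Curricular CoT might look like. The abstract does not reproduce the paper's template, so the step structure, the 0-3 rubric, and the wording below are illustrative assumptions.

```python
# Hedged sketch of a Curricular-CoT-style prompt builder. The step
# structure and the 0-3 rubric are assumptions, not the authors'
# published template.

def build_curricular_cot_prompt(course_text: str, competency: str) -> str:
    """Ask the model to reason pedagogically before scoring alignment."""
    return (
        "You are analyzing a postsecondary curriculum document.\n"
        f"Competency: {competency}\n"
        f"Curriculum excerpt:\n{course_text}\n\n"
        "Step 1: List the instructional activities described in the excerpt.\n"
        "Step 2: For each activity, explain whether it requires students to "
        "practice the competency, quoting specific phrases as evidence.\n"
        "Step 3: Output an alignment score from 0 (absent) to 3 (central), "
        "formatted as 'Score: N'."
    )

if __name__ == "__main__":
    excerpt = ("Students work in teams to critique and revise each "
               "other's lab reports before final submission.")
    print(build_curricular_cot_prompt(excerpt, "collaboration"))
```

Forcing an evidence-quoting step before the score mirrors the abstract's emphasis on detecting nuanced pedagogical evidence in long text rather than keying on surface keywords.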
Related papers
- Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays [15.895792302323883]
In educational contexts, teachers and learners require interpretable, trait-level feedback. We study trait-based Automatic Argumentative Essay Scoring using two complementary modeling paradigms. We show that explicitly modeling score ordinality substantially improves agreement with human raters.
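To make the ordinality point concrete, the sketch below shows the common trick of recasting a K-level score as K-1 cumulative binary targets; the 0-4 scale and 0.5 threshold are assumptions for illustration, not the paper's exact model.

```python
# Minimal sketch of ordinal score encoding/decoding; illustrative only.
import numpy as np

def to_ordinal_targets(score: int, num_levels: int = 5) -> np.ndarray:
    """Encode score k as k ones then zeros: slot t asks 'is the essay
    above threshold t?', so order information is preserved."""
    return (np.arange(num_levels - 1) < score).astype(float)

def from_ordinal_probs(probs: np.ndarray) -> int:
    """Decode by counting thresholds the classifier believes are exceeded."""
    return int((probs > 0.5).sum())

print(to_ordinal_targets(3))                                # [1. 1. 1. 0.]
print(from_ordinal_probs(np.array([0.9, 0.7, 0.4, 0.1])))   # 2
```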
arXiv Detail & Related papers (2026-02-04T14:30:52Z)
- Analytical Survey of Learning with Low-Resource Data: From Analysis to Investigation [192.53529928861818]
Learning with high-resource data has demonstrated substantial success in artificial intelligence (AI). However, the costs associated with data annotation and model training remain significant. This survey employs active sampling theory to analyze the generalization error and label complexity associated with learning from low-resource data.
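For intuition, uncertainty sampling is the simplest instance of the active-sampling ideas analyzed here: spend the label budget on the points the current model is least sure about. The synthetic data, model choice, and budget in this sketch are toy assumptions.

```python
# Toy uncertainty-sampling loop; data and budget are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Seed with five labeled examples per class so the first fit succeeds.
labeled = [int(i) for i in np.where(y == 1)[0][:5]] + \
          [int(i) for i in np.where(y == 0)[0][:5]]
pool = [i for i in range(len(X)) if i not in set(labeled)]

model = LogisticRegression()
for _ in range(20):                       # label budget: 20 queries
    model.fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[pool])[:, 1]
    # Query the point the model is least certain about (prob nearest 0.5).
    idx = pool[int(np.argmin(np.abs(probs - 0.5)))]
    labeled.append(idx)
    pool.remove(idx)

print("accuracy after 30 labels:", round(model.score(X, y), 3))
```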
arXiv Detail & Related papers (2025-10-10T03:15:42Z)
- From Course to Skill: Evaluating LLM Performance in Curricular Analytics [2.5104969073405976]
Large language models (LLMs) are promising for handling large-scale, unstructured curriculum data. We systematically evaluate four text alignment strategies based on LLMs or traditional NLP methods for skill extraction. Our findings highlight the promise of LLMs in analyzing brief and abstract curriculum documents, but also reveal that their performance can vary significantly depending on model selection and prompting strategies.
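For contrast with LLM-based mapping, the sketch below is the kind of traditional-NLP alignment baseline such evaluations include: TF-IDF cosine similarity between a course description and candidate skills. The course text and skill list are made-up examples.

```python
# Baseline skill-to-course alignment via TF-IDF similarity; toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

course = "Students design experiments, analyze data in Python, and present findings."
skills = ["data analysis", "public speaking", "experimental design", "accounting"]

vec = TfidfVectorizer().fit([course] + skills)
sims = cosine_similarity(vec.transform([course]), vec.transform(skills))[0]

for skill, s in sorted(zip(skills, sims), key=lambda t: -t[1]):
    print(f"{skill:20s} {s:.2f}")
```

Lexical methods like this only credit exact token overlap ("analyze" never matches "analysis" here), which is precisely the gap LLM-based alignment strategies aim to close.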
arXiv Detail & Related papers (2025-05-05T02:46:23Z)
- EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework [9.76455227840645]
Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios.
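The skeleton below conveys the shape of a simulated teacher-student exchange of the kind such a framework orchestrates; the stub agents are assumptions and do not reflect EducationQ's actual implementation.

```python
# Skeleton of a multi-agent teaching dialogue; agents are stubs.
from dataclasses import dataclass, field

@dataclass
class Agent:
    role: str
    history: list = field(default_factory=list)

    def respond(self, message: str) -> str:
        # A real framework would call an LLM conditioned on self.history.
        self.history.append(message)
        return f"[{self.role}] reply to: {message!r}"

def run_dialogue(teacher: Agent, student: Agent, question: str, turns: int = 3):
    msg = question
    for _ in range(turns):
        msg = teacher.respond(msg)   # teacher probes, hints, or explains
        msg = student.respond(msg)   # student attempts an answer
    return teacher.history, student.history

t_hist, s_hist = run_dialogue(Agent("teacher"), Agent("student"),
                              "Why does ice float on water?")
print(len(t_hist), "teacher turns;", len(s_hist), "student turns")
# An evaluator agent would then score the teacher's questioning strategy.
```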
arXiv Detail & Related papers (2025-04-21T07:48:20Z)
- An Overview of Large Language Models for Statisticians [109.38601458831545]
Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI). This paper explores potential areas where statisticians can make important contributions to the development of LLMs. We focus on issues such as uncertainty quantification, interpretability, fairness, privacy, watermarking, and model adaptation.
arXiv Detail & Related papers (2025-02-25T03:40:36Z)
- SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [118.8024915014751]
Large language models (LLMs) have demonstrated remarkable proficiency in academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. We present SuperGPQA, a benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines.
arXiv Detail & Related papers (2025-02-20T17:05:58Z)
- A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models [0.0]
We propose a comprehensive approach to benchmark development based on rigorous psychometric principles.
We make the first attempt to illustrate this approach by creating a new benchmark in the field of pedagogy and education.
We construct a novel benchmark guided by Bloom's taxonomy and rigorously designed by a consortium of education experts trained in test development.
arXiv Detail & Related papers (2024-10-29T19:32:43Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. We show that ReasonEval consistently outperforms baseline methods on the meta-evaluation datasets. We observe that ReasonEval can play a significant role in data selection.
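To make "quality of reasoning steps" concrete, the toy sketch below scores each step separately and then aggregates; the keyword heuristics stand in for ReasonEval's learned scorers, and min-aggregation is one plausible choice rather than the paper's exact definition.

```python
# Toy step-level reasoning evaluation; heuristics are placeholders.
def score_step(step: str) -> dict:
    valid = 0.0 if "2 + 2 = 5" in step else 1.0               # stub validity
    redundant = 1.0 if step.lower().startswith("in other words") else 0.0
    return {"validity": valid, "redundancy": redundant}

solution = [
    "Let x be the number of apples, so 2x = 8.",
    "In other words, twice x equals eight.",
    "Therefore x = 4.",
]
scores = [score_step(s) for s in solution]
# A chain is only as valid as its weakest step.
print("solution validity:", min(s["validity"] for s in scores))
print("redundant steps:", sum(s["redundancy"] for s in scores))
```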
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution [48.86322922826514]
This paper defines a new task of Knowledge-aware Language Model Attribution (KaLMA).
First, we extend the attribution source from unstructured texts to a Knowledge Graph (KG), whose rich structure benefits both attribution performance and working scenarios.
Second, we propose a new "Conscious Incompetence" setting that accounts for an incomplete knowledge repository.
Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text-citation alignment.
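A hedged sketch of how such a composite metric might combine its three components follows; the sub-scorers and equal weights are placeholders, not the paper's definitions.

```python
# Illustrative composite attribution metric; weights are assumptions.
def composite_score(text_quality: float, citation_quality: float,
                    alignment: float, weights=(1/3, 1/3, 1/3)) -> float:
    """Combine three sub-scores (each in [0, 1]) into a single number."""
    parts = (text_quality, citation_quality, alignment)
    return sum(w * p for w, p in zip(weights, parts))

# Fluent answer, mostly correct citations, weaker text-citation alignment:
print(round(composite_score(0.9, 0.8, 0.6), 3))   # 0.767
```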
arXiv Detail & Related papers (2023-10-09T11:45:59Z)
- Post Hoc Explanations of Language Models Can Improve Language Models [43.2109029463221]
We present a novel framework, Amplifying Model Performance by Leveraging In-Context Learning with Post Hoc Explanations (AMPLIFY).
We leverage post hoc explanation methods which output attribution scores (explanations) capturing the influence of each of the input features on model predictions.
Our framework, AMPLIFY, leads to prediction accuracy improvements of about 10-25% over a wide range of tasks.
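The sketch below follows the recipe as the abstract describes it: obtain per-token attribution scores, keep the top-k tokens, and embed them as a rationale in an in-context example. The stub explainer and the "Key words" format are assumptions; AMPLIFY uses real post hoc explanation methods.

```python
# Rough sketch of attribution-augmented in-context examples; stub explainer.
def attribution_scores(tokens: list[str]) -> list[float]:
    # Placeholder: pretend longer tokens matter more. A real pipeline
    # would score tokens with a post hoc explainer over the model.
    return [len(t) / 10 for t in tokens]

def make_rationale_example(text: str, label: str, k: int = 2) -> str:
    tokens = text.split()
    top = sorted(zip(tokens, attribution_scores(tokens)),
                 key=lambda t: -t[1])[:k]
    keywords = ", ".join(t for t, _ in top)
    return (f"Input: {text}\n"
            f"Key words: {keywords}\n"     # the corrective rationale
            f"Label: {label}")

print(make_rationale_example("The movie was surprisingly wonderful", "positive"))
```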
arXiv Detail & Related papers (2023-05-19T04:46:04Z)
- Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LLMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, author prestige, and institutional prestige.
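For intuition, the minimal sketch below shows the kind of group-disparity computation such a study runs for one protected attribute; the records are fabricated toy rows, not ICLR data.

```python
# Toy acceptance-rate disparity by group; data is fabricated.
from collections import defaultdict

reviews = [
    {"group": "A", "accepted": True},
    {"group": "A", "accepted": False},
    {"group": "B", "accepted": True},
    {"group": "B", "accepted": True},
]

totals, accepts = defaultdict(int), defaultdict(int)
for r in reviews:
    totals[r["group"]] += 1
    accepts[r["group"]] += r["accepted"]

rates = {g: accepts[g] / totals[g] for g in totals}
print(rates)                                     # {'A': 0.5, 'B': 1.0}
print("disparity:", max(rates.values()) - min(rates.values()))
```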
arXiv Detail & Related papers (2022-11-07T16:19:42Z)