Developing a Multi-Agent System to Generate Next Generation Science Assessments with Evidence-Centered Design
- URL: http://arxiv.org/abs/2602.18451v1
- Date: Tue, 03 Feb 2026 04:37:24 GMT
- Title: Developing a Multi-Agent System to Generate Next Generation Science Assessments with Evidence-Centered Design
- Authors: Yaxuan Yang, Jongchan Park, Yifan Zhou, Xiaoming Zhai,
- Abstract summary: The Next Generation Science Standards (NGSS) demand assessments to understand students' ability to use science knowledge to solve problems and design solutions.<n>To elicit such higher-order ability, educators need performance-based assessments, which are challenging to develop.<n>One solution that has been broadly adopted is Evidence-Centered Design (ECD), which emphasizes interconnected models of the learner, evidence, and tasks.
- Score: 9.558651359587358
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contemporary science education reforms such as the Next Generation Science Standards (NGSS) demand assessments to understand students' ability to use science knowledge to solve problems and design solutions. To elicit such higher-order ability, educators need performance-based assessments, which are challenging to develop. One solution that has been broadly adopted is Evidence-Centered Design (ECD), which emphasizes interconnected models of the learner, evidence, and tasks. Although ECD provides a framework to safeguard assessment validity, its implementation requires diverse expertise (e.g., content and assessment), which is both costly and labor-intensive. To address this challenge, this study proposed integrating the ECD framework into Multi-Agent Systems (MAS) to generate NGSS-aligned assessment items automatically. This integrated MAS system ensembles multiple large language models with varying expertise, enabling the automation of complex, multi-stage item generation workflows traditionally performed by human experts. We examined the quality of AI-generated NGSS-aligned items and compared them with human-developed items across multiple dimensions of assessment design. Results showed that AI-generated items have overall comparable quality to human-developed items in terms of alignment with NGSS three-dimensional standards and cognitive demands. Divergent patterns also emerged: AI-generated items demonstrated a distinct strength in inclusivity, while also exhibiting limitations in clarity, conciseness, and multimodal design. AI- and human-developed items both showed weaknesses in evidence collectability and student interest alignment. These findings suggest that integrating ECD into MAS can support scalable and standards-aligned assessment design, while human expertise remains essential.
Related papers
- Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models [4.155649113742267]
ReQUESTA is a hybrid, multi-agent framework for generating cognitively diverse multiple-choice questions (MCQs)<n>We evaluated the framework in a large-scale reading comprehension study using academic expository texts.<n>Results showed that ReQUESTA-generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance.
arXiv Detail & Related papers (2026-02-03T16:26:47Z) - EduAgentQG: A Multi-Agent Workflow Framework for Personalized Question Generation [56.43882334582494]
We propose EduAgentQG, a multi-agent collaborative framework for generating high-quality and diverse personalized questions.<n>The framework consists of five specialized agents and operates through an iterative feedback loop.<n>EduAgentQG outperforms existing single-agent and multi-agent methods in terms of question diversity, goal consistency, and overall quality.
arXiv Detail & Related papers (2025-11-08T12:25:31Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority.<n>We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z) - Rethinking Machine Unlearning in Image Generation Models [59.697750585491264]
CatIGMU is a novel hierarchical task categorization framework.<n>EvalIGMU is a comprehensive evaluation framework.<n>We construct DataIGM, a high-quality unlearning dataset.
arXiv Detail & Related papers (2025-06-03T11:25:14Z) - Creativity in LLM-based Multi-Agent Systems: A Survey [56.25583236738877]
Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts.<n>This is the first survey dedicated to creativity in MAS.<n>We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks.
arXiv Detail & Related papers (2025-05-27T12:36:14Z) - On the Evaluation of Engineering Artificial General Intelligence [5.802869598386355]
We discuss the challenges and propose a framework for evaluating engineering artificial general intelligence (eAGI) agents.<n>We consider eAGI as a specialization of artificial general intelligence (AGI)<n>eAGI agents should possess a unique blend of background knowledge (recall and retrieve) of facts and methods.
arXiv Detail & Related papers (2025-05-15T18:52:47Z) - Evaluating Machine Expertise: How Graduate Students Develop Frameworks for Assessing GenAI Content [1.967444231154626]
This paper examines how graduate students develop frameworks for evaluating machine-generated expertise in web-based interactions with large language models (LLMs)<n>Our findings reveal that students construct evaluation frameworks shaped by three main factors: professional identity, verification capabilities, and system navigation experience.
arXiv Detail & Related papers (2025-04-24T22:24:14Z) - Advancing Education through Tutoring Systems: A Systematic Literature Review [3.276010440333338]
This study systematically reviews the transformative role of Tutoring Systems, encompassing Intelligent Tutoring Systems (ITS) and Robot Tutoring Systems (RTS)<n>The findings reveal significant advancements in AI techniques that enhance adaptability, engagement, and learning outcomes.<n>The study highlights the complementary strengths of ITS and RTS, proposing integrated hybrid solutions to maximize educational benefits.
arXiv Detail & Related papers (2025-03-12T18:47:07Z) - Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z) - AtomAgents: Alloy design and discovery through physics-aware multi-modal multi-agent artificial intelligence [0.0]
The proposed physics-aware generative AI platform, AtomAgents, synergizes the intelligence of large language models (LLM)
Our results enable accurate prediction of key characteristics across alloys and highlight the crucial role of solid solution alloying to steer the development of advanced metallic alloys.
arXiv Detail & Related papers (2024-07-13T22:46:02Z) - OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [73.75520820608232]
We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities.<n>These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage.<n>Our evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
arXiv Detail & Related papers (2024-06-18T16:20:53Z) - Integration of cognitive tasks into artificial general intelligence test
for large models [54.72053150920186]
We advocate for a comprehensive framework of cognitive science-inspired artificial general intelligence (AGI) tests.
The cognitive science-inspired AGI tests encompass the full spectrum of intelligence facets, including crystallized intelligence, fluid intelligence, social intelligence, and embodied intelligence.
arXiv Detail & Related papers (2024-02-04T15:50:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.