VerAs: Verify then Assess STEM Lab Reports
- URL: http://arxiv.org/abs/2402.05224v2
- Date: Thu, 25 Apr 2024 16:16:36 GMT
- Title: VerAs: Verify then Assess STEM Lab Reports
- Authors: Berk Atil, Mahsa Sheikhi Karizaki, Rebecca J. Passonneau
- Abstract summary: A dataset of two sets of college-level lab reports from an inquiry-based physics curriculum relies on analytic assessment rubrics.
Each analytic dimension is assessed on a 6-point scale, to provide detailed feedback to students that can help them improve their science writing skills.
We present an end-to-end neural architecture that has separate verifier and assessment modules, inspired by approaches to Open Domain Question Answering (OpenQA).
- Score: 2.4169078025984825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With an increasing focus in STEM education on critical thinking skills, science writing plays an ever more important role in curricula that stress inquiry skills. A recently published dataset of two sets of college-level lab reports from an inquiry-based physics curriculum relies on analytic assessment rubrics that utilize multiple dimensions, specifying subject matter knowledge and general components of good explanations. Each analytic dimension is assessed on a 6-point scale, to provide detailed feedback to students that can help them improve their science writing skills. Manual assessment can be slow and difficult to calibrate for consistency across all students in large classes. While much work exists on automated assessment of open-ended questions in STEM subjects, there has been far less work on long-form writing such as lab reports. We present an end-to-end neural architecture that has separate verifier and assessment modules, inspired by approaches to Open Domain Question Answering (OpenQA). VerAs first verifies whether a report contains any content relevant to a given rubric dimension, and if so, assesses the relevant sentences. On the lab reports, VerAs outperforms multiple baselines based on OpenQA systems or Automated Essay Scoring (AES). VerAs also performs well on an analytic rubric for middle school physics essays.
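The abstract describes a two-stage pipeline: a verifier that filters report sentences for relevance to a given rubric dimension, followed by an assessment module that scores only the retained evidence on the 6-point scale. The sketch below illustrates that control flow only and is not the paper's implementation; the embed function, the similarity threshold, and the heuristic scoring rule are hypothetical stand-ins for VerAs's learned neural modules.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical sentence encoder; a stand-in for a learned model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

def verify(sentences, rubric_dimension, threshold=0.1):
    """Verifier stage: keep sentences whose similarity to the rubric
    dimension exceeds a threshold (threshold value is an assumption)."""
    rubric_vec = embed(rubric_dimension)
    kept = []
    for s in sentences:
        v = embed(s)
        sim = float(v @ rubric_vec / (np.linalg.norm(v) * np.linalg.norm(rubric_vec)))
        if sim > threshold:
            kept.append((s, sim))
    return kept

def assess(relevant):
    """Assessment stage: map the retained evidence to a 6-point analytic score.
    A real system would use a trained scorer; this heuristic is illustrative only,
    and the 1-6 range is an assumption."""
    if not relevant:
        return 1  # verifier found no relevant content for this dimension
    mean_sim = np.mean([sim for _, sim in relevant])
    return int(np.clip(round(1 + 5 * mean_sim), 1, 6))

report = [
    "We measured the period of the pendulum for five string lengths.",
    "The data support the claim that period grows with the square root of length.",
    "Lunch was at noon.",
]
dimension = "Does the report state a claim supported by the collected data?"
evidence = verify(report, dimension)
print("retained:", [s for s, _ in evidence])
print("score:", assess(evidence))
```

In VerAs itself both stages are trained neural components; the point of the sketch is simply that assessment operates on verified evidence rather than on the full report.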
Related papers
- Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z) - What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation [57.550045763103334]
Evaluating a story can be more challenging than other generation evaluation tasks.
We first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual.
We propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation.
arXiv Detail & Related papers (2024-08-26T20:35:42Z) - SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers [43.18330795060871]
SPIQA is a dataset specifically designed for interpreting complex figures and tables within the context of scientific research articles.
We employ automatic and manual curation to create the dataset.
SPIQA comprises 270K questions divided into training, validation, and three different evaluation splits.
arXiv Detail & Related papers (2024-07-12T16:37:59Z) - SyllabusQA: A Course Logistics Question Answering Dataset [45.90423821963144]
We introduce SyllabusQA, an open-source dataset with 63 real course syllabi covering 36 majors, containing 5,078 open-ended course logistics-related question-answer pairs.
We benchmark several strong baselines on this task, from large language model prompting to retrieval-augmented generation.
We find that, despite performing close to humans on traditional metrics of textual similarity, automated approaches still fall significantly short of humans in terms of fact precision.
arXiv Detail & Related papers (2024-03-03T03:01:14Z) - A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [58.6354685593418]
This paper proposes several article-level, field-normalized, and large language model-empowered bibliometric indicators to evaluate reviews.
The newly emerging AI-generated literature reviews are also appraised.
This work offers insights into the current challenges of literature reviews and envisions future directions for their development.
arXiv Detail & Related papers (2024-02-20T11:28:50Z) - SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark [42.91902601376494]
The paper introduces SceMQA, a novel benchmark for scientific multimodal question answering at the college entrance level.
SceMQA focuses on core science subjects including Mathematics, Physics, Chemistry, and Biology.
It features a blend of multiple-choice and free-response formats, ensuring a comprehensive evaluation of AI models' abilities.
arXiv Detail & Related papers (2024-02-06T19:16:55Z) - Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z) - A Systematic Literature Review of Empiricism and Norms of Reporting in Computing Education Research Literature [4.339510167603376]
The goal of this study is to characterize the reporting of empiricism in Computing Education Research (CER) literature.
We conducted an SLR of 427 papers published during 2014 and 2015 in five CER venues.
Over 80% of papers had some form of empirical evaluation.
arXiv Detail & Related papers (2021-07-02T16:37:29Z) - YAPS -- Your Open Examination System for Activating and emPowering Students [0.0]
We discuss design decisions and present the resulting architecture of YAPS - Your open Assessment system for emPowering Students.
YAPS has been used for exams in very diverse lectures, including logistics, computer engineering, and algorithms, and also for empowering students through fast feedback during the learning period.
arXiv Detail & Related papers (2021-04-27T09:52:43Z) - Get It Scored Using AutoSAS -- An Automated System for Scoring Short Answers [63.835172924290326]
We present a fast, scalable, and accurate approach towards automated Short Answer Scoring (SAS).
We propose and explain the design and development of a system for SAS, namely AutoSAS.
AutoSAS shows state-of-the-art performance, improving results by over 8% on some question prompts.
arXiv Detail & Related papers (2020-12-21T10:47:30Z) - ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)