Related papers: Analyzing Examinee Comments using DistilBERT and Machine Learning to Ensure Quality Control in Exam Content

Related papers

PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review [54.141490756509306]
We introduce PaperAudit-Bench, which consists of two components: PaperAudit-Dataset, an error dataset, and PaperAudit-Review, an automated review framework. Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths. We show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.
arXiv Detail & Related papers (2026-01-07T04:26:12Z)
Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback [81.0031690510116]
We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages. Our method is informed by a large-scale analysis of human-written novelty reviews. Evaluated on 182 ICLR 2025 submissions, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions.
arXiv Detail & Related papers (2025-08-14T16:18:37Z)
The Role of Review Process Failures in Affective State Estimation: An Empirical Investigation of DEAP Dataset [0.45080838507508303]
We reviewed 101 studies, focusing on the widely used DEAP dataset for emotion recognition. We found that nearly 87% of the reviewed papers contained one or more of these errors. These findings reveal fundamental gaps in standardized evaluation practices and highlight critical deficiencies in the peer review process for machine learning applications in neuroscience.
arXiv Detail & Related papers (2025-08-04T13:40:25Z)
TestAgent: An Adaptive and Intelligent Expert for Human Assessment [62.060118490577366]
We propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement. TestAgent supports personalized question selection, captures test-takers' responses and anomalies, and provides precise outcomes through dynamic, conversational interactions.
arXiv Detail & Related papers (2025-06-03T16:07:54Z)
Evaluating LLM-Generated Q&A Test: a Student-Centered Study [0.06749750044497731]
We automatically generated a GPT-4o-mini-based Q&A test for a Natural Language Processing course and evaluated its psychometric and perceived-quality metrics with students and experts. A mixed-format IRT analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality.
arXiv Detail & Related papers (2025-05-10T10:47:23Z)
Should you use LLMs to simulate opinions? Quality checks for early-stage deliberation [5.4260854377259315]
We introduce a quality control assessment designed to test the viability of LLM-simulated opinions on Likert-scale tasks. This assessment comprises two key tests: logical consistency and alignment with stakeholder expectations. None of the models or methods pass the full assessment, revealing several failure modes.
arXiv Detail & Related papers (2025-04-11T20:16:02Z)
QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation [13.202947148434333]
We introduce test item analysis, a method frequently used to assess test question quality, into QG evaluation. We construct pairs of candidate questions that differ in quality across dimensions such as topic coverage, item difficulty, item discrimination, and distractor efficiency. We propose a novel QG evaluation framework, QG-SMS, which leverages a Large Language Model for Student Modeling and Simulation.
arXiv Detail & Related papers (2025-03-07T19:21:59Z)
Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models. We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z)
Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning [59.25951947621526]
We propose an approach that can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances the verification accuracy.
arXiv Detail & Related papers (2025-02-19T15:32:11Z)
HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF). In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
NLP and Education: using semantic similarity to evaluate filled gaps in a large-scale Cloze test in the classroom [0.0]
Using data from Cloze tests administered to students in Brazil, word embedding (WE) models for Brazilian Portuguese (PT-BR) were employed to measure semantic similarity. A comparative analysis between the WE models' scores and the judges' evaluations revealed that GloVe was the most effective model.
arXiv Detail & Related papers (2024-11-02T15:22:26Z)
Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures. We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [29.81362106367831]
Existing evaluation methods often suffer from high costs, limited test formats, the need for human references, and systematic evaluation biases. In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluators automatically based on their inherent traits. Experimental results indicate that Auto-PRE achieves state-of-the-art performance at a lower cost.
arXiv Detail & Related papers (2024-10-16T06:06:06Z)
Zero-shot Generative Large Language Models for Systematic Review Screening Automation [55.403958106416574]
This study investigates the effectiveness of using zero-shot large language models for automatic screening. We evaluate the effectiveness of eight different LLMs and investigate a calibration technique that uses a predefined recall threshold.
arXiv Detail & Related papers (2024-01-12T01:54:08Z)
Measuring Software Testability via Automatically Generated Test Cases [8.17364116624769]
We propose a new approach to pursuing testability measurements based on software metrics. Our approach exploits automatic test generation and mutation analysis to quantify the evidence about the relative hardness of developing effective test cases.
arXiv Detail & Related papers (2023-07-30T09:48:51Z)
From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time. We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.