TrueGradeAI: Retrieval-Augmented and Bias-Resistant AI for Transparent and Explainable Digital Assessments
- URL: http://arxiv.org/abs/2509.22516v1
- Date: Fri, 26 Sep 2025 16:00:36 GMT
- Title: TrueGradeAI: Retrieval-Augmented and Bias-Resistant AI for Transparent and Explainable Digital Assessments
- Authors: Rakesh Thakur, Shivaansh Kaushik, Gauri Chopra, Harsh Rohilla,
- Abstract summary: TrueGradeAI is an AI-driven digital examination framework designed to overcome the shortcomings of traditional paper-based assessments.<n>The system preserves natural handwriting capturing by stylus input on secure tablets and applying transformer-based optical character recognition for transcription.<n> evaluation is conducted through a retrieval-augmented pipeline that integrates faculty solutions, cache layers, and external references.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces TrueGradeAI, an AI-driven digital examination framework designed to overcome the shortcomings of traditional paper-based assessments, including excessive paper usage, logistical complexity, grading delays, and evaluator bias. The system preserves natural handwriting by capturing stylus input on secure tablets and applying transformer-based optical character recognition for transcription. Evaluation is conducted through a retrieval-augmented pipeline that integrates faculty solutions, cache layers, and external references, enabling a large language model to assign scores with explicit, evidence-linked reasoning. Unlike prior tablet-based exam systems that primarily digitize responses, TrueGradeAI advances the field by incorporating explainable automation, bias mitigation, and auditable grading trails. By uniting handwriting preservation with scalable and transparent evaluation, the framework reduces environmental costs, accelerates feedback cycles, and progressively builds a reusable knowledge base, while actively working to mitigate grading bias and ensure fairness in assessment.
Related papers
- Evaluating AI Grading on Real-World Handwritten College Mathematics: A Large-Scale Study Toward a Benchmark [9.922581736690159]
We present a large-scale empirical study of AI grading on real, handwritten calculus work from UC Irvine.<n>Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions.<n>In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review.
arXiv Detail & Related papers (2026-03-01T03:32:51Z) - ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review [48.60540055009675]
ScholarPeer is a search-enabled multi-agent framework designed to emulate the cognitive processes of a senior researcher.<n>We evaluate ScholarPeer on DeepReview-13K and the results demonstrate that ScholarPeer achieves significant win-rates against state-of-the-art approaches in side-by-side evaluations.
arXiv Detail & Related papers (2026-01-30T06:54:55Z) - From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants [0.0]
Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off.<n>This paper presents an end-to-end framework for constructing and evaluating a conversational AI assistant directly from historical call transcripts.
arXiv Detail & Related papers (2026-01-26T07:44:47Z) - PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review [54.141490756509306]
We introduce PaperAudit-Bench, which consists of two components: PaperAudit-Dataset, an error dataset, and PaperAudit-Review, an automated review framework.<n>Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths.<n>We show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.
arXiv Detail & Related papers (2026-01-07T04:26:12Z) - Beyond Static Scoring: Enhancing Assessment Validity via AI-Generated Interactive Verification [0.4260312058817663]
Large Language Models (LLMs) challenge the validity of traditional open-ended assessments by blurring the lines of authorship.<n>This paper introduces a novel Human-AI Collaboration framework that enhances assessment integrity by combining rubric-based automated scoring with AI-generated, targeted follow-up questions.
arXiv Detail & Related papers (2025-12-14T08:13:53Z) - Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework [55.078301794183496]
We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic.<n>This involves evaluating the internal consistency between a paper's results, interpretations, and claims.<n>We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions.
arXiv Detail & Related papers (2025-08-29T08:48:00Z) - Reflective Paper-to-Code Reproduction Enabled by Fine-Grained Verification [46.845133190560375]
Motivated by how humans use systematic checklists to efficiently debug complex code, we propose textbfRePro, a textbfReflective Paper-to-Code textbfReproduction framework.<n>It automatically extracts a paper's fingerprint, referring to a comprehensive set of accurate and atomic criteria serving as high-quality supervisory signals.<n>It achieves 13.0% performance gap over baselines, and it correctly revises complex logical and mathematical criteria in reflecting.
arXiv Detail & Related papers (2025-08-21T06:57:44Z) - Teaching at Scale: Leveraging AI to Evaluate and Elevate Engineering Education [3.557803321422781]
This article presents a scalable, AI-supported framework for qualitative student feedback using large language models.<n>The system employs hierarchical summarization, anonymization, and exception handling to extract actionable themes from open-ended comments.<n>We report on its successful deployment across a large college of engineering.
arXiv Detail & Related papers (2025-08-01T20:27:40Z) - Ratas framework: A comprehensive genai-based approach to rubric-based marking of real-world textual exams [3.4132239125074206]
RATAS (Rubric Automated Tree-based Answer Scoring) is a novel framework that leverages state-of-the-art generative AI models for rubric-based grading of textual responses.<n> RATAS is designed to support a wide range of grading rubrics, enable subject-agnostic evaluation, and generate structured, explainable rationales for assigned scores.
arXiv Detail & Related papers (2025-05-27T22:17:27Z) - Learning Interpretable Representations Leads to Semantically Faithful EEG-to-Text Generation [52.51005875755718]
We focus on EEG-to-text decoding and address its hallucination issue through the lens of posterior collapse.<n>Acknowledging the underlying mismatch in information capacity between EEG and text, we reframe the decoding task as semantic summarization of core meanings.<n>Experiments on the public ZuCo dataset demonstrate that GLIM consistently generates fluent, EEG-grounded sentences.
arXiv Detail & Related papers (2025-05-21T05:29:55Z) - AutoRev: Automatic Peer Review System for Academic Research Papers [9.269282930029856]
AutoRev is an Automatic Peer Review System for Academic Research Papers.<n>Our framework represents an academic document as a graph, enabling the extraction of the most critical passages.<n>When applied to review generation, our method outperforms SOTA baselines by an average of 58.72%.
arXiv Detail & Related papers (2025-05-20T13:59:58Z) - On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI (systems that autonomously plan and act) are becoming widespread, yet their task success rate on complex tasks remains low.<n>Inference-time alignment relies on three components: sampling, evaluation, and feedback.<n>We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - Auto-Prompt Generation is Not Robust: Prompt Optimization Driven by Pseudo Gradient [50.15090865963094]
We introduce PertBench, a comprehensive benchmark dataset that includes a wide range of input perturbations.<n>Our analysis reveals substantial vulnerabilities in existing prompt generation strategies.<n>We propose PGO, a gradient-free prompt generation framework that leverages perturbation types as pseudo-gradient signals.
arXiv Detail & Related papers (2024-12-24T06:05:08Z) - End-to-End Page-Level Assessment of Handwritten Text Recognition [69.55992406968495]
HTR systems increasingly face the end-to-end page-level transcription of a document.
Standard metrics do not take into account the inconsistencies that might appear.
We propose a two-fold evaluation, where the transcription accuracy and the RO goodness are considered separately.
arXiv Detail & Related papers (2023-01-14T15:43:07Z) - Toward Educator-focused Automated Scoring Systems for Reading and
Writing [0.0]
This paper addresses the challenges of data and label availability, authentic and extended writing, domain scoring, prompt and source variety, and transfer learning.
It employs techniques that preserve essay length as an important feature without increasing model training costs.
arXiv Detail & Related papers (2021-12-22T15:44:30Z) - Hierarchical Bi-Directional Self-Attention Networks for Paper Review
Rating Recommendation [81.55533657694016]
We propose a Hierarchical bi-directional self-attention Network framework (HabNet) for paper review rating prediction and recommendation.
Specifically, we leverage the hierarchical structure of the paper reviews with three levels of encoders: sentence encoder (level one), intra-review encoder (level two) and inter-review encoder (level three)
We are able to identify useful predictors to make the final acceptance decision, as well as to help discover the inconsistency between numerical review ratings and text sentiment conveyed by reviewers.
arXiv Detail & Related papers (2020-11-02T08:07:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.