SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling
- URL: http://arxiv.org/abs/2510.26322v1
- Date: Thu, 30 Oct 2025 10:17:05 GMT
- Title: SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling
- Authors: Fares Fawzi, Vinitra Swamy, Dominik Glandorf, Tanya Nazaretsky, Tanja Käser
- Abstract summary: SCRIBE is a framework for multi-hop, tool-augmented reasoning to generate valid responses to student questions about feedback reports. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models can be used to provide interactive, personalized student feedback in educational settings. However, real-world deployment faces three key challenges: privacy concerns, limited computational resources, and the need for pedagogically valid responses. These constraints require small, open-source models that can run locally and reliably ground their outputs in correct information. We introduce SCRIBE, a framework for multi-hop, tool-augmented reasoning designed to generate valid responses to student questions about feedback reports. SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models in key dimensions such as relevance and actionability, while being perceived on par with GPT-4o and Llama-3.3 70B by students. These findings demonstrate the viability of SCRIBE for low-resource, privacy-sensitive educational applications.
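The abstract describes an iterative pipeline that interleaves reasoning, domain-tool calls, and error recovery. Below is a minimal, hypothetical sketch of such a loop; all names (`call_model`, `lookup_report`, `run_pipeline`) and the stubbed tool data are illustrative placeholders, not the paper's actual API or implementation.

```python
# Hypothetical sketch of a multi-hop, tool-augmented reasoning loop with
# simple error recovery, in the spirit of the SCRIBE pipeline described above.

def lookup_report(section: str) -> str:
    """Stub domain tool: fetch one section of a student feedback report."""
    report = {"grade": "B+", "weak_topics": "recursion, pointers"}
    return report.get(section, "section not found")

TOOLS = {"lookup_report": lookup_report}

def call_model(context: list) -> dict:
    """Stub standing in for the fine-tuned model (hypothetical behaviour):
    first fetch evidence via a tool, then answer from the tool result."""
    if len(context) == 1:  # no tool result yet: request one
        return {"type": "tool_call", "name": "lookup_report",
                "args": {"section": "weak_topics"}}
    return {"type": "answer",
            "text": f"Focus your revision on: {context[-1]}"}

def run_pipeline(question: str, max_steps: int = 4) -> str:
    """Iterate: reason -> (optionally) call a tool -> reflect on the result."""
    context = [question]
    for _ in range(max_steps):
        step = call_model(context)
        if step["type"] == "tool_call":
            tool = TOOLS.get(step["name"])
            if tool is None:  # error recovery: surface the failure and retry
                context.append(f"error: unknown tool {step['name']}")
                continue
            context.append(tool(**step["args"]))  # ground the answer in tool output
        else:
            return step["text"]
    return "unable to answer within the step budget"
```

In the actual system the stubbed `call_model` would be the distilled 3B/8B model and the tool set would cover the full feedback report; this sketch only illustrates the reason/tool/recover control flow.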
Related papers
- Reconstructing Item Characteristic Curves using Fine-Tuned Large Language Models [0.0]
This study introduces a novel approach that implicitly models Item Response Theory (IRT) psychometric properties. We train models to generate responses to multiple choice questions conditioned on discrete ability descriptors. We reconstruct the probability of a correct response as a function of student ability, effectively generating synthetic Item Characteristic Curves (ICCs) to estimate IRT parameters.
arXiv Detail & Related papers (2026-01-05T22:11:41Z) - One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning [54.580646706013965]
Reward models (RMs) play a critical role in aligning large language models with human preferences. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling.
arXiv Detail & Related papers (2025-10-30T06:08:27Z) - Teaching Language Models to Reason with Tools [73.21700643314917]
We present Hint-Engineering, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model.
arXiv Detail & Related papers (2025-10-23T08:41:44Z) - Towards an Efficient, Customizable, and Accessible AI Tutor [5.225254533678075]
We propose an offline Retrieval-Augmented Generation (RAG) pipeline that pairs a small language model (SLM) with a robust retrieval mechanism. We evaluate the efficacy of this pipeline using domain-specific educational content, focusing on biology coursework.
arXiv Detail & Related papers (2025-10-04T13:33:40Z) - TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models [10.963195858672627]
TutorBench is a dataset and evaluation benchmark designed to rigorously evaluate the core tutoring skills of large language models (LLMs). Samples are drawn from three common tutoring tasks: (i) generating adaptive explanations tailored to a student's confusion, (ii) providing actionable feedback on a student's work, and (iii) promoting active learning through effective hint generation. We evaluate 16 frontier LLMs on TutorBench and present a detailed analysis of their performance and behavior.
arXiv Detail & Related papers (2025-10-03T01:41:09Z) - SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models [79.01078135582127]
SPELL enables scalable, label-free optimization for long-context reasoning. We introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model's evolving capabilities.
arXiv Detail & Related papers (2025-09-28T13:08:10Z) - Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools [42.84219003918423]
This work demonstrates that smaller, specialised language models, enhanced via Supervised Fine-Tuning (SFT), present a more viable alternative for educational tools. We utilise a new dataset of 40,000 C compiler error explanations, derived from real introductory programming (CS1/2) student-generated programming errors. Our results show that SFT significantly boosts the pedagogical quality of smaller models, achieving performance comparable to much larger models.
arXiv Detail & Related papers (2025-07-07T08:03:49Z) - Pushing the boundary on Natural Language Inference [49.15148871877941]
Natural Language Inference (NLI) is a central task in natural language understanding with applications in fact-checking, question answering and information retrieval. Despite its importance, current NLI systems rely heavily on datasets containing artifacts and biases, limiting inference quality and real-world applicability. This work provides a framework for building robust NLI systems without sacrificing quality or real-world applicability.
arXiv Detail & Related papers (2025-04-25T14:20:57Z) - Prompt-Based Cost-Effective Evaluation and Operation of ChatGPT as a Computer Programming Teaching Assistant [0.0]
This article focuses on studying three aspects related to such an application. The performance of two well-known models, GPT-3.5T and GPT-4T, in providing feedback to students is evaluated.
arXiv Detail & Related papers (2025-01-24T08:15:05Z) - SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z) - CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems. Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs). We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.