SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling
- URL: http://arxiv.org/abs/2510.26322v1
- Date: Thu, 30 Oct 2025 10:17:05 GMT
- Title: SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling
- Authors: Fares Fawzi, Vinitra Swamy, Dominik Glandorf, Tanya Nazaretsky, Tanja Käser
- Abstract summary: SCRIBE is a framework for multi-hop, tool-augmented reasoning to generate valid responses to student questions about feedback reports. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models can be used to provide interactive, personalized student feedback in educational settings. However, real-world deployment faces three key challenges: privacy concerns, limited computational resources, and the need for pedagogically valid responses. These constraints require small, open-source models that can run locally and reliably ground their outputs in correct information. We introduce SCRIBE, a framework for multi-hop, tool-augmented reasoning designed to generate valid responses to student questions about feedback reports. SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models in key dimensions such as relevance and actionability, while being perceived on par with GPT-4o and Llama-3.3 70B by students. These findings demonstrate the viability of SCRIBE for low-resource, privacy-sensitive educational applications.
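The abstract describes an iterative pipeline that interleaves reasoning, domain-tool calls, and error recovery. Below is a minimal, hypothetical sketch of such a loop; all names (`call_model`, `lookup_report`, `run_pipeline`) and the stubbed tool data are illustrative placeholders, not the paper's actual API or implementation.

```python
# Hypothetical sketch of a multi-hop, tool-augmented reasoning loop with
# simple error recovery, in the spirit of the SCRIBE pipeline described above.

def lookup_report(section: str) -> str:
    """Stub domain tool: fetch one section of a student feedback report."""
    report = {"grade": "B+", "weak_topics": "recursion, pointers"}
    return report.get(section, "section not found")

TOOLS = {"lookup_report": lookup_report}

def call_model(context: list) -> dict:
    """Stub standing in for the fine-tuned model (hypothetical behaviour):
    first fetch evidence via a tool, then answer from the tool result."""
    if len(context) == 1:  # no tool result yet: request one
        return {"type": "tool_call", "name": "lookup_report",
                "args": {"section": "weak_topics"}}
    return {"type": "answer",
            "text": f"Focus your revision on: {context[-1]}"}

def run_pipeline(question: str, max_steps: int = 4) -> str:
    """Iterate: reason -> (optionally) call a tool -> reflect on the result."""
    context = [question]
    for _ in range(max_steps):
        step = call_model(context)
        if step["type"] == "tool_call":
            tool = TOOLS.get(step["name"])
            if tool is None:  # error recovery: surface the failure and retry
                context.append(f"error: unknown tool {step['name']}")
                continue
            context.append(tool(**step["args"]))  # ground the answer in tool output
        else:
            return step["text"]
    return "unable to answer within the step budget"
```

In the actual system the stubbed `call_model` would be the distilled 3B/8B model and the tool set would cover the full feedback report; this sketch only illustrates the reason/tool/recover control flow.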
Related papers
- Reconstructing Item Characteristic Curves using Fine-Tuned Large Language Models [0.0]
This study introduces a novel approach that implicitly models Item Response Theory (IRT) psychometric properties. We train models to generate responses to multiple choice questions conditioned on discrete ability descriptors. We reconstruct the probability of a correct response as a function of student ability, effectively generating synthetic Item Characteristic Curves (ICCs) to estimate IRT parameters.
arXiv Detail & Related papers (2026-01-05T22:11:41Z) - One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning [54.580646706013965]
Reward models (RMs) play a critical role in aligning large language models with human preferences. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling.
arXiv Detail & Related papers (2025-10-30T06:08:27Z) - Teaching Language Models to Reason with Tools [73.21700643314917]
We present Hint-Engineering, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model.
arXiv Detail & Related papers (2025-10-23T08:41:44Z) - Towards an Efficient, Customizable, and Accessible AI Tutor [5.225254533678075]
We propose an offline Retrieval-Augmented Generation (RAG) pipeline that pairs a small language model (SLM) with a robust retrieval mechanism. We evaluate the efficacy of this pipeline using domain-specific educational content, focusing on biology coursework.
arXiv Detail & Related papers (2025-10-04T13:33:40Z) - TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models [10.963195858672627]
TutorBench is a dataset and evaluation benchmark designed to rigorously evaluate the core tutoring skills of large language models (LLMs). Samples are drawn from three common tutoring tasks: (i) generating adaptive explanations tailored to a student's confusion, (ii) providing actionable feedback on a student's work, and (iii) promoting active learning through effective hint generation. We evaluate 16 frontier LLMs on TutorBench and present a detailed analysis of their performance and behavior.
arXiv Detail & Related papers (2025-10-03T01:41:09Z) - SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models [79.01078135582127]
SPELL enables scalable, label-free optimization for long-context reasoning. We introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model's evolving capabilities.
arXiv Detail & Related papers (2025-09-28T13:08:10Z) - Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools [42.84219003918423]
This work demonstrates that smaller, specialised language models, enhanced via Supervised Fine-Tuning (SFT), present a more viable alternative for educational tools. We utilise a new dataset of 40,000 C compiler error explanations, derived from real introductory programming (CS1/2) student-generated programming errors. Our results show that SFT significantly boosts the pedagogical quality of smaller models, achieving performance comparable to much larger models.
arXiv Detail & Related papers (2025-07-07T08:03:49Z) - Pushing the boundary on Natural Language Inference [49.15148871877941]
Natural Language Inference (NLI) is a central task in natural language understanding with applications in fact-checking, question answering and information retrieval. Despite its importance, current NLI systems rely heavily on datasets containing artifacts and biases, limiting inference quality and real-world applicability. This work provides a framework for building robust NLI systems without sacrificing quality or real-world applicability.
arXiv Detail & Related papers (2025-04-25T14:20:57Z) - Prompt-Based Cost-Effective Evaluation and Operation of ChatGPT as a Computer Programming Teaching Assistant [0.0]
This article focuses on studying three aspects related to such an application. The performance of two well-known models, GPT-3.5T and GPT-4T, in providing feedback to students is evaluated.
arXiv Detail & Related papers (2025-01-24T08:15:05Z) - SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z) - CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems. Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs). We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.