NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors
- URL: http://arxiv.org/abs/2506.10627v1
- Date: Thu, 12 Jun 2025 12:11:56 GMT
- Title: NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors
- Authors: Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal
- Abstract summary: This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's reasoning. Our system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions.
- Score: 0.12499537119440242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at https://github.com/NaumanNaeem/BEA_2025.
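As a loose illustration of the final system's pipeline, the sketch below retrieves the most similar training examples by embedding similarity, assembles a structured few-shot prompt, and validates the LLM's JSON output against a fixed label schema. The embedding checkpoint, prompt wording, and helper names are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of retrieval-augmented few-shot prompting with
# schema-guided output parsing. Checkpoint, prompt wording, and the
# label set are illustrative assumptions, not the paper's exact setup.
import json
import numpy as np
from sentence_transformers import SentenceTransformer

LABELS = {"Yes", "To some extent", "No"}  # assumed Track 1 label set

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed retriever

def retrieve(query: str, pool: list[dict], k: int = 4) -> list[dict]:
    """Return the k training examples most similar to the query response."""
    q = embedder.encode([query], normalize_embeddings=True)
    e = embedder.encode([p["response"] for p in pool], normalize_embeddings=True)
    sims = (e @ q.T).ravel()
    return [pool[i] for i in np.argsort(-sims)[:k]]

def build_prompt(history: str, response: str, shots: list[dict]) -> str:
    """Assemble a structured few-shot prompt from retrieved examples."""
    demos = "\n\n".join(
        f"History: {s['history']}\nTutor: {s['response']}\nLabel: {s['label']}"
        for s in shots
    )
    return (
        "Decide whether the tutor's response identifies the student's mistake.\n"
        'Answer as JSON: {"label": "Yes" | "To some extent" | "No"}\n\n'
        f"{demos}\n\nHistory: {history}\nTutor: {response}\nLabel:"
    )

def parse_label(raw: str) -> str:
    """Schema-guided parsing: accept only a valid label, else fall back."""
    try:
        label = json.loads(raw).get("label", "")
    except (json.JSONDecodeError, AttributeError):
        label = ""
    return label if label in LABELS else "No"  # conservative fallback
```

A thin wrapper that sends build_prompt(...) to GPT-4o and feeds the raw completion through parse_label would complete the loop; the fallback keeps predictions well-formed even when the model drifts from the schema.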
Related papers
- BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses [0.7475784495279183]
We present our submission to the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. Our system is built on MPNet, a Transformer-based language model that combines the pre-training advantages of BERT and XLNet. Our approach achieved strong results on both tracks, with exact-match macro-F1 scores of approximately 0.7110 for Mistake Identification and 0.5543 for Mistake Location on the official test set.
arXiv Detail & Related papers (2025-06-02T15:57:49Z)
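As a hedged sketch of what an MPNet-based classifier ensemble for these tracks might look like (the checkpoint, classifier head, and bootstrap ensembling are illustrative assumptions, not the BD system's exact setup):

```python
# Illustrative sketch: MPNet sentence embeddings feeding small classifier
# heads, ensembled by bootstrap resampling and probability averaging.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

embedder = SentenceTransformer("all-mpnet-base-v2")  # assumed MPNet checkpoint

def fit_ensemble(texts, labels, n_members=5, seed=0):
    X = embedder.encode(texts, normalize_embeddings=True)
    y = np.asarray(labels)
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        idx = rng.choice(len(y), size=len(y), replace=True)  # bootstrap sample
        members.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return members

def predict(members, texts):
    X = embedder.encode(texts, normalize_embeddings=True)
    probs = np.mean([m.predict_proba(X) for m in members], axis=0)
    return probs.argmax(axis=1)  # soft vote across ensemble members
```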
- MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors [0.0]
We present our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location.
arXiv Detail & Related papers (2025-05-24T06:32:02Z)
- DemoCraft: Using In-Context Learning to Improve Code Generation in Large Language Models [0.0]
We propose DemoCraft, which enhances code generation by leveraging in-context learning and demonstration selection. Latent concept learning introduces additional concept tokens, which are trainable embeddings that capture task-specific knowledge. Our experimental results demonstrate that the proposed system achieves an approximate 2x increase in the pass@k metric. Our empirical studies indicate that our system attains nearly a 3x improvement in these metrics as well.
arXiv Detail & Related papers (2024-10-30T19:45:50Z)
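The DemoCraft entry reports its gains in pass@k; for reference, the standard unbiased estimator of pass@k (a well-known formula from the Codex evaluation methodology, not code from this paper) is easy to compute:

```python
# Unbiased pass@k estimator: given n generated samples per task, of which
# c pass the unit tests, estimate P(at least one of k sampled solutions passes).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples with 37 passing gives pass@10 ≈ 0.88
print(pass_at_k(200, 37, 10))
```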
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Aligning Large Language Models by On-Policy Self-Judgment [49.31895979525054]
Existing approaches for aligning large language models with human preferences face a trade-off: on-policy learning typically requires a separate reward model (RM).
We present a novel alignment framework, SELF-JUDGE, that performs on-policy learning and is parameter-efficient.
We show that rejection sampling by itself can further improve performance without an additional evaluator.
arXiv Detail & Related papers (2024-02-17T11:25:26Z)
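A minimal sketch of the best-of-n rejection-sampling idea this entry describes, where the policy model judges its own candidates instead of relying on a separate reward model; generate() and judge_score() are illustrative stubs, not the SELF-JUDGE training recipe:

```python
# Sketch of judge-based rejection sampling (best-of-n): the policy model
# samples several candidates, then scores them itself; highest score wins.
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],          # stub: sample one response
    judge_score: Callable[[str, str], float],  # stub: model scores its own output
    n: int = 8,
) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    # The same model acts as evaluator: no separate reward model is needed.
    return max(candidates, key=lambda resp: judge_score(prompt, resp))
```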
- Limits of Transformer Language Models on Learning to Compose Algorithms [77.2443883991608]
We evaluate training LLaMA models and prompting GPT-4 and Gemini on four tasks that require learning a composition of several discrete sub-tasks.
Our results indicate that compositional learning in state-of-the-art Transformer language models is highly sample inefficient.
arXiv Detail & Related papers (2024-02-08T16:23:29Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
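A hedged sketch of how a prompted LLM could slot into an AdaBoost-style loop as the weak learner, loosely in the spirit of this entry; make_llm_learner() is a stub that would build a few-shot prompt from the resampled data, and the weighting scheme is textbook AdaBoost rather than the paper's exact procedure:

```python
# Sketch: AdaBoost with a prompted LLM as the weak learner (labels in {-1,+1}).
# make_llm_learner() is a hypothetical stub returning a predict function.
import numpy as np

def boost_llm(X, y, make_llm_learner, rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                              # example weights
    learners, alphas = [], []
    rng = np.random.default_rng(0)
    for _ in range(rounds):
        idx = rng.choice(n, size=n, replace=True, p=w)   # weight-driven resample
        h = make_llm_learner([X[i] for i in idx], y[idx])  # prompt-based classifier
        pred = np.array([h(x) for x in X])
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)            # learner weight
        w *= np.exp(-alpha * y * pred)                   # upweight mistakes
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    return lambda x: np.sign(sum(a * h(x) for a, h in zip(alphas, learners)))
```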
- UU-Tax at SemEval-2022 Task 3: Improving the generalizability of language models for taxonomy classification through data augmentation [0.0]
This paper addresses SemEval-2022 Task 3, PreTENS: Presupposed Taxonomies Evaluating Neural-network Semantics.
The goal of the task is to identify if a sentence is deemed acceptable or not, depending on the taxonomic relationship that holds between a noun pair contained in the sentence.
We propose an effective way to enhance the robustness and the generalizability of language models for better classification.
arXiv Detail & Related papers (2022-10-07T07:41:28Z)
- Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution [0.0]
We introduce a simple causal mechanism to describe the role underspecification plays in the generation of spurious correlations.
Despite its simplicity, our causal model directly informs the development of two lightweight black-box evaluation methods.
arXiv Detail & Related papers (2022-09-30T23:10:11Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results in in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
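The claim that pre-training objectives can be cast as one another and interpolated can be pictured as sampling a denoising configuration per example; the sketch below builds input/target pairs under two simplified regimes (short-span corruption and prefix-LM continuation). The masking details are assumptions for illustration, not the paper's recipe:

```python
# Sketch: mixing pre-training objectives by sampling a denoiser per example.
# Token and masking details are simplified assumptions.
import random

def span_corrupt(tokens, span_len=3):
    """Mask one short span; the model must reconstruct it."""
    start = random.randrange(max(1, len(tokens) - span_len))
    inputs = tokens[:start] + ["<X>"] + tokens[start + span_len:]
    targets = ["<X>"] + tokens[start:start + span_len]
    return inputs, targets

def prefix_lm(tokens):
    """Condition on a prefix; the model must continue the sequence."""
    cut = len(tokens) // 2
    return tokens[:cut], tokens[cut:]

def sample_objective(tokens, p_span=0.5):
    """Interpolate between objectives by sampling one per example."""
    return span_corrupt(tokens) if random.random() < p_span else prefix_lm(tokens)
```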
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
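One way to picture the multiple-hypothesis idea above: instead of committing to the 1-best ASR transcript as the pseudo-label, spread the training loss over an n-best list, weighting each hypothesis by its normalized score. seq_loss() is a stub and the weighting is an assumption, not the paper's exact objective:

```python
# Sketch: sequence-level self-learning over an n-best list, weighting the
# pseudo-label loss of each hypothesis by its softmax-normalized log-score.
import math

def nbest_loss(audio_feats, nbest, seq_loss):
    """nbest: list of (hypothesis_tokens, log_score) from the current model."""
    scores = [s for _, s in nbest]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]   # softmax over log-scores
    total = sum(weights)
    return sum(
        (w / total) * seq_loss(audio_feats, hyp)  # weighted pseudo-label loss
        for (hyp, _), w in zip(nbest, weights)
    )
```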
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)