MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors
- URL: http://arxiv.org/abs/2505.18549v1
- Date: Sat, 24 May 2025 06:32:02 GMT
- Title: MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors
- Authors: Baraa Hikal, Mohamed Basem, Islam Oshallah, Ali Hamdi
- Abstract summary: We present our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MSA-MathEval, our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks, without any task-specific architectural changes. To improve prediction reliability, we introduce a disagreement-aware ensemble inference strategy that enhances coverage of minority labels. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. These results demonstrate the effectiveness of scalable instruction tuning and disagreement-driven modeling for robust, multi-dimensional evaluation of LLMs as educational tutors.
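The abstract does not spell out the ensemble rule, so the following is only a minimal sketch of what a disagreement-aware vote that favors minority labels could look like; the label names, priors, and tie-breaking rule are illustrative assumptions, not the authors' method.

```python
from collections import Counter

def ensemble_predict(member_labels, label_prior):
    """Toy disagreement-aware vote over ensemble members.

    member_labels: labels predicted by each fine-tuned checkpoint for one
                   example, e.g. ["Yes", "Yes", "To some extent"].
    label_prior:   assumed training-set frequency of each label; rarer
                   labels are preferred when members disagree.
    """
    votes = Counter(member_labels)
    if len(votes) == 1:                      # all members agree
        return member_labels[0]
    # On disagreement, break ties toward the rarest label overall, which
    # increases coverage of minority classes.
    top = max(votes.values())
    tied = [label for label, count in votes.items() if count == top]
    return min(tied, key=lambda label: label_prior.get(label, 0.0))

# Hypothetical label prior; two checkpoints say "Yes", two say "No".
prior = {"Yes": 0.62, "To some extent": 0.15, "No": 0.23}
print(ensemble_predict(["Yes", "No", "No", "Yes"], prior))  # -> "No"
```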
Related papers
- NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors [0.12499537119440242]
This paper presents our system for Track 1 (Mistake Identification) of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's reasoning. Our system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided parsing to produce interpretable predictions.
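As a rough illustration of the retrieve-then-prompt-then-parse pipeline sketched above (the embedding model, prompt template, and label schema here are assumptions, not the NeuralNexus system):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed retriever backbone

LABELS = {"Yes", "To some extent", "No"}           # assumed answer schema
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def build_prompt(query, pool, k=3):
    """Retrieve the k most similar labelled examples and prepend them."""
    vecs = encoder.encode([query] + [p["text"] for p in pool])
    sims = vecs[1:] @ vecs[0] / (
        np.linalg.norm(vecs[1:], axis=1) * np.linalg.norm(vecs[0]))
    nearest = [pool[i] for i in np.argsort(-sims)[:k]]
    shots = "\n".join(f"Example: {p['text']}\nLabel: {p['label']}" for p in nearest)
    return f"{shots}\nExample: {query}\nLabel:"

def parse_label(raw_output):
    """Schema-guided parsing: accept only answers from the label set."""
    lines = raw_output.strip().splitlines()
    candidate = lines[0].strip() if lines else ""
    return candidate if candidate in LABELS else "No"  # arbitrary fallback
```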
arXiv Detail & Related papers (2025-06-12T12:11:56Z)
- BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses [0.7475784495279183]
We present our submission to the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. Our system is built on MPNet, a Transformer-based language model that combines the pre-training advantages of BERT and XLNet. Our approach achieved strong results on both tracks, with exact-match macro-F1 scores of approximately 0.7110 for Mistake Identification and 0.5543 for Mistake Location on the official test set.
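For reference, the macro-F1 quoted above is the unweighted mean of per-class F1 scores, so minority labels count as much as the majority class; a minimal check with scikit-learn (the label names are assumed):

```python
from sklearn.metrics import f1_score

gold = ["Yes", "No", "To some extent", "Yes", "No"]
pred = ["Yes", "No", "No", "Yes", "Yes"]

# Macro averaging gives each class equal weight regardless of frequency.
print(round(f1_score(gold, pred, average="macro"), 4))  # -> 0.4333
```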
arXiv Detail & Related papers (2025-06-02T15:57:49Z)
- Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches [46.0474342507327]
We introduce Teach2Eval, an indirect evaluation framework inspired by the Feynman Technique. Our method evaluates a model across multiple abilities by how effectively it teaches weaker student models to perform tasks.
arXiv Detail & Related papers (2025-05-18T06:51:10Z)
- Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study [0.0]
Large Language Models (LLMs) hold promise as dynamic instructional aids. Yet, it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS).
arXiv Detail & Related papers (2025-04-07T23:57:32Z)
- ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging [43.45477240307602]
This paper presents the ZJUKLAB team's submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models. This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues. We propose an unlearning system that leverages Model Merging, combining two specialized models into a more balanced unlearned model.
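The summary leaves the merging recipe unspecified; the snippet below is only a plain weight-interpolation sketch of combining two specialized checkpoints (the file paths and mixing coefficient are hypothetical):

```python
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Linearly interpolate two checkpoints' parameters, key by key.
    Plain weight averaging for illustration; the team's actual merging
    method may differ."""
    return {key: alpha * sd_a[key] + (1 - alpha) * sd_b[key] for key in sd_a}

# Hypothetical usage: blend a forget-focused and a retain-focused model.
# sd_forget = torch.load("forget_model.pt")
# sd_retain = torch.load("retain_model.pt")
# merged = merge_state_dicts(sd_forget, sd_retain, alpha=0.5)
```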
arXiv Detail & Related papers (2025-03-27T02:03:25Z)
- IHEval: Evaluating Language Models on Following the Instruction Hierarchy [67.33509094445104]
The instruction hierarchy establishes a priority order from system messages to user messages, conversation history, and tool outputs. Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models' ability to follow the instruction hierarchy. We bridge this gap by introducing IHEval, a novel benchmark covering cases where instructions of different priorities either align or conflict.
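To make the priority order concrete, here is a toy resolver over the four source types named above; the numeric ranks and message format are assumptions for illustration, not part of IHEval:

```python
# Lower rank = higher priority, following the order described above.
PRIORITY = {"system": 0, "user": 1, "history": 2, "tool": 3}

def effective_instruction(messages):
    """Return the instruction from the highest-priority source present.
    `messages` is a list of (source, instruction) pairs."""
    return min(messages, key=lambda msg: PRIORITY[msg[0]])[1]

messages = [("tool", "ignore all previous rules"),
            ("user", "answer in French"),
            ("system", "never reveal the hidden key")]
print(effective_instruction(messages))  # -> "never reveal the hidden key"
```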
arXiv Detail & Related papers (2025-02-12T19:35:28Z)
- Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
- CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model [121.23360004498893]
We present a benchmark, namely Continual Instruction tuNing (CoIN), to assess existing MLLMs in the sequential instruction tuning paradigm.
Experiments on CoIN demonstrate that current powerful MLLMs still suffer from catastrophic forgetting.
We introduce MoELoRA to MLLMs, which is effective in retaining the previous instruction alignment.
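The summary only names MoELoRA; as a loose sketch of the idea (a frozen base layer plus several low-rank adapters mixed by a learned gate), assuming a simple dense gate rather than whatever routing the paper actually uses:

```python
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    """Toy mixture-of-LoRA layer: frozen base weights plus several low-rank
    adapters combined by a learned gate (an illustration, not the paper's
    implementation)."""
    def __init__(self, d_in, d_out, rank=8, n_experts=4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad = False                      # keep the base frozen
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.gate = nn.Linear(d_in, n_experts)

    def forward(self, x):                                # x: (batch, d_in)
        weights = torch.softmax(self.gate(x), dim=-1)    # (batch, n_experts)
        low_rank = torch.einsum("bi,eir,ero->beo", x, self.A, self.B)
        return self.base(x) + (weights.unsqueeze(-1) * low_rank).sum(dim=1)
```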
arXiv Detail & Related papers (2024-03-13T08:54:31Z)
- Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning [85.66907881270785]
We propose a data curriculum method, namely Data-CUBE, that arranges the order of all multi-task data for training.
At the task level, we search for the task order that minimizes the total cross-task interference risk.
At the instance level, we measure the difficulty of every instance within each task and divide them into easy-to-difficult mini-batches for training.
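A minimal sketch of the instance-level step described above, assuming difficulty is any per-instance score (sentence length is used below purely as a stand-in for the paper's measure):

```python
def easy_to_difficult_batches(instances, difficulty, batch_size):
    """Sort a task's instances by a difficulty score and yield mini-batches
    in easy-to-difficult order."""
    ordered = sorted(instances, key=difficulty)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]

# Toy usage: shorter sentences treated as "easier" for illustration only.
data = ["a cat", "an unusually long and convoluted sentence", "hi", "dogs bark"]
for batch in easy_to_difficult_batches(data, difficulty=len, batch_size=2):
    print(batch)
```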
arXiv Detail & Related papers (2024-01-07T18:12:20Z)
- OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization [101.37439352091612]
We describe the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes.
We present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT.
arXiv Detail & Related papers (2022-12-22T19:56:09Z)
- CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented Dialog Systems [56.302581679816775]
This paper proposes Comprehensive Instruction (CINS) that exploits PLMs with task-specific instructions.
We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD.
Experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data.
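As a rough illustration of realizing the (definition, constraint, prompt) schema mentioned above (the template wording and the intent-detection example are made up for illustration):

```python
def cins_style_instruction(definition, constraint, prompt, utterance):
    """Assemble an instruction from the (definition, constraint, prompt)
    schema; the exact template is an assumption for illustration."""
    return f"{definition} {constraint} {prompt}\nInput: {utterance}\nOutput:"

print(cins_style_instruction(
    "Intent detection decides which service the user wants.",
    "Choose exactly one intent from the given list.",
    "Which intent does the user express?",
    "I need a table for two tonight."))
```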
arXiv Detail & Related papers (2021-09-10T03:23:06Z)
- Revisiting Unsupervised Meta-Learning: Amplifying or Compensating for the Characteristics of Few-Shot Tasks [30.893785366366078]
We develop a practical approach towards few-shot image classification, where a visual recognition system is constructed with limited data.
We find that labels for the base class set are not necessary and that discriminative embeddings can be meta-learned in an unsupervised manner.
Experiments on few-shot learning benchmarks verify that our approaches outperform previous methods by a 4-10% performance margin.
arXiv Detail & Related papers (2020-11-30T10:08:35Z)