BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses
- URL: http://arxiv.org/abs/2506.01817v1
- Date: Mon, 02 Jun 2025 15:57:49 GMT
- Title: BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses
- Authors: Shadman Rohan, Ishita Sur Apan, Muhtasim Ibteda Shochcho, Md Fahim, Mohammad Ashfaq Ur Rahman, AKM Mahbubur Rahman, Amin Ahsan Ali,
- Abstract summary: We present our submission to the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors.<n>Our system is built on MPNet, a Transformer-based language model that combines BERT and XLNet's pre-training advantages.<n>Our approach achieved strong results on both tracks, with exact-match macro-F1 scores of approximately 0.7110 for Mistake Identification and 0.5543 for Mistake Location on the official test set.
- Score: 0.7475784495279183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Team BD's submission to the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors, under Track 1 (Mistake Identification) and Track 2 (Mistake Location). Both tracks involve three-class classification of tutor responses in educational dialogues - determining if a tutor correctly recognizes a student's mistake (Track 1) and whether the tutor pinpoints the mistake's location (Track 2). Our system is built on MPNet, a Transformer-based language model that combines BERT and XLNet's pre-training advantages. We fine-tuned MPNet on the task data using a class-weighted cross-entropy loss to handle class imbalance, and leveraged grouped cross-validation (10 folds) to maximize the use of limited data while avoiding dialogue overlap between training and validation. We then performed a hard-voting ensemble of the best models from each fold, which improves robustness and generalization by combining multiple classifiers. Our approach achieved strong results on both tracks, with exact-match macro-F1 scores of approximately 0.7110 for Mistake Identification and 0.5543 for Mistake Location on the official test set. We include comprehensive analysis of our system's performance, including confusion matrices and t-SNE visualizations to interpret classifier behavior, as well as a taxonomy of common errors with examples. We hope our ensemble-based approach and findings provide useful insights for designing reliable tutor response evaluation systems in educational dialogue settings.
Related papers
- NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors [0.12499537119440242]
This paper presents our system for Track 1, Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors.<n>The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's reasoning.<n>Our system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided parsing produceable predictions.
arXiv Detail & Related papers (2025-06-12T12:11:56Z) - MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors [0.0]
We present our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions.<n>Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks.<n>Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location.
arXiv Detail & Related papers (2025-05-24T06:32:02Z) - ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging [43.45477240307602]
This paper presents the ZJUKLAB team's submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models.<n>This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues.<n>We propose an unlearning system that leverages Model Merging, combining two specialized models into a more balanced unlearned model.
arXiv Detail & Related papers (2025-03-27T02:03:25Z) - From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education [24.970741456147447]
Large Language Models (LLMs) have demonstrated impressive mathematical reasoning capabilities, achieving near-perfect performance on benchmarks like GSM8K.<n>However, their application in personalized education remains limited due to an overemphasis on correctness over error diagnosis and feedback generation.<n>We introduce textbfMathCCS, a benchmark designed for systematic error analysis and tailored feedback.<n>Second, we develop a sequential error analysis framework that leverages historical data to track trends and improve diagnostic precision.<n>Third, we propose a multi-agent collaborative framework that combines a Time Series Agent for historical analysis and an MLLM Agent for real-
arXiv Detail & Related papers (2025-02-19T14:57:51Z) - A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 instruction unique-following prompts.<n>With our synthetic prompts, we use two preference dataset curation methods - rejection sampling (RS) and Monte Carlo Tree Search (MCTS)<n>Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements.<n>High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z) - PromptSync: Bridging Domain Gaps in Vision-Language Models through Class-Aware Prototype Alignment and Discrimination [14.50214193838818]
A zero-shot generalization in vision-language (V-L) models such as CLIP has spurred their widespread adoption.
Previous methods have employed test-time prompt tuning to adapt the model to unseen domains, but they overlooked the issue of imbalanced class distributions.
In this study, we employ class-aware prototype alignment weighted by mean class probabilities obtained for a test sample and filtered augmented views.
arXiv Detail & Related papers (2024-04-11T07:26:00Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely textbfSelf-textbfReinforcing textbfErrors textbfMitigation (SREM)
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - One-bit Supervision for Image Classification: Problem, Solution, and
Beyond [114.95815360508395]
This paper presents one-bit supervision, a novel setting of learning with fewer labels, for image classification.
We propose a multi-stage training paradigm and incorporate negative label suppression into an off-the-shelf semi-supervised learning algorithm.
In multiple benchmarks, the learning efficiency of the proposed approach surpasses that using full-bit, semi-supervised supervision.
arXiv Detail & Related papers (2023-11-26T07:39:00Z) - Cross-head mutual Mean-Teaching for semi-supervised medical image
segmentation [6.738522094694818]
Semi-supervised medical image segmentation (SSMIS) has witnessed substantial advancements by leveraging limited labeled data and abundant unlabeled data.
Existing state-of-the-art (SOTA) methods encounter challenges in accurately predicting labels for the unlabeled data.
We propose a novel Cross-head mutual mean-teaching Network (CMMT-Net) incorporated strong-weak data augmentation.
arXiv Detail & Related papers (2023-10-08T09:13:04Z) - Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieve strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z) - ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback [54.142719510638614]
In this paper, we frame the problem of providing feedback as few-shot classification.
A meta-learner adapts to give feedback to student code on a new programming question from just a few examples by instructors.
Our approach was successfully deployed to deliver feedback to 16,000 student exam-solutions in a programming course offered by a tier 1 university.
arXiv Detail & Related papers (2021-07-23T22:41:28Z) - Deep F-measure Maximization for End-to-End Speech Understanding [52.36496114728355]
We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.
We perform experiments on two standard fairness datasets, Adult, Communities and Crime, and also on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset.
In all four of these tasks, F-measure results in improved micro-F1 scores, with absolute improvements of up to 8% absolute, as compared to models trained with the cross-entropy loss function.
arXiv Detail & Related papers (2020-08-08T03:02:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.