Assessing the efficacy of large language models in generating accurate
teacher responses
- URL: http://arxiv.org/abs/2307.04274v1
- Date: Sun, 9 Jul 2023 22:32:46 GMT
- Title: Assessing the efficacy of large language models in generating accurate
teacher responses
- Authors: Yann Hicke, Abhishek Masand, Wentao Guo, Tushaar Gangavarapu
- Abstract summary: This study attempts to assess the generative abilities of large language models in providing informative and helpful insights to students.
We present an extensive evaluation of several benchmarking generative models, including GPT-4 (few-shot, in-context learning), fine-tuned GPT-2, and fine-tuned DialoGPT.
Our experimental findings on the Teacher-Student Chatroom Corpus subset indicate the efficacy of GPT-4 over other fine-tuned models, measured using BERTScore and DialogRPT.
- Score: 0.5774786149181391
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tack et al. (2023) organized the shared task on the generation of
teacher language in educational dialogues, hosted by the 18th Workshop on
Innovative Use of NLP for Building Educational Applications. Following the structure of the
shared task, in this study, we attempt to assess the generative abilities of
large language models in providing informative and helpful insights to
students, thereby simulating the role of a knowledgeable teacher. To this end,
we present an extensive evaluation of several benchmarking generative models,
including GPT-4 (few-shot, in-context learning), fine-tuned GPT-2, and
fine-tuned DialoGPT. Additionally, to optimize for pedagogical quality, we
fine-tuned the Flan-T5 model using reinforcement learning. Our experimental
findings on the Teacher-Student Chatroom Corpus subset indicate the efficacy of
GPT-4 over other fine-tuned models, measured using BERTScore and DialogRPT.
We hypothesize that several dataset characteristics, including sampling,
representativeness, and dialog completeness, pose significant challenges to
fine-tuning, thus contributing to the poor generalizability of the fine-tuned
models. Finally, we note the need for these generative models to be evaluated
with a metric that relies not only on dialog coherence and matched language
modeling distribution but also on the model's ability to showcase pedagogical
skills.
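The BERTScore metric used in the abstract's evaluation works by greedily matching each token in the candidate response to its most similar token in the reference, scoring similarity with contextual embeddings. The sketch below is a simplified, illustrative stand-in (not the paper's implementation): it uses exact token identity in place of embedding cosine similarity, which is enough to show the precision/recall/F1 matching structure.

```python
# Illustrative sketch of BERTScore-style greedy token matching.
# Assumption: real BERTScore scores token pairs by contextual-embedding
# cosine similarity; here similarity is 1.0 for identical tokens, else 0.0.

def greedy_match_f1(candidate: str, reference: str) -> float:
    """Token-level F1 via greedy matching (a crude stand-in for BERTScore)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    # Precision: fraction of candidate tokens matched to a reference token.
    remaining = list(ref_tokens)
    matched = 0
    for tok in cand_tokens:
        if tok in remaining:
            remaining.remove(tok)  # each reference token is used at most once
            matched += 1
    precision = matched / len(cand_tokens)

    # Recall: fraction of reference tokens matched to a candidate token.
    remaining = list(cand_tokens)
    matched = 0
    for tok in ref_tokens:
        if tok in remaining:
            remaining.remove(tok)
            matched += 1
    recall = matched / len(ref_tokens)

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical teacher-response pair for illustration.
teacher_gold = "Try rereading the sentence and look at the verb tense."
model_reply = "Look at the verb tense and try rereading the sentence."
print(round(greedy_match_f1(model_reply, teacher_gold), 3))
```

As the abstract notes, surface-similarity metrics like this reward matching the reference distribution; they cannot, on their own, capture pedagogical skill, which is why the paper pairs BERTScore with DialogRPT and calls for pedagogy-aware evaluation.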
Related papers
- Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency [3.161954199291541]
This research study comprehensively evaluates the language, vision, speech, and multimodal capabilities of GPT-4o.
GPT-4o demonstrates high accuracy and efficiency across multiple domains in language and reasoning capabilities.
The model shows variability and faces limitations in handling complex and ambiguous inputs.
arXiv Detail & Related papers (2024-06-19T19:00:21Z)
- Toward In-Context Teaching: Adapting Examples to Students' Misconceptions [54.82965010592045]
We introduce a suite of models and evaluation methods we call AdapT.
AToM is a new probabilistic model for adaptive teaching that jointly infers students' past beliefs and optimizes for the correctness of their future beliefs.
Our results highlight both the difficulty of the adaptive teaching task and the potential of learned adaptive models for solving it.
arXiv Detail & Related papers (2024-05-07T17:05:27Z)
- Information-Theoretic Distillation for Reference-less Summarization [67.51150817011617]
We present a novel framework to distill a powerful summarizer based on the information-theoretic objective for summarization.
We start off from Pythia-2.8B as the teacher model, which is not yet capable of summarization.
We arrive at a compact but powerful summarizer with only 568M parameters that performs competitively against ChatGPT.
arXiv Detail & Related papers (2024-03-20T17:42:08Z)
- Baichuan2-Sum: Instruction Finetune Baichuan2-7B Model for Dialogue Summarization [12.45299260235282]
We propose an instruction fine-tuning model, Baichuan2-Sum, for role-oriented dialogue summarization.
By setting different instructions for different roles, the model can learn from the dialogue interactions and output the expected summaries.
Experiments demonstrate that the proposed model achieves the new state-of-the-art results on two public dialogue summarization datasets.
arXiv Detail & Related papers (2024-01-27T20:20:39Z)
- INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models [39.46610170563634]
INSTRUCTEVAL is a more comprehensive evaluation suite designed specifically for instruction-tuned large language models.
We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods.
Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance.
arXiv Detail & Related papers (2023-06-07T20:12:29Z)
- Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z)
- Opportunities and Challenges in Neural Dialog Tutoring [54.07241332881601]
We rigorously analyze various generative language models on two dialog tutoring datasets for language learning.
We find that although current approaches can model tutoring in constrained learning scenarios, they perform poorly in less constrained scenarios.
Our human quality evaluation shows that both models and ground-truth annotations exhibit low performance in terms of equitable tutoring.
arXiv Detail & Related papers (2023-01-24T11:00:17Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- A Comparative Study on Language Models for Task-Oriented Dialogue Systems [14.634286037008017]
In task-oriented dialogue (ToD) systems, language models can be used for end-to-end training.
BART and T5 outperform GPT-based models in BLEU and F1 scores and achieve state-of-the-art performance in a ToD system.
arXiv Detail & Related papers (2022-01-21T13:24:25Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.