Assessing the efficacy of large language models in generating accurate
teacher responses
- URL: http://arxiv.org/abs/2307.04274v1
- Date: Sun, 9 Jul 2023 22:32:46 GMT
- Title: Assessing the efficacy of large language models in generating accurate
teacher responses
- Authors: Yann Hicke, Abhishek Masand, Wentao Guo, Tushaar Gangavarapu
- Abstract summary: This study attempts to assess the generative abilities of large language models in providing informative and helpful insights to students.
We present an extensive evaluation of several benchmarking generative models, including GPT-4 (few-shot, in-context learning), fine-tuned GPT-2, and fine-tuned DialoGPT.
Our experimental findings on the Teacher-Student Chatroom Corpus subset indicate the efficacy of GPT-4 over other fine-tuned models, measured using BERTScore and DialogRPT.
- Score: 0.5774786149181391
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tack et al. (2023) organized the shared task on the generation of
teacher language in educational dialogues, hosted by the 18th Workshop on
Innovative Use of NLP for Building Educational Applications. Following the structure of the
shared task, in this study, we attempt to assess the generative abilities of
large language models in providing informative and helpful insights to
students, thereby simulating the role of a knowledgeable teacher. To this end,
we present an extensive evaluation of several benchmarking generative models,
including GPT-4 (few-shot, in-context learning), fine-tuned GPT-2, and
fine-tuned DialoGPT. Additionally, to optimize for pedagogical quality, we
fine-tuned the Flan-T5 model using reinforcement learning. Our experimental
findings on the Teacher-Student Chatroom Corpus subset indicate the efficacy of
GPT-4 over other fine-tuned models, measured using BERTScore and DialogRPT.
We hypothesize that several dataset characteristics, including sampling,
representativeness, and dialog completeness, pose significant challenges to
fine-tuning, thus contributing to the poor generalizability of the fine-tuned
models. Finally, we note the need for these generative models to be evaluated
with a metric that relies not only on dialog coherence and matched language
modeling distribution but also on the model's ability to showcase pedagogical
skills.
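The BERTScore metric used in the abstract's evaluation works by greedily matching each token in the candidate response to its most similar token in the reference, scoring similarity with contextual embeddings. The sketch below is a simplified, illustrative stand-in (not the paper's implementation): it uses exact token identity in place of embedding cosine similarity, which is enough to show the precision/recall/F1 matching structure.

```python
# Illustrative sketch of BERTScore-style greedy token matching.
# Assumption: real BERTScore scores token pairs by contextual-embedding
# cosine similarity; here similarity is 1.0 for identical tokens, else 0.0.

def greedy_match_f1(candidate: str, reference: str) -> float:
    """Token-level F1 via greedy matching (a crude stand-in for BERTScore)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    # Precision: fraction of candidate tokens matched to a reference token.
    remaining = list(ref_tokens)
    matched = 0
    for tok in cand_tokens:
        if tok in remaining:
            remaining.remove(tok)  # each reference token is used at most once
            matched += 1
    precision = matched / len(cand_tokens)

    # Recall: fraction of reference tokens matched to a candidate token.
    remaining = list(cand_tokens)
    matched = 0
    for tok in ref_tokens:
        if tok in remaining:
            remaining.remove(tok)
            matched += 1
    recall = matched / len(ref_tokens)

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical teacher-response pair for illustration.
teacher_gold = "Try rereading the sentence and look at the verb tense."
model_reply = "Look at the verb tense and try rereading the sentence."
print(round(greedy_match_f1(model_reply, teacher_gold), 3))
```

As the abstract notes, surface-similarity metrics like this reward matching the reference distribution; they cannot, on their own, capture pedagogical skill, which is why the paper pairs BERTScore with DialogRPT and calls for pedagogy-aware evaluation.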
Related papers
- Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency [3.161954199291541]
This research study comprehensively evaluates the language, vision, speech, and multimodal capabilities of GPT-4o.
GPT-4o demonstrates high accuracy and efficiency across multiple domains in language and reasoning capabilities.
The model shows variability and faces limitations in handling complex and ambiguous inputs.
arXiv Detail & Related papers (2024-06-19T19:00:21Z)
- Toward In-Context Teaching: Adapting Examples to Students' Misconceptions [54.82965010592045]
We introduce a suite of models and evaluation methods we call AdapT.
AToM is a new probabilistic model for adaptive teaching that jointly infers students' past beliefs and optimizes for the correctness of their future beliefs.
Our results highlight both the difficulty of the adaptive teaching task and the potential of learned adaptive models for solving it.
arXiv Detail & Related papers (2024-05-07T17:05:27Z)
- Information-Theoretic Distillation for Reference-less Summarization [67.51150817011617]
We present a novel framework to distill a powerful summarizer based on the information-theoretic objective for summarization.
We start off from Pythia-2.8B as the teacher model, which is not yet capable of summarization.
We arrive at a compact but powerful summarizer with only 568M parameters that performs competitively against ChatGPT.
arXiv Detail & Related papers (2024-03-20T17:42:08Z)
- Baichuan2-Sum: Instruction Finetune Baichuan2-7B Model for Dialogue Summarization [12.45299260235282]
We propose an instruction fine-tuning model, Baichuan2-Sum, for role-oriented dialogue summarization.
By setting different instructions for different roles, the model can learn from the dialogue interactions and output the expected summaries.
Experiments demonstrate that the proposed model achieves the new state-of-the-art results on two public dialogue summarization datasets.
arXiv Detail & Related papers (2024-01-27T20:20:39Z)
- INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models [39.46610170563634]
INSTRUCTEVAL is a more comprehensive evaluation suite designed specifically for instruction-tuned large language models.
We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods.
Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance.
arXiv Detail & Related papers (2023-06-07T20:12:29Z)
- Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z)
- Opportunities and Challenges in Neural Dialog Tutoring [54.07241332881601]
We rigorously analyze various generative language models on two dialog tutoring datasets for language learning.
We find that although current approaches can model tutoring in constrained learning scenarios, they perform poorly in less constrained scenarios.
Our human quality evaluation shows that both models and ground-truth annotations exhibit low performance in terms of equitable tutoring.
arXiv Detail & Related papers (2023-01-24T11:00:17Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- A Comparative Study on Language Models for Task-Oriented Dialogue Systems [14.634286037008017]
In task-oriented dialogue (ToD) systems, language models can be used for end-to-end training.
BART and T5 outperform GPT-based models in BLEU and F1 scores and achieve state-of-the-art performance in a ToD system.
arXiv Detail & Related papers (2022-01-21T13:24:25Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.