Evaluating Large Language Models in Analysing Classroom Dialogue
- URL: http://arxiv.org/abs/2402.02380v3
- Date: Fri, 23 Feb 2024 02:19:09 GMT
- Title: Evaluating Large Language Models in Analysing Classroom Dialogue
- Authors: Yun Long, Haifeng Luo, Yu Zhang
- Abstract summary: The study involves datasets from a middle school, encompassing classroom dialogues across mathematics and Chinese classes.
These dialogues were manually coded by educational experts and then analyzed using a customised GPT-4 model.
Results indicate substantial time savings with GPT-4, and a high degree of consistency in coding between the model and human coders, with some discrepancies in specific codes.
- Score: 8.793491910415897
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This study explores the application of Large Language Models (LLMs),
specifically GPT-4, in the analysis of classroom dialogue, a crucial research
task for both teaching diagnosis and quality improvement. Recognizing the
knowledge-intensive and labor-intensive nature of traditional qualitative
methods in educational research, this study investigates the potential of LLMs
to streamline and enhance the analysis process. The study involves datasets
from a middle school, encompassing classroom dialogues across mathematics and
Chinese classes. These dialogues were manually coded by educational experts and
then analyzed using a customised GPT-4 model. This study focuses on comparing
manual annotations with the outputs of GPT-4 to evaluate its efficacy in
analyzing educational dialogues. Time efficiency, inter-coder agreement, and
inter-coder reliability between human coders and GPT-4 are evaluated. Results
indicate substantial time savings with GPT-4, and a high degree of consistency
in coding between the model and human coders, with some discrepancies in
specific codes. These findings highlight the strong potential of LLMs in
teaching evaluation and facilitation.
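The abstract reports time efficiency, inter-coder agreement, and inter-coder reliability between GPT-4 and the human coders without naming the statistics used. As a minimal sketch, percent agreement and Cohen's kappa are common choices for these two quantities (not confirmed as this paper's choice); the dialogue codes below are hypothetical.
```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Share of utterances given the same code by both coders."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement (Cohen's kappa) between two coders."""
    n = len(codes_a)
    p_o = percent_agreement(codes_a, codes_b)
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Expected agreement if each coder assigned codes at random
    # according to their own marginal code frequencies.
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in freq_a.keys() | freq_b.keys())
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes for six utterances: human coder vs. GPT-4.
human = ["elicit", "explain", "elicit", "manage", "explain", "elicit"]
gpt4  = ["elicit", "explain", "explain", "manage", "explain", "elicit"]
print(percent_agreement(human, gpt4))  # 0.833...
print(cohens_kappa(human, gpt4))       # about 0.74
```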
Related papers
- Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency [3.161954199291541]
This study comprehensively evaluates the language, vision, speech, and multimodal capabilities of GPT-4o.
GPT-4o demonstrates high accuracy and efficiency in language and reasoning capabilities across multiple domains.
The model shows variability and faces limitations in handling complex and ambiguous inputs.
arXiv Detail & Related papers (2024-06-19T19:00:21Z)
- InFoBench: Evaluating Instruction Following Ability in Large Language Models [57.27152890085759]
Decomposed Requirements Following Ratio (DRFR) is a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions.
We present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories.
arXiv Detail & Related papers (2024-01-07T23:01:56Z)
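Regarding the InFoBench entry above: the exact scoring protocol is defined in that paper and not reproduced here, but DRFR can be read as the fraction of decomposed yes/no requirements an output satisfies. A rough sketch of that idea, with hypothetical requirement checks:
```python
def drfr(requirement_checks):
    """Decomposed Requirements Following Ratio: the fraction of
    decomposed yes/no requirement checks that the output satisfies."""
    return sum(requirement_checks) / len(requirement_checks)

# One instruction decomposed into four hypothetical requirements,
# each judged True (followed) or False (not followed).
checks = [True, True, False, True]
print(drfr(checks))  # 0.75
```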
- GPT-4 Surpassing Human Performance in Linguistic Pragmatics [0.0]
This study investigates the ability of Large Language Models (LLMs) to comprehend and interpret linguistic pragmatics.
Using Grice's communication principles, LLMs and human subjects were evaluated based on their responses to various dialogue-based tasks.
The findings revealed the superior performance and speed of LLMs, particularly GPT-4, over human subjects in interpreting pragmatics.
arXiv Detail & Related papers (2023-12-15T05:40:15Z)
- From Voices to Validity: Leveraging Large Language Models (LLMs) for Textual Analysis of Policy Stakeholder Interviews [14.135107583299277]
This study explores the integration of Large Language Models (LLMs) with human expertise to enhance text analysis of stakeholder interviews regarding K-12 education policy within one U.S. state.
Using a mixed-methods approach, human experts developed a codebook and coding processes as informed by domain knowledge and unsupervised topic modeling results.
Results reveal that GPT-4 thematic coding aligned with human coding on 77.89% of specific themes; expanding to broader themes increased congruence to 96.02%, surpassing traditional Natural Language Processing (NLP) methods by over 25%.
arXiv Detail & Related papers (2023-12-02T18:55:14Z)
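The jump from 77.89% to 96.02% in the entry above illustrates how agreement rises when fine-grained codes are collapsed into broader themes. A minimal sketch of that computation, with a hypothetical code-to-theme mapping:
```python
# Hypothetical mapping from specific codes to broader themes.
CODE_TO_THEME = {
    "funding_gap": "resources", "staffing": "resources",
    "testing": "accountability", "reporting": "accountability",
}

def agreement(a, b):
    """Fraction of segments where the two coders agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

human = ["funding_gap", "testing", "staffing", "reporting"]
gpt4  = ["staffing", "testing", "funding_gap", "reporting"]

print(agreement(human, gpt4))                       # 0.5 at the code level
print(agreement([CODE_TO_THEME[c] for c in human],
                [CODE_TO_THEME[c] for c in gpt4]))  # 1.0 at the theme level
```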
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work focuses on the factual consistency issue with the help of the dialogue summarization task.
Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency.
To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z)
- Harnessing the Power of Large Language Models for Empathetic Response Generation: Empirical Investigations and Improvements [28.630542719519855]
This work empirically investigates the performance of large language models (LLMs) in generating empathetic responses.
Extensive experiments show that LLMs can significantly benefit from the proposed methods and are able to achieve state-of-the-art performance in both automatic and human evaluations.
arXiv Detail & Related papers (2023-10-08T12:21:24Z)
- A Large Language Model Approach to Educational Survey Feedback Analysis [0.0]
This paper assesses the potential for the large language models (LLMs) GPT-4 and GPT-3.5 to aid in deriving insight from education feedback surveys.
arXiv Detail & Related papers (2023-09-29T17:57:23Z)
- Multi-Dimensional Evaluation of Text Summarization with In-Context Learning [79.02280189976562]
In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning.
Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization.
We then analyze the effects of factors such as the selection and number of in-context examples on performance.
arXiv Detail & Related papers (2023-06-01T23:27:49Z)
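As a minimal sketch of what an in-context, multi-dimensional evaluator from the entry above might look like (the paper's actual prompts and evaluation dimensions are not reproduced here; the examples, dimension name, and helper below are hypothetical):
```python
# Hypothetical in-context examples: (source text, summary, coherence 1-5).
EXAMPLES = [
    ("The council voted 7-2 to fund a new library branch downtown.",
     "Council approves funding for a new downtown library.", 5),
    ("The council voted 7-2 to fund a new library branch downtown.",
     "The mayor enjoys reading about weather patterns.", 1),
]

def build_eval_prompt(source, summary, dimension="coherence"):
    """Assemble a few-shot prompt asking an LLM to rate one dimension."""
    lines = [f"Rate the {dimension} of each summary on a 1-5 scale."]
    for src, summ, score in EXAMPLES:
        lines += [f"Source: {src}", f"Summary: {summ}",
                  f"{dimension.capitalize()}: {score}"]
    lines += [f"Source: {source}", f"Summary: {summary}",
              f"{dimension.capitalize()}:"]
    return "\n".join(lines)

print(build_eval_prompt(
    "The committee approved next year's budget after a short debate.",
    "Committee approves budget."))
```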
- Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability on discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z)
- Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, author prestige, and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.