Investigating the Efficacy of Large Language Models in Reflective
Assessment Methods through Chain of Thoughts Prompting
- URL: http://arxiv.org/abs/2310.00272v1
- Date: Sat, 30 Sep 2023 06:25:27 GMT
- Title: Investigating the Efficacy of Large Language Models in Reflective
Assessment Methods through Chain of Thoughts Prompting
- Authors: Baphumelele Masikisiki, Vukosi Marivate, Yvette Hlope
- Abstract summary: The Chain of Thought (CoT) prompting method has been proposed as a means to enhance LLMs' proficiency in complex reasoning tasks.
The primary aim of this research is to assess how well four language models can grade reflective essays written by third-year medical students.
- Score: 0.2552922646705803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs), such as Generative Pre-trained
Transformer 3 (GPT-3), are developed to understand language through the
analysis of extensive text data, allowing them to identify patterns and
connections between words. While LLMs have demonstrated impressive
performance across various text-related tasks, they struggle with tasks
that require reasoning. To address this challenge, the Chain of Thought
(CoT) prompting method has been proposed as a means to enhance LLMs'
proficiency in complex reasoning tasks such as solving math word problems
and answering questions that call for logical argumentation. The primary
aim of this research is to assess how well four language models can grade
reflective essays written by third-year medical students, specifically
evaluating the essays for evidence of critical thinking using CoT
prompting.
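To make the setup concrete, a minimal sketch of what a CoT grading prompt could look like follows; the rubric steps, the five-point scale, and the prompt wording are illustrative assumptions, not the paper's actual prompts.

```python
# Minimal sketch of a chain-of-thought grading prompt for reflective essays.
# The rubric steps and the 1-5 scale are assumptions for illustration; the
# paper's actual prompt wording is not reproduced here.

def build_cot_grading_prompt(essay: str) -> str:
    return (
        "You are grading a third-year medical student's reflective essay "
        "for evidence of critical thinking.\n"
        "Think step by step:\n"
        "1. Summarise the experience the student describes.\n"
        "2. Note where the student analyses rather than merely reports.\n"
        "3. Note whether the student draws lessons for future practice.\n"
        "Finally, output a single integer grade from 1 (no critical "
        "thinking) to 5 (strong critical thinking).\n\n"
        f"Essay:\n{essay}\n\nReasoning:"
    )
```

The intermediate numbered steps are what distinguish a CoT prompt from a plain grading instruction: the model is asked to lay out its reasoning before committing to a grade.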
The research makes the following contributions: it introduces and
documents the process of instructing models to evaluate reflective essays
from a dataset on which they have not previously been trained, and it
illustrates the use of CoT prompting as an instructional approach for
guiding large models to carry out particular tasks. Our results suggest
that, among all the models, Llama-7b performs the least effectively,
displaying the highest mean squared error. Conversely, ChatGPT emerges as
the best-performing model, achieving the highest Cohen's kappa score of
0.53. Lastly, it is worth noting that the selected models prioritise user
privacy by allowing users to delete their own conversations.
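For context on the two metrics reported above, agreement between model-assigned and human-assigned grades can be scored with scikit-learn: mean squared error penalises the distance between grades, while Cohen's kappa measures chance-corrected agreement. The grade lists in this sketch are invented placeholders, not data from the paper.

```python
# Sketch of the evaluation metrics named in the abstract.
# The grade lists are hypothetical placeholders, not the paper's data.
from sklearn.metrics import cohen_kappa_score, mean_squared_error

human_grades = [3, 4, 2, 5, 3, 4]  # hypothetical human rubric scores
model_grades = [3, 3, 2, 4, 3, 5]  # hypothetical grades parsed from model output

mse = mean_squared_error(human_grades, model_grades)   # lower is better
kappa = cohen_kappa_score(human_grades, model_grades)  # 1.0 = perfect agreement
print(f"MSE: {mse:.2f}, Cohen's kappa: {kappa:.2f}")
```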
Related papers
- LLMs are Superior Feedback Providers: Bootstrapping Reasoning for Lie Detection with Self-Generated Feedback [33.14770105185958]
Large Language Models (LLMs) excel at generating human-like dialogues and comprehending text.
We propose a bootstrapping framework that leverages self-generated feedback to enhance LLM reasoning capabilities for lie detection.
We investigate the application of the proposed framework for detecting betrayal and deception in Diplomacy games, and compare it with feedback from professional human players.
arXiv Detail & Related papers (2024-08-25T18:47:55Z)
- Constructive Large Language Models Alignment with Diverse Feedback [76.9578950893839]
We introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance large language models alignment.
We exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems.
By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data.
arXiv Detail & Related papers (2023-10-10T09:20:14Z)
- Self-Convinced Prompting: Few-Shot Question Answering with Repeated Introspection [13.608076739368949]
We introduce a novel framework that harnesses the potential of large-scale pre-trained language models.
Our framework processes the output of a typical few-shot chain-of-thought prompt, assesses the correctness of the response, scrutinizes the answer, and ultimately produces a new solution.
arXiv Detail & Related papers (2023-10-08T06:36:26Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations demonstrate a language model's comprehensive grasp of the questions it is asked.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Breakpoint Transformers for Modeling and Tracking Intermediate Beliefs [37.754787051387034]
We propose a representation learning framework called breakpoint modeling.
Our approach trains models in an efficient and end-to-end fashion to build intermediate representations.
We show the benefit of our main breakpoint transformer, based on T5, over conventional representation learning approaches.
arXiv Detail & Related papers (2022-11-15T07:28:14Z)
- Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue [92.01165203498299]
Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange.
This paper argues that imitation learning (IL) and related low-level metrics are actually misleading and do not align with the goals of embodied dialogue research.
arXiv Detail & Related papers (2022-10-10T05:51:40Z)
- CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented Dialog Systems [56.302581679816775]
This paper proposes Comprehensive Instruction (CINS) that exploits PLMs with task-specific instructions.
We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD.
Experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data.
arXiv Detail & Related papers (2021-09-10T03:23:06Z)
- Probing Task-Oriented Dialogue Representation from Language Models [106.02947285212132]
This paper investigates pre-trained language models to find out which model intrinsically carries the most informative representation for task-oriented dialogue tasks.
We fine-tune a feed-forward layer as the classifier probe on top of a fixed pre-trained language model with annotated labels in a supervised way.
arXiv Detail & Related papers (2020-10-26T21:34:39Z)
- Critical Thinking for Language Models [6.963299759354333]
This paper takes a first step towards a critical thinking curriculum for neural auto-regressive language models.
We generate artificial argumentative texts to train and evaluate GPT-2.
We obtain consistent and promising results for NLU benchmarks.
arXiv Detail & Related papers (2020-09-15T15:49:19Z)
- Learning an Effective Context-Response Matching Model with Self-Supervised Tasks for Retrieval-based Dialogues [88.73739515457116]
We introduce four self-supervised tasks including next session prediction, utterance restoration, incoherence detection and consistency discrimination.
We jointly train the PLM-based response selection model with these auxiliary tasks in a multi-task manner.
Experiment results indicate that the proposed auxiliary self-supervised tasks bring significant improvement for multi-turn response selection.
arXiv Detail & Related papers (2020-09-14T08:44:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.