Boosting Theory-of-Mind Performance in Large Language Models via
Prompting
- URL: http://arxiv.org/abs/2304.11490v3
- Date: Wed, 26 Apr 2023 04:02:04 GMT
- Title: Boosting Theory-of-Mind Performance in Large Language Models via
Prompting
- Authors: Shima Rahimi Moghaddam, Christopher J. Honey
- Abstract summary: This study measures the ToM performance of GPT-4 and three GPT-3.5 variants.
We investigated the effectiveness of in-context learning in improving ToM comprehension.
- Score: 2.538209532048867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) excel in many tasks in 2023, but they still face
challenges in complex reasoning. Theory-of-mind (ToM) tasks, which require
understanding agents' beliefs, goals, and mental states, are essential for
common-sense reasoning involving humans, making it crucial to enhance LLM
performance in this area. This study measures the ToM performance of GPT-4 and
three GPT-3.5 variants (Davinci-2, Davinci-3, GPT-3.5-Turbo), and investigates
the effectiveness of in-context learning in improving their ToM comprehension.
We evaluated prompts featuring two-shot chain of thought reasoning and
step-by-step thinking instructions. We found that LLMs trained with
Reinforcement Learning from Human Feedback (RLHF) (all models excluding
Davinci-2) improved their ToM accuracy via in-context learning. GPT-4 performed
best in zero-shot settings, reaching nearly 80% ToM accuracy, but still fell
short of the 87% human accuracy on the test set. However, when supplied with
prompts for in-context learning, all RLHF-trained LLMs exceeded 80% ToM
accuracy, with GPT-4 reaching 100%. These results demonstrate that appropriate
prompting enhances LLM ToM reasoning, and they underscore the context-dependent
nature of LLM cognitive capacities.
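The prompting recipe described in the abstract (two-shot chain-of-thought examples plus a step-by-step instruction) can be illustrated with a short sketch. The example stories, the `build_tom_prompt` helper, and the `call_model` stub below are assumptions made for illustration, not the authors' actual prompts or evaluation harness.

```python
# Minimal sketch of the prompting idea: prepend two worked chain-of-thought
# examples and a "think step by step" instruction before a new false-belief
# question. Stories and the call_model() stub are illustrative placeholders.

TWO_SHOT_COT = """\
Q: Anna puts her keys in the drawer and leaves. Ben moves the keys to the
shelf while she is away. Where will Anna look for her keys?
A: Let's think step by step. Anna last saw the keys in the drawer. She did
not see Ben move them, so her belief is outdated. She will look in the drawer.

Q: A bag labelled "chocolate" is actually full of popcorn. Sam has never
looked inside. What does Sam think is in the bag?
A: Let's think step by step. Sam only has the label to go on, and the label
says chocolate. Sam believes the bag contains chocolate.
"""

STEP_BY_STEP = "Let's think step by step."


def build_tom_prompt(question: str) -> str:
    """Combine two-shot CoT examples, the new question, and a
    step-by-step instruction into one prompt string."""
    return f"{TWO_SHOT_COT}\nQ: {question}\nA: {STEP_BY_STEP}"


def call_model(prompt: str) -> str:
    """Placeholder for an LLM call (e.g. GPT-4 via an API client)."""
    raise NotImplementedError("plug in your model client here")


if __name__ == "__main__":
    q = ("Mia puts her sandwich in the fridge and goes out. Her brother "
         "moves it to the counter. Where will Mia look for the sandwich?")
    print(build_tom_prompt(q))
```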
Related papers
- (WhyPHI) Fine-Tuning PHI-3 for Multiple-Choice Question Answering: Methodology, Results, and Challenges [0.0]
This work explores the potential of Microsoft's PHI-3, a compact yet efficient LLM, for answering multiple-choice questions.
Results show a remarkable improvement in PHI-3.5's MCQ handling post-fine-tuning, with perplexity decreasing from 4.68 to 2.27, and accuracy rising from 62% to 90.8%.
arXiv Detail & Related papers (2025-01-03T00:56:46Z) - LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.
We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.
The LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
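As a rough illustration of the generate-then-verify loop summarized above, the sketch below samples several candidate reasoning steps and keeps the one a verifier scores highest. The `generate_candidates` and `score_step` callables are stand-ins, not the paper's actual generator or process-based verifier.

```python
# Illustrative generator/verifier loop in the spirit of LLM2: the LLM proposes
# candidate next steps, a verifier scores each, and the best one is kept.
# Both toy functions below are placeholders, not the paper's components.
from typing import Callable, List


def select_step(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],
    score_step: Callable[[str, str], float],
    n_candidates: int = 4,
) -> str:
    """Sample n_candidates continuations and return the one the
    verifier rates as most desirable."""
    candidates = generate_candidates(prompt, n_candidates)
    return max(candidates, key=lambda step: score_step(prompt, step))


def toy_generator(prompt: str, n: int) -> List[str]:
    return [f"candidate step {i} for: {prompt[:20]}..." for i in range(n)]


def toy_verifier(prompt: str, step: str) -> float:
    return float(len(step))  # trivial scoring rule for illustration


if __name__ == "__main__":
    print(select_step("Solve 17 * 24 step by step.", toy_generator, toy_verifier))
```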
arXiv Detail & Related papers (2024-12-29T06:32:36Z) - LEAF: Learning and Evaluation Augmented by Fact-Checking to Improve Factualness in Large Language Models [11.453585039783901]
LEAF (Learning and Evaluation Augmented by Fact-Checking) is a novel approach designed to enhance the factual reliability of large language models (LLMs).
The first strategy, Fact-Check-Then-RAG, improves Retrieval-Augmented Generation (RAG) by incorporating fact-checking results to guide the retrieval process without updating model parameters.
The second strategy, Learning from Fact-Checks via Self-Training, involves supervised fine-tuning (SFT) on fact-checked responses or applying Simple Preference Optimization (SimPO) with fact-checking as a ranking mechanism.
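For the first strategy, a minimal Fact-Check-Then-RAG sketch is shown below: draft claims are fact-checked, and the unsupported ones are used as extra retrieval queries before regenerating an answer. The `fact_check` and `retrieve` helpers are invented placeholders and do not reflect LEAF's actual implementation.

```python
# Hedged sketch of a Fact-Check-Then-RAG style pipeline: fact-check a draft
# answer's claims and retrieve evidence only for claims that fail, then build
# an evidence-grounded regeneration prompt. All helpers are placeholders.
from typing import List


def fact_check(claims: List[str]) -> List[bool]:
    """Placeholder: return True for claims judged supported."""
    return [not c.endswith("?") for c in claims]  # toy rule


def retrieve(query: str) -> List[str]:
    """Placeholder retriever returning supporting passages."""
    return [f"passage about: {query}"]


def fact_check_then_rag(question: str, draft_claims: List[str]) -> str:
    # Re-retrieve only for claims the fact-checker flags as unsupported.
    flags = fact_check(draft_claims)
    evidence = []
    for claim, ok in zip(draft_claims, flags):
        if not ok:
            evidence.extend(retrieve(claim))
    context = "\n".join(evidence)
    return f"Question: {question}\nEvidence:\n{context}\nAnswer:"


if __name__ == "__main__":
    print(fact_check_then_rag(
        "Who discovered penicillin?",
        ["Penicillin was discovered in 1928.", "Was it Fleming?"],
    ))
```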
arXiv Detail & Related papers (2024-10-31T00:18:05Z) - GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input.
GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak).
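The observe / reflect / speak flow described above can be sketched as a small loop over a toy knowledge graph. The tiny graph, the relevance rule, and the prompt templates below are invented for the example and are not the method's real components.

```python
# Illustrative observe / reflect / speak loop in the spirit of GIVE.
from typing import List, Tuple

KG: List[Tuple[str, str, str]] = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "inhibits", "COX enzymes"),
    ("ibuprofen", "treats", "inflammation"),
]


def observe(query: str) -> List[Tuple[str, str, str]]:
    """Select expert triples whose subject or object appears in the query."""
    q = query.lower()
    return [t for t in KG if t[0] in q or t[2].lower() in q]


def reflect(query: str, triples: List[Tuple[str, str, str]]) -> str:
    """Build a divergent-thinking prompt grounded in the observed triples."""
    facts = "; ".join(f"{s} {r} {o}" for s, r, o in triples)
    return f"Known facts: {facts}. Consider how they bear on: {query}"


def speak(reflection: str) -> str:
    """Final-answer prompt assembled from the reflection step."""
    return reflection + "\nNow give a concise final answer."


if __name__ == "__main__":
    q = "Why might aspirin relieve a headache?"
    print(speak(reflect(q, observe(q))))
```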
arXiv Detail & Related papers (2024-10-11T03:05:06Z) - A Comprehensive Evaluation of Large Language Models on Mental Illnesses [0.8458496687170665]
GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets.
Prompt engineering played a crucial role in enhancing model performance.
Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering.
arXiv Detail & Related papers (2024-09-24T02:58:52Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
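Treating reasoning as search, as the summary above describes, can be sketched with a simple beam search over partial reasoning paths: expand each path, score the expansions, and keep the best few. The `expand` and `score` functions below are toy stand-ins, not MindStar's actual proposal or reward model.

```python
# Hedged sketch of inference-time search over reasoning paths: a plain beam
# search where partial solutions are expanded step by step and scored.
from typing import Callable, List


def beam_search_reasoning(
    question: str,
    expand: Callable[[str], List[str]],
    score: Callable[[str], float],
    beam_width: int = 2,
    depth: int = 3,
) -> str:
    beams = [question]
    for _ in range(depth):
        # Expand every partial path, then keep the top-scoring beams.
        expanded = [path + "\n" + step for path in beams for step in expand(path)]
        beams = sorted(expanded, key=score, reverse=True)[:beam_width]
    return beams[0]


# Toy stand-ins so the sketch runs end to end.
def toy_expand(path: str) -> List[str]:
    return [f"step ({len(path)} chars so far, option {i})" for i in range(3)]


def toy_score(path: str) -> float:
    return -abs(len(path) - 200)  # prefer paths near an arbitrary length


if __name__ == "__main__":
    print(beam_search_reasoning("What is 12 * 34?", toy_expand, toy_score))
```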
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - Mixed Distillation Helps Smaller Language Model Better Reasoning [27.934081882868902]
We introduce the Mixed Distillation (MD) framework, which capitalizes on the strengths of Program-of-Thought (PoT) and Chain-of-Thought (CoT) capabilities within large language models (LLMs).
Our experimental results show that MD significantly enhances the single-path and multi-path reasoning ability of smaller models in various tasks.
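One way to picture mixed distillation is the data-construction step: for each question, collect both a CoT rationale and a PoT program from a teacher model as two training targets for a smaller student. The `teacher` call and the record layout below are assumptions for illustration, not the paper's pipeline.

```python
# Illustrative construction of mixed distillation data: each question yields
# a chain-of-thought target and a program-of-thought target from a teacher.
from typing import Dict, List


def teacher(prompt: str) -> str:
    """Placeholder for a large teacher model's completion."""
    return f"<teacher output for: {prompt[:30]}...>"


def build_mixed_example(question: str) -> List[Dict[str, str]]:
    cot = teacher(f"{question}\nExplain your reasoning step by step, then answer.")
    pot = teacher(f"{question}\nWrite a short Python program that computes the answer.")
    return [
        {"input": question, "target": cot, "style": "cot"},
        {"input": question, "target": pot, "style": "pot"},
    ]


if __name__ == "__main__":
    q = "A shirt costs $20 after a 20% discount. What was the original price?"
    for ex in build_mixed_example(q):
        print(ex["style"], "->", ex["target"])
```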
arXiv Detail & Related papers (2023-12-17T14:28:28Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning.
This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation.
We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency.
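The three-part framework summarized above (a verbalized-confidence prompt, repeated sampling, and a consistency-based aggregation) can be sketched as follows. The `sample_model` callable and the particular aggregation rule are assumptions for illustration, not the paper's exact method.

```python
# Hedged sketch of black-box confidence elicitation: prompt for a verbalized
# confidence, sample several responses, and aggregate by answer consistency.
from collections import Counter
from typing import Callable, Tuple


def elicit_confidence(
    question: str,
    sample_model: Callable[[str], Tuple[str, float]],
    n_samples: int = 5,
) -> Tuple[str, float]:
    prompt = (f"{question}\nGive your answer, then state your confidence "
              f"between 0 and 1.")
    answers, stated = [], []
    for _ in range(n_samples):
        answer, conf = sample_model(prompt)
        answers.append(answer)
        stated.append(conf)
    # Consistency: fraction of samples agreeing with the majority answer.
    top_answer, count = Counter(answers).most_common(1)[0]
    consistency = count / n_samples
    # One simple aggregation: mean verbalized confidence scaled by consistency.
    return top_answer, consistency * (sum(stated) / n_samples)


if __name__ == "__main__":
    def toy_sampler(prompt: str) -> Tuple[str, float]:
        return "42", 0.8  # deterministic toy model

    print(elicit_confidence("What is 6 * 7?", toy_sampler))
```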
arXiv Detail & Related papers (2023-06-22T17:31:44Z) - LIMA: Less Is More for Alignment [112.93890201395477]
We train LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses.
LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples.
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases.
arXiv Detail & Related papers (2023-05-18T17:45:22Z) - Evaluating Large Language Models in Theory of Mind Tasks [11.622327857276389]
Eleven Large Language Models (LLMs) were assessed using a custom-made battery of false-belief tasks.
The battery included 640 prompts spread across 40 diverse tasks, each one including a false-belief scenario.
To solve a single task, a model needed to correctly answer 16 prompts across all eight scenarios.
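The all-or-nothing scoring rule summarized above (a task counts as solved only if every one of its prompts is answered correctly) amounts to a small aggregation step. The data layout below is invented for illustration.

```python
# Sketch of the battery's scoring rule: a task is solved only if all of its
# prompts are answered correctly; accuracy is the fraction of solved tasks.
from typing import Dict, List


def task_solved(prompt_results: List[bool]) -> bool:
    """A task is solved only if all of its prompts are answered correctly."""
    return all(prompt_results)


def battery_accuracy(results_by_task: Dict[str, List[bool]]) -> float:
    solved = sum(task_solved(r) for r in results_by_task.values())
    return solved / len(results_by_task)


if __name__ == "__main__":
    # Two toy tasks with 16 prompts each: one fully correct, one with a miss.
    demo = {
        "task_01": [True] * 16,
        "task_02": [True] * 15 + [False],
    }
    print(battery_accuracy(demo))  # 0.5
```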
arXiv Detail & Related papers (2023-02-04T03:50:01Z)