LLMs as Method Actors: A Model for Prompt Engineering and Architecture
- URL: http://arxiv.org/abs/2411.05778v2
- Date: Mon, 11 Nov 2024 21:09:42 GMT
- Title: LLMs as Method Actors: A Model for Prompt Engineering and Architecture
- Authors: Colin Doyle
- Abstract summary: We introduce "Method Actors" as a mental model for guiding LLM prompt engineering and prompt architecture.
We show that a "Method Actors" approach can significantly improve LLM performance over both a vanilla and "Chain of Thoughts" approach.
We also test OpenAI's newest model designed specifically for complex reasoning tasks, o1-preview.
- Abstract: We introduce "Method Actors" as a mental model for guiding LLM prompt engineering and prompt architecture. Under this mental model, LLMs should be thought of as actors; prompts as scripts and cues; and LLM responses as performances. We apply this mental model to the task of improving LLM performance at playing Connections, a New York Times word puzzle game that prior research identified as a challenging benchmark for evaluating LLM reasoning. Our experiments with GPT-4o show that a "Method Actors" approach can significantly improve LLM performance over both a vanilla and "Chain of Thoughts" approach. A vanilla approach solves 27% of Connections puzzles in our dataset and a "Chain of Thoughts" approach solves 41% of puzzles, whereas our strongest "Method Actor" approach solves 86% of puzzles. We also test OpenAI's newest model designed specifically for complex reasoning tasks, o1-preview. When asked to solve a puzzle all at once, o1-preview solves 79% of Connections puzzles in our dataset, and when allowed to build puzzle solutions one guess at a time over multiple API calls, o1-preview solves 100% of the puzzles. Incorporating a "Method Actor" prompt architecture increases the percentage of puzzles that o1-preview solves perfectly from 76% to 87%.
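To make the architecture concrete, here is a minimal sketch of a multi-call "Method Actor" loop for Connections, assuming the OpenAI Python client. The persona framing and prompt wording are illustrative, not the paper's actual scripts.

```python
# A sketch of a "Method Actor" prompt architecture for Connections,
# assuming the OpenAI Python client. Prompts are illustrative only.
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You are a world-class puzzle solver performing the role of a "
    "Connections champion. Stay in character: reason aloud about word "
    "senses, then commit to exactly one group of four words."
)

def next_guess(remaining_words: list[str], solved_groups: list[str]) -> str:
    """One API call per guess, so the solution is built one 'performance' at a time."""
    prompt = (
        f"Remaining words: {', '.join(remaining_words)}\n"
        f"Groups already solved: {solved_groups}\n"
        "Propose the ONE group of four words you are most confident in, "
        "with its connecting theme."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content
```

A vanilla baseline would instead request all four groups in a single call; the multi-call, in-character structure above is the architectural difference the paper measures.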
Related papers
- On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes.
One hypothesis is that the increasingly high and nearly saturated performance could be due to the memorization of similar problems.
We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z)
- Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter? [36.14795256060537]
First, we develop GridPuzzle, an evaluation dataset comprising 274 grid-based puzzles of varying complexity.
Second, we propose a new error taxonomy derived from manual analysis of reasoning chains from LLMs including GPT-4, Claude-3, Gemini, Mistral, and Llama-2.
Third, we develop an LLM-based framework for large-scale subjective evaluation (i.e., identifying errors) and an objective metric, PuzzleEval, to evaluate the correctness of reasoning chains.
arXiv Detail & Related papers (2024-07-20T07:43:07Z)
- Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries [39.438904598467154]
We study how large language models (LLMs) solve complex multi-step problems.
Understanding how the latent intermediate step is computed internally is key to understanding the overall computation.
We propose a novel "back-patching" analysis method whereby a hidden representation from a later layer is patched back to an earlier layer.
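A minimal sketch of what a back-patching experiment can look like, assuming GPT-2 via Hugging Face transformers: cache a later layer's hidden state at one token position, then re-run the model with that vector patched into an earlier layer. The layer indices and position below are illustrative, not the paper's settings.

```python
# Back-patching sketch: patch a later-layer hidden state into an earlier
# layer via a forward pre-hook. Layers/position are hypothetical choices.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LATE, EARLY, POS = 10, 4, -1  # hypothetical layer indices / token position

inputs = tok("The capital of the country north of Spain is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
cached = out.hidden_states[LATE][:, POS, :]  # later-layer representation

def patch(module, args):
    hidden = args[0].clone()
    hidden[:, POS, :] = cached  # overwrite the earlier layer's input at POS
    return (hidden,) + args[1:]

handle = model.transformer.h[EARLY].register_forward_pre_hook(patch)
with torch.no_grad():
    patched_out = model(**inputs)  # compare logits against the clean run
handle.remove()
```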
arXiv Detail & Related papers (2024-06-18T16:44:13Z)
- Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems [59.72548591120689]
We introduce a new benchmark, SearchBench, containing 11 unique search problem types.
We show that even the most advanced LLMs fail to solve these problems end-to-end in text.
Instructing LLMs to generate code that solves the problem helps, but only slightly; e.g., GPT-4's performance rises to 11.7%.
arXiv Detail & Related papers (2024-06-18T00:44:58Z)
- MasonTigers at SemEval-2024 Task 9: Solving Puzzles with an Ensemble of Chain-of-Thoughts [5.91695168183101]
This paper presents team MasonTigers' submission to SemEval-2024 Task 9.
The task provides a dataset of puzzles for testing natural language understanding.
We employ large language models (LLMs) to solve this task through several prompting techniques.
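As a rough illustration, the sketch below majority-votes over several sampled chain-of-thought completions (self-consistency-style voting). The team's actual submission ensembles multiple distinct prompting techniques, which this does not reproduce.

```python
# Self-consistency-style voting over sampled chain-of-thought answers,
# assuming the OpenAI Python client. A generic sketch, not the team's system.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def cot_answer(puzzle: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.8,  # nonzero temperature so chains diverge
        messages=[{"role": "user", "content":
                   f"{puzzle}\nThink step by step, then give the final "
                   "answer on the last line as 'Answer: <answer>'."}],
    )
    return resp.choices[0].message.content.rsplit("Answer:", 1)[-1].strip()

def ensemble(puzzle: str, n: int = 5) -> str:
    votes = Counter(cot_answer(puzzle) for _ in range(n))
    return votes.most_common(1)[0][0]  # majority vote across chains
```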
arXiv Detail & Related papers (2024-03-22T06:31:49Z)
- Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model.
We employ a proxy model which has far fewer parameters, and take its answers as heuristic answers.
Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
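A minimal sketch of the idea under stated assumptions: a small proxy model drafts a heuristic answer, and a placeholder judge decides whether retrieval is needed. The paper's judging and retrieval components are learned; the hedging-word heuristic and retriever stub here are hypothetical.

```python
# Proxy-model-gated retrieval sketch, assuming Hugging Face transformers.
from transformers import pipeline

proxy = pipeline("text-generation", model="gpt2")  # slim proxy model

def my_search_engine(query: str) -> list[str]:
    return ["<retrieved document>"]  # hypothetical retriever stub

def answer_with_optional_retrieval(question: str):
    draft = proxy(question, max_new_tokens=40)[0]["generated_text"]
    # Placeholder judge: hedging words in the draft signal missing knowledge.
    uncertain = any(w in draft.lower() for w in ("unknown", "not sure", "unclear"))
    docs = my_search_engine(question) if uncertain else []
    return draft, docs  # hand the draft plus any retrieved docs to the large LLM
```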
arXiv Detail & Related papers (2024-02-19T11:11:08Z)
- PuzzleBench: Can LLMs Solve Challenging First-Order Combinatorial Reasoning Problems? [27.696027301600793]
We present PuzzleBench, a dataset of 31 such challenging problems along with a few solved instances for each problem.
These problems are all first order, i.e., they can be instantiated with problem instances of varying sizes, and most of them are NP-hard.
We first observe that LLMs, even when aided by symbolic solvers, perform rather poorly on our dataset.
In response, we propose a new approach, Puzzle-LM, which combines LLMs with both symbolic solvers and program interpreters.
arXiv Detail & Related papers (2024-02-04T20:56:09Z)
- Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named "Rephrase and Respond" (RaR), which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
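A minimal sketch of the two-step variant, assuming the OpenAI Python client: one call rephrases and expands the question, a second call answers the rephrased question. The prompt wording is paraphrased, not the paper's exact prompt.

```python
# Two-step Rephrase-and-Respond sketch with the OpenAI Python client.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def rephrase_and_respond(question: str) -> str:
    # Step 1: the model rewrites the question in its own terms.
    rephrased = ask(
        f'Given the question "{question}", rephrase and expand it to help '
        "you answer better. Output only the rephrased question."
    )
    # Step 2: the model answers its own rephrasing.
    return ask(rephrased)
```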
arXiv Detail & Related papers (2023-11-07T18:43:34Z)
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in a "tit for tat" exchange and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
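A minimal sketch of the debate control flow under stated assumptions: two debaters respond to each other for a fixed number of rounds, then a judge issues the final answer. The paper's framework manages the debate adaptively; this shows only the basic loop.

```python
# Multi-agent debate loop sketch, assuming the OpenAI Python client.
from openai import OpenAI

client = OpenAI()

def say(role_desc: str, content: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": role_desc},
                  {"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

def debate(question: str, rounds: int = 2) -> str:
    affirmative = negative = ""
    for _ in range(rounds):
        # Each debater sees the opponent's latest argument ("tit for tat").
        affirmative = say("You argue FOR your own answer.",
                          f"Question: {question}\nOpponent said: {negative}")
        negative = say("You challenge the other side's answer.",
                       f"Question: {question}\nOpponent said: {affirmative}")
    return say("You are the judge. Decide the final answer.",
               f"Question: {question}\nSide A: {affirmative}\nSide B: {negative}")
```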
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
- PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PAL), which use an LLM to read natural language problems and generate programs as intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
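A minimal sketch of the PAL pattern: the LLM is prompted to emit Python that stores its answer in a variable, and the interpreter, not the model, computes the result. The prompt and variable name are assumptions, and executing model-generated code is unsafe outside a sandbox.

```python
# PAL-style sketch: the model writes code, the interpreter runs it.
from openai import OpenAI

client = OpenAI()

def pal_solve(problem: str):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   "Write Python code that solves this problem and stores "
                   "the final answer in a variable named result. "
                   f"Output only code.\n\nProblem: {problem}"}],
    )
    code = resp.choices[0].message.content
    namespace = {}
    exec(code, namespace)  # offload the solution step to the interpreter (sandbox in practice!)
    return namespace.get("result")
```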
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.