StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving
- URL: http://arxiv.org/abs/2311.08803v4
- Date: Sat, 09 Nov 2024 13:52:50 GMT
- Title: StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving
- Authors: Chang Gao, Haiyun Jiang, Deng Cai, Shuming Shi, Wai Lam
- Abstract summary: StrategyLLM allows LLMs to perform inductive reasoning, deriving general strategies from specific task instances, and deductive reasoning, applying these general strategies to particular task examples, for constructing generalizable and consistent few-shot prompts.
Experimental results demonstrate that StrategyLLM outperforms the competitive baseline CoT-SC, which requires human-annotated solutions, on 13 datasets across 4 challenging tasks without human involvement, including math reasoning (34.2% $\rightarrow$ 38.8%), commonsense reasoning (70.3% $\rightarrow$ 72.5%), algorithmic reasoning (73.7% $\rightarrow$ 85.0%), and symbolic reasoning (30.0% $\rightarrow$ 79.2%).
- Abstract: Most existing prompting methods suffer from the issues of generalizability and consistency, as they often rely on instance-specific solutions that may not be applicable to other instances and lack task-level consistency across the selected few-shot examples. To address these limitations, we propose a comprehensive framework, StrategyLLM, allowing LLMs to perform inductive reasoning, deriving general strategies from specific task instances, and deductive reasoning, applying these general strategies to particular task examples, for constructing generalizable and consistent few-shot prompts. It employs four LLM-based agents: strategy generator, executor, optimizer, and evaluator, working together to generate, evaluate, and select promising strategies for a given task. Experimental results demonstrate that StrategyLLM outperforms the competitive baseline CoT-SC that requires human-annotated solutions on 13 datasets across 4 challenging tasks without human involvement, including math reasoning (34.2\% $\rightarrow$ 38.8\%), commonsense reasoning (70.3\% $\rightarrow$ 72.5\%), algorithmic reasoning (73.7\% $\rightarrow$ 85.0\%), and symbolic reasoning (30.0\% $\rightarrow$ 79.2\%). Further analysis reveals that StrategyLLM is applicable to various LLMs and demonstrates advantages across numerous scenarios.
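The four-agent loop described in the abstract can be pictured with a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the `complete` helper stands in for any chat-completion API, and the prompt templates, candidate count, and accuracy threshold are invented for illustration.
```python
# Minimal sketch of the StrategyLLM generate/execute/optimize/evaluate loop.
# `complete` is a stub for any LLM client; all prompts are illustrative only.

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def generate_strategies(task: str, examples: list, k: int = 3) -> list[str]:
    """Strategy generator: induce k general strategies from task instances."""
    prompt = (f"Task: {task}\nExamples:\n{examples}\n"
              f"Propose {k} general problem-solving strategies, one per line.")
    return complete(prompt).splitlines()[:k]

def execute(strategy: str, example: dict) -> str:
    """Strategy executor: apply a general strategy to one instance (deduction)."""
    return complete(f"Strategy: {strategy}\nProblem: {example['question']}\n"
                    "Solve step by step, then give the final answer.")

def evaluate(strategy: str, dev_set: list[dict]) -> float:
    """Strategy evaluator: score a strategy by execution accuracy on a dev set."""
    answers = [execute(strategy, ex) for ex in dev_set]
    return sum(ex["answer"] in a for ex, a in zip(dev_set, answers)) / len(dev_set)

def optimize(strategy: str, feedback: str) -> str:
    """Strategy optimizer: revise a strategy using execution feedback."""
    return complete(f"Strategy: {strategy}\nFeedback: {feedback}\nRevise the strategy.")

def strategy_llm(task: str, dev_set: list[dict], threshold: float = 0.8) -> str:
    candidates = generate_strategies(task, dev_set[:2])
    kept = []
    for s in candidates:
        score = evaluate(s, dev_set)
        if score < threshold:                  # below threshold: optimize once
            s = optimize(s, f"dev accuracy {score:.2f}")
            score = evaluate(s, dev_set)
        kept.append((score, s))
    return max(kept)[1]                        # keep the most promising strategy
```
The abstract mentions selecting promising strategies in the plural; this sketch keeps only the single best one for brevity.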
Related papers
- HPSS: Heuristic Prompting Strategy Search for LLM Evaluators [81.09765876000208]
We propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS).
Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for evaluators.
Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS.
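A toy rendering of the genetic-algorithm-style search the summary describes might look as follows; the component pools, mutation rate, and the `score_evaluator` fitness stub are hypothetical stand-ins, not HPSS's actual design.
```python
# Illustrative genetic-style search over prompting strategies in the spirit
# of HPSS; component pools and hyperparameters are invented for illustration.
import random

COMPONENTS = {                       # factors composing a prompting strategy
    "scoring": ["1-5 scale", "0-100 scale", "pairwise"],
    "rubric": ["with rubric", "without rubric"],
    "cot": ["think step by step", "answer directly"],
}

def random_strategy() -> dict:
    return {k: random.choice(v) for k, v in COMPONENTS.items()}

def mutate(strategy: dict, rate: float = 0.3) -> dict:
    child = dict(strategy)
    for k, pool in COMPONENTS.items():
        if random.random() < rate:
            child[k] = random.choice(pool)
    return child

def score_evaluator(strategy: dict) -> float:
    """Fitness stub: agreement of the LLM evaluator with human judgments."""
    raise NotImplementedError("plug in an evaluation run here")

def hpss_search(generations: int = 10, population: int = 8, keep: int = 4) -> dict:
    pool = [random_strategy() for _ in range(population)]
    for _ in range(generations):
        elite = sorted(pool, key=score_evaluator, reverse=True)[:keep]
        pool = elite + [mutate(random.choice(elite)) for _ in range(population - keep)]
    return max(pool, key=score_evaluator)
```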
arXiv Detail & Related papers (2025-02-18T16:46:47Z)
- Doing More with Less -- Implementing Routing Strategies in Large Language Model-Based Systems: An Extended Survey [1.430963201405577]
Most Large Language Model (LLM)-based systems rely on a single LLM for all user queries.
However, different queries often require different preprocessing strategies, levels of reasoning, or knowledge.
This paper explores key considerations for integrating routing into LLM-based systems.
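As a toy illustration of the routing idea, a system might classify each query and dispatch it to a differently sized model; the heuristic and model names below are invented.
```python
# Hypothetical query router: cheap heuristic classification, then dispatch.
MODELS = {"simple": "small-model", "complex": "large-model"}

def classify(query: str) -> str:
    # invented heuristic: long queries go to the large model
    return "complex" if len(query.split()) > 30 else "simple"

def route(query: str, complete_with) -> str:
    # complete_with(model_name, prompt) is an assumed two-argument LLM helper
    return complete_with(MODELS[classify(query)], query)
```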
arXiv Detail & Related papers (2025-02-01T12:08:38Z) - How Strategic Agents Respond: Comparing Analytical Models with LLM-Generated Responses in Strategic Classification [9.296248945826084]
We propose using strategic advice generated by large language models to simulate human agent responses in Strategic Classification (SC).
We examine five critical SC scenarios -- hiring, loan applications, school admissions, personal income, and public assistance programs.
We then compare the resulting agent responses with the best responses generated by existing theoretical models.
arXiv Detail & Related papers (2025-01-20T01:39:03Z) - Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation [16.350747493026432]
The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for enhancing the reasoning capabilities of large language models (LLMs).
We propose Strategic Chain-of-Thought (SCoT) to refine LLM performance by integrating strategic knowledge prior to generating intermediate reasoning steps.
SCoT employs a two-stage approach within a single prompt: first eliciting an effective problem-solving strategy, which is then used to guide the generation of high-quality CoT paths and final answers.
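The two-stage, single-prompt structure could be sketched as a template like the one below; the wording is an assumption, not the paper's exact prompt, and `complete` stands in for any LLM client.
```python
# Sketch of SCoT's two stages folded into a single prompt.
SCOT_TEMPLATE = """\
Problem: {problem}

Stage 1 -- Strategy: Before solving, state the most effective general
strategy (method, formula, or plan) for this kind of problem.

Stage 2 -- Solution: Apply the strategy above step by step and end with
"Answer:" followed by the final answer.
"""

def scot(problem: str, complete) -> str:
    return complete(SCOT_TEMPLATE.format(problem=problem))
```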
arXiv Detail & Related papers (2024-09-05T06:28:05Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Meta Reasoning for Large Language Models [58.87183757029041]
We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs).
MRP guides LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task.
We evaluate the effectiveness of MRP through comprehensive benchmarks.
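In spirit, MRP amounts to a system prompt that asks the model to pick a reasoning method before solving; the method pool and wording below are illustrative assumptions, not the paper's actual prompt.
```python
# Sketch of meta-reasoning prompting: select a method, then apply it.
METHODS = ["Chain-of-Thought", "Step-Back", "Decomposition", "Analogical reasoning"]

MRP_SYSTEM = (
    "You are given a task. First name the single most suitable reasoning "
    f"method from: {', '.join(METHODS)}; then solve the task with it."
)

def meta_reason(task: str, complete) -> str:
    # `complete` is an assumed chat-completion helper
    return complete(f"{MRP_SYSTEM}\n\nTask: {task}")
```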
arXiv Detail & Related papers (2024-06-17T16:14:11Z)
- Self-Guiding Exploration for Combinatorial Problems [2.636330943305939]
Self-Guiding Exploration (SGE) is designed to enhance performance in solving Combinatorial Problems (CPs).
SGE operates autonomously, generating multiple thought trajectories for each CP task.
It then breaks these trajectories down into actionable subtasks, executes them sequentially, and refines the results to ensure optimal outcomes.
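The pipeline reads as a four-step loop, sketched below under the assumption of a generic `complete` helper; the prompts and trajectory count are illustrative.
```python
# Sketch of the SGE pipeline: trajectories -> subtasks -> execution -> refinement.
def sge(cp_task: str, complete, n_trajectories: int = 3) -> str:
    results = []
    for _ in range(n_trajectories):
        # 1. Generate one high-level thought trajectory for the CP instance.
        trajectory = complete(f"Outline an approach to solve: {cp_task}")
        # 2. Break the trajectory into actionable subtasks.
        subtasks = complete(f"Split into numbered subtasks:\n{trajectory}").splitlines()
        # 3. Execute subtasks sequentially, feeding each result forward.
        state = cp_task
        for sub in subtasks:
            state = complete(f"Context: {state}\nDo subtask: {sub}")
        results.append(state)
    # 4. Refine: merge the trajectories' outcomes into one final solution.
    return complete("Pick or merge the best solution:\n" + "\n---\n".join(results))
```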
arXiv Detail & Related papers (2024-05-28T08:26:54Z)
- Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning [56.82041895921434]
Open-source pre-trained Large Language Models (LLMs) exhibit strong language understanding and generation capabilities.
However, when used as agents to tackle complex real-world problems, their performance is far inferior to that of large commercial models such as ChatGPT and GPT-4.
arXiv Detail & Related papers (2024-03-29T03:48:12Z)
- DRDT: Dynamic Reflection with Divergent Thinking for LLM-based Sequential Recommendation [53.62727171363384]
We introduce a novel reasoning principle: Dynamic Reflection with Divergent Thinking.
Our methodology centers on dynamic reflection, a process that emulates human learning through probing, critiquing, and reflecting.
We evaluate our approach on three datasets using six pre-trained LLMs.
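A dynamic-reflection loop in this spirit might be sketched as follows; the prompts and iteration count are assumptions, with `complete` standing in for any LLM client.
```python
# Sketch of a probe-critique-reflect loop for sequential recommendation.
def dynamic_reflection(history: list[str], complete, rounds: int = 3) -> str:
    answer = complete(f"User history: {history}\nRecommend the next item.")
    for _ in range(rounds):
        # Critique the current recommendation, then reflect and revise.
        critique = complete(f"History: {history}\nRecommendation: {answer}\n"
                            "Critique this recommendation: what does it miss?")
        answer = complete(f"History: {history}\nPrevious: {answer}\n"
                          f"Critique: {critique}\nGive an improved recommendation.")
    return answer
```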
arXiv Detail & Related papers (2023-12-18T16:41:22Z)
- Which is better? Exploring Prompting Strategy For LLM-based Metrics [6.681126871165601]
This paper describes the DSBA submissions to the Prompting Large Language Models as Explainable Metrics shared task.
Traditional similarity-based metrics such as BLEU and ROUGE have been shown to misalign with human evaluation and are ill-suited for open-ended generation tasks.
arXiv Detail & Related papers (2023-11-07T06:36:39Z)