Soft Self-Consistency Improves Language Model Agents
- URL: http://arxiv.org/abs/2402.13212v2
- Date: Wed, 5 Jun 2024 19:50:19 GMT
- Title: Soft Self-Consistency Improves Language Model Agents
- Authors: Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
- Abstract summary: Current "sample and select" methods rely on majority voting to score answers.
Soft Self-Consistency (SOFT-SC) replaces SC's discontinuous scoring with a continuous score computed from model likelihoods.
For a fixed number of samples, SOFT-SC leads to a 1.3% increase over SC in absolute success rate on writing bash programs, a 6.6% increase on online shopping, and a 4.7% increase for an interactive household game.
- Score: 57.66282463340297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generations from large language models (LLMs) can be improved by sampling and scoring multiple solutions to select a final answer. Current "sample and select" methods such as self-consistency (SC) rely on majority voting to score answers. However, when tasks have many distinct and valid answers, selection by voting requires a large number of samples. This makes SC prohibitively expensive for interactive tasks that involve generating multiple actions (answers) sequentially. After establishing that majority voting fails to provide consistent gains on such tasks, we demonstrate how to increase success rates by softening the scoring criterion. We introduce Soft Self-Consistency (SOFT-SC), which replaces SC's discontinuous scoring with a continuous score computed from model likelihoods, allowing for selection even when actions are sparsely distributed. SOFT-SC improves both performance and efficiency on long-horizon interactive tasks, requiring half as many samples as SC for comparable or better performance. For a fixed number of samples, SOFT-SC leads to a 1.3% increase over SC in absolute success rate on writing bash programs, a 6.6% increase on online shopping (WebShop), and a 4.7% increase for an interactive household game (ALFWorld). Finally, we show that SOFT-SC can be applied to both open-source and black-box models.
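As a reading aid (this is not the authors' released code), here is a minimal Python sketch of the contrast the abstract describes: standard SC picks the most-voted answer, while a soft, likelihood-based score lets us rank candidates even when every sampled action is distinct. The mean-token-log-probability aggregation, the helper names, and the toy bash commands are illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(candidates: List[str]) -> str:
    """Standard self-consistency (SC): pick the most frequent candidate answer.
    With many distinct valid actions, every count may be 1 and the vote is uninformative."""
    return Counter(candidates).most_common(1)[0][0]


def soft_score(token_logprobs: List[float]) -> float:
    """Illustrative continuous score: mean token log-probability of a sampled action.
    (An assumption for this sketch; the paper's exact aggregation may differ.)"""
    return sum(token_logprobs) / max(len(token_logprobs), 1)


def soft_self_consistency(samples: List[Tuple[str, List[float]]]) -> str:
    """Select the sampled action with the highest likelihood-based score."""
    return max(samples, key=lambda s: soft_score(s[1]))[0]


# Three distinct bash commands: majority voting cannot break the tie,
# but the soft score still ranks them by model confidence.
samples = [
    ("grep -rn 'TODO' .", [-0.2, -0.1, -0.3]),
    ("find . -name '*.sh'", [-1.5, -0.9, -1.2]),
    ("ls -la | grep TODO", [-0.8, -0.6, -0.7]),
]
print(majority_vote([cmd for cmd, _ in samples]))  # arbitrary pick: every candidate appears once
print(soft_self_consistency(samples))              # picks the highest mean log-probability
```

In the toy example, each command appears only once, so the vote reduces to an arbitrary tie-break, while the soft score prefers the candidate with the highest average log-probability.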
Related papers
- Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling [9.44858963874474]
Self-Consistency (SC) results in significant computational costs proportional to the number of samples generated.
We propose Reasoning-Aware Self-Consistency (RASC), an innovative early-stopping framework that adjusts the number of sample generations.
RASC significantly reduces sample usage by an average of 80% while maintaining or improving accuracy by up to 5% compared to the original SC.
arXiv Detail & Related papers (2024-08-30T05:14:59Z)
- On Speeding Up Language Model Evaluation [48.51924035873411]
Development of prompt-based methods with Large Language Models (LLMs) requires making numerous decisions.
We propose a novel method to address this challenge.
We show that it can identify the top-performing method using only 5-15% of the typically needed resources.
arXiv Detail & Related papers (2024-07-08T17:48:42Z)
- Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation [20.138831477848615]
We propose Fine-Grained Self-Consistency (FSC) to optimize output quality by effectively integrating fine-grained consensus knowledge from multiple samples.
The effectiveness of FSC is demonstrated through extensive experiments on various tasks, including summarization, code generation, and mathematical reasoning.
arXiv Detail & Related papers (2024-07-02T08:38:31Z)
- Atomic Self-Consistency for Better Long Form Generations [12.753854064540636]
Atomic Self-Consistency (ASC) is a technique for improving the recall of relevant information in a long-form response.
ASC follows recent work, Universal Self-Consistency (USC), in using multiple samples to improve the long-form response.
Through extensive experiments and ablations, we show that merging relevant subparts of multiple samples performs significantly better than picking a single sample.
arXiv Detail & Related papers (2024-05-21T18:05:44Z)
- Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning [15.088675135566646]
Self-consistency (SC) has been a widely used decoding strategy for chain-of-thought reasoning.
We propose a simple and scalable sampling process, Early-Stopping Self-Consistency (ESC), to greatly reduce the cost of SC without sacrificing performance (a minimal sketch of this family of early-stopping samplers appears after the related-papers list).
arXiv Detail & Related papers (2024-01-19T04:03:59Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Universal Self-Consistency for Large Language Model Generation [72.6761480346095]
Self-consistency with chain-of-thought prompting (CoT) has demonstrated remarkable performance gains on challenging tasks.
We propose Universal Self-Consistency (USC), which leverages large language models (LLMs) to select the most consistent answer.
arXiv Detail & Related papers (2023-11-29T02:07:09Z)
- Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [60.58434523646137]
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency.
We introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question.
Our experiments show that Adaptive-Consistency reduces the sample budget by up to 7.9 times with an average accuracy drop of less than 0.1%.
arXiv Detail & Related papers (2023-05-19T17:49:25Z)
- SelectAugment: Hierarchical Deterministic Sample Selection for Data Augmentation [72.58308581812149]
We propose an effective approach, dubbed SelectAugment, to select samples to be augmented in a deterministic and online manner.
Specifically, in each batch, we first determine the augmentation ratio, and then decide whether to augment each training sample under this ratio.
In this way, the negative effects of randomness in selecting samples to augment are alleviated and the effectiveness of data augmentation (DA) is improved.
arXiv Detail & Related papers (2021-12-06T08:38:38Z)
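Several of the entries above (RASC, ESC, Adaptive-Consistency) share one idea: draw samples one at a time and stop early once the running majority looks settled, instead of always spending a fixed budget. The Python sketch below illustrates that family under simple assumptions; the threshold-based stopping rule, the parameter defaults, and the function names are illustrative, not any paper's exact criterion.

```python
import random
from collections import Counter
from typing import Callable


def adaptive_self_consistency(
    sample_answer: Callable[[], str],  # draws one answer from the LLM (assumed to be provided)
    max_samples: int = 40,
    min_samples: int = 3,
    threshold: float = 0.8,            # stop once the leading answer holds this fraction of votes
) -> str:
    """Illustrative early-stopping loop: draw answers one at a time and stop as soon as
    one answer clearly dominates, instead of always spending the full sample budget."""
    counts: Counter = Counter()
    for i in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        answer, votes = counts.most_common(1)[0]
        if i >= min_samples and votes / i >= threshold:
            return answer               # confident early: the remaining samples are saved
    return counts.most_common(1)[0][0]  # budget exhausted: fall back to a plain majority vote


# Toy usage with a stand-in sampler (a real sampler would query an LLM).
print(adaptive_self_consistency(lambda: random.choice(["42", "42", "42", "41"])))
```

On easy questions the vote settles after a few draws and the loop returns early; on hard ones it spends the full budget, which is how these methods trade a small accuracy risk for large sample savings.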
This list is automatically generated from the titles and abstracts of the papers in this site.