Evidence to Generate (E2G): A Single-agent Two-step Prompting for
Context Grounded and Retrieval Augmented Reasoning
- URL: http://arxiv.org/abs/2401.05787v1
- Date: Thu, 11 Jan 2024 09:49:15 GMT
- Title: Evidence to Generate (E2G): A Single-agent Two-step Prompting for
Context Grounded and Retrieval Augmented Reasoning
- Authors: Md Rizwan Parvez
- Abstract summary: We introduce Evidence to Generate (E2G), a novel single-agent, two-step prompting framework.
Instead of unverified reasoning claims, E2G focuses exclusively on the thought sequences explicitly mentioned in the context.
tool achieves remarkable results robustly across a wide range of knowledge-intensive reasoning and generation tasks.
- Score: 3.117335706912261
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While chain-of-thought (CoT) prompting has revolutionized how LLMs perform
reasoning tasks, its current methods and variations (e.g, Self-consistency,
ReACT, Reflexion, Tree-of-Thoughts (ToT), Cumulative Reasoning (CR)) suffer
from limitations like slowness, limited context grounding, hallucination and
inconsistent outputs. To overcome these challenges, we introduce Evidence to
Generate (E2G), a novel single-agent, two-step prompting framework. Instead of
unverified reasoning claims, this innovative approach leverages the power of
"evidence for decision making" by first focusing exclusively on the thought
sequences (the series of intermediate steps) explicitly mentioned in the
context which then serve as extracted evidence, guiding the LLM's output
generation process with greater precision and efficiency. This simple yet
powerful approach unlocks the true potential of chain-of-thought like
prompting, paving the way for faster, more reliable, and more contextually
aware reasoning in LLMs. \tool achieves remarkable results robustly across a
wide range of knowledge-intensive reasoning and generation tasks, surpassing
baseline approaches with state-of-the-art LLMs. For example, (i) on LogiQA
benchmark using GPT-4 as backbone model, \tool achieves a new state-of-the
Accuracy of 53.8% exceeding CoT by 18%, ToT by 11%, CR by 9% (ii) a variant of
E2G with PaLM2 outperforms the variable-shot performance of Gemini Ultra by 0.9
F1 points, reaching an F1 score of 83.3 on a subset of DROP.
Related papers
- START: Self-taught Reasoner with Tools [51.38785489790888]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long Chain-of-thought (CoT) reasoning LLM.
START is capable of performing complex computations, self-checking, exploring diverse methods, and self-ging.
It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
arXiv Detail & Related papers (2025-03-06T17:11:51Z) - LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.
We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.
LLMs2 is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z) - Are Your LLMs Capable of Stable Reasoning? [38.03049704515947]
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks.
However, a significant discrepancy persists between benchmark performances and real-world applications.
We introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance.
We present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems.
arXiv Detail & Related papers (2024-12-17T18:12:47Z) - An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking [50.81324768683995]
FIRST is a novel approach that integrates a learning-to-rank objective and leveraging the logits of only the first generated token.
We extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains.
Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality.
arXiv Detail & Related papers (2024-11-08T12:08:17Z) - Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z) - Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales? [19.13886382791074]
This paper investigates an under-explored challenge in large language models (LLMs): chain-of-thought prompting with noisy rationales.
We construct NoRa dataset that is tailored to evaluate the robustness of reasoning in the presence of noisy rationales.
We propose the method of contrastive denoising with noisy chain-of-thought (CD-CoT)
arXiv Detail & Related papers (2024-10-31T12:07:44Z) - Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation [49.27250832754313]
We present AgentCOT, a llm-based autonomous agent framework.
At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence.
We introduce two new strategies to enhance the performance of AgentCOT.
arXiv Detail & Related papers (2024-09-19T02:20:06Z) - Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation [16.350747493026432]
The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for enhancing the reasoning capabilities of large language models (LLMs)
We propose the textbfStrategic Chain-of-Thought (SCoT) to refine LLM performance by integrating strategic knowledge prior to generating intermediate reasoning steps.
SCoT employs a two-stage approach within a single prompt: first eliciting an effective problem-solving strategy, which is then used to guide the generation of high-quality CoT paths and final answers.
arXiv Detail & Related papers (2024-09-05T06:28:05Z) - Beyond Numeric Awards: In-Context Dueling Bandits with LLM Agents [25.825941077332182]
This paper is the first to investigate the performance of Large Language Models (LLMs) as decision-makers in the context of Dueling Bandits (DB)
Our results reveal that LLMs, particularly GPT-4 Turbo, quickly identify the Condorcet winner, thus outperforming existing state-of-the-art algorithms in terms of weak regret.
To overcome these issues, we introduce a hybrid algorithm: LLM-Enhanced Adaptive Dueling (LEAD), which takes advantage of both in-context decision-making capabilities of LLMs and theoretical guarantees inherited from classic DB algorithms.
arXiv Detail & Related papers (2024-07-02T02:18:14Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for
In-Context Learning [66.85379279041128]
In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking to automatically select exemplars for in-context learning.
DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%.
arXiv Detail & Related papers (2023-10-04T16:44:37Z) - Re-Reading Improves Reasoning in Large Language Models [87.46256176508376]
We introduce a simple, yet general and effective prompting method, Re2, to enhance the reasoning capabilities of off-the-shelf Large Language Models (LLMs)
Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), Re2 shifts the focus to the input by processing questions twice, thereby enhancing the understanding process.
We evaluate Re2 on extensive reasoning benchmarks across 14 datasets, spanning 112 experiments, to validate its effectiveness and generality.
arXiv Detail & Related papers (2023-09-12T14:36:23Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.