REFINER: Reasoning Feedback on Intermediate Representations
- URL: http://arxiv.org/abs/2304.01904v2
- Date: Sun, 4 Feb 2024 12:15:18 GMT
- Title: REFINER: Reasoning Feedback on Intermediate Representations
- Authors: Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine
Bosselut, Robert West, and Boi Faltings
- Abstract summary: We introduce REFINER, a framework for finetuning language models to generate intermediate inferences.
REFINER works by interacting with a critic model that provides automated feedback on the reasoning.
Empirical evaluations show significant improvements over baseline LMs of comparable scale.
- Score: 47.36251998678097
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) have recently shown remarkable performance on reasoning
tasks by explicitly generating intermediate inferences, e.g., chain-of-thought
prompting. However, these intermediate inference steps may be inappropriate
deductions from the initial context and lead to incorrect final predictions.
Here we introduce REFINER, a framework for finetuning LMs to explicitly
generate intermediate reasoning steps while interacting with a critic model
that provides automated feedback on the reasoning. Specifically, the critic
provides structured feedback that the reasoning LM uses to iteratively improve
its intermediate arguments. Empirical evaluations of REFINER on three diverse
reasoning tasks show significant improvements over baseline LMs of comparable
scale. Furthermore, when using GPT-3.5 or ChatGPT as the reasoner, the trained
critic significantly improves reasoning without finetuning the reasoner.
Finally, our critic model is trained without expensive human-in-the-loop data
but can be substituted with humans at inference time.
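
The generate-critique-refine interaction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `refine`, `toy_generator`, `toy_critic`, and `max_rounds` are hypothetical, and the toy functions stand in for the finetuned reasoning LM and the trained critic.

```python
from typing import Callable, List, Optional

# Hypothetical type aliases; none of these names come from the paper.
Generator = Callable[[str, List[str]], str]   # (context, feedback so far) -> reasoning
Critic = Callable[[str, str], Optional[str]]  # (context, reasoning) -> feedback, or None


def refine(context: str, generator: Generator, critic: Critic, max_rounds: int = 3) -> str:
    """Alternate generation and critique until the critic raises no objection."""
    feedback_history: List[str] = []
    reasoning = generator(context, feedback_history)
    for _ in range(max_rounds):
        feedback = critic(context, reasoning)
        if feedback is None:  # critic accepts the intermediate steps
            break
        feedback_history.append(feedback)
        reasoning = generator(context, feedback_history)
    return reasoning


# Toy stand-ins (assumptions) so the loop can be run end to end.
def toy_generator(context: str, feedback_history: List[str]) -> str:
    # The first attempt uses the wrong operator; it switches once feedback arrives.
    op = "+" if feedback_history else "-"
    return f"3 {op} 4"


def toy_critic(context: str, reasoning: str) -> Optional[str]:
    # Structured feedback is reduced here to a single fixed message.
    if "+" not in reasoning:
        return "The operator in step 1 is incorrect."
    return None


result = refine("Tom has 3 apples and gets 4 more. How many now?", toy_generator, toy_critic)
print(result)
```

Because the critic is passed in as a plain callable, a human reviewer could in principle be swapped in for `toy_critic` at inference time, which mirrors the substitution the abstract mentions.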
Related papers
- Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
State-of-the-art large language models (LLMs) continue to struggle when performing logical and mathematical reasoning.
This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation.
We show that employing these critical questions can improve the reasoning capabilities of LLMs.
(2024-12-19T18:51:30Z)
- Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning [46.411313304605564]
Critic-V is a framework inspired by the Actor-Critic paradigm to boost the reasoning capability of vision-language models (VLMs).
The Reasoner generates reasoning paths based on visual and textual inputs, and the Critic provides constructive critique to refine these paths.
Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks.
(2024-11-27T10:28:57Z)
- Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness.
We show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses.
(2024-10-21T17:00:06Z)
- Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning [38.60086807496399]
Large language models (LLMs) have been shown to perform better when asked to reason step-by-step before answering a question.
It is unclear to what degree the model's final answer is faithful to the stated reasoning steps.
We introduce FRODO, a framework to tailor small-sized LMs to generate correct reasoning steps and robustly reason over these steps.
(2024-02-21T17:23:59Z)
- SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation [78.23119125463964]
We develop SocREval, a novel approach for prompt design in reference-free reasoning evaluation.
SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics.
(2023-09-29T18:25:46Z)
- Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on the stance detection task.
(2023-08-31T14:31:48Z)
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
(2022-12-15T15:52:39Z)
- Forget-me-not! Contrastive Critics for Mitigating Posterior Collapse [20.258298183228824]
We introduce inference critics that detect and incentivize against posterior collapse by requiring correspondence between latent variables and the observations.
This approach is straightforward to implement and requires significantly less training time than prior methods.
(2022-07-19T20:07:17Z)
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data [52.12342165926226]
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We trained DialogRPT, a set of GPT-2 based models, on 133M pairs of human feedback data.
Our ranker outperforms the conventional dialog perplexity baseline by a large margin on predicting Reddit feedback.
(2020-09-15T10:50:05Z)