REFINER: Reasoning Feedback on Intermediate Representations
- URL: http://arxiv.org/abs/2304.01904v2
- Date: Sun, 4 Feb 2024 12:15:18 GMT
- Title: REFINER: Reasoning Feedback on Intermediate Representations
- Authors: Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings
- Abstract summary: We introduce REFINER, a framework for finetuning language models to generate intermediate inferences.
REFINER works by interacting with a critic model that provides automated feedback on the reasoning.
Empirical evaluations show significant improvements over baseline LMs of comparable scale.
- Score: 47.36251998678097
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) have recently shown remarkable performance on reasoning
tasks by explicitly generating intermediate inferences, e.g., chain-of-thought
prompting. However, these intermediate inference steps may be inappropriate
deductions from the initial context and lead to incorrect final predictions.
Here we introduce REFINER, a framework for finetuning LMs to explicitly
generate intermediate reasoning steps while interacting with a critic model
that provides automated feedback on the reasoning. Specifically, the critic
provides structured feedback that the reasoning LM uses to iteratively improve
its intermediate arguments. Empirical evaluations of REFINER on three diverse
reasoning tasks show significant improvements over baseline LMs of comparable
scale. Furthermore, when using GPT-3.5 or ChatGPT as the reasoner, the trained
critic significantly improves reasoning without finetuning the reasoner.
Finally, our critic model is trained without expensive human-in-the-loop data
but can be substituted with humans at inference time.
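The abstract describes an iterative protocol: the reasoner proposes intermediate steps, the critic returns structured feedback, and the reasoner revises. Below is a minimal sketch of that loop at inference time, assuming hypothetical interfaces; the `Feedback` structure, the callable signatures, and the stopping rule are illustrative, not the paper's actual implementation (REFINER additionally finetunes the reasoner against the critic during training, which this sketch omits).

```python
# Minimal sketch of a REFINER-style critique-and-refine loop at
# inference time. The Feedback structure, callables, and stopping
# rule are illustrative assumptions, not the paper's exact interface.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Feedback:
    ok: bool      # critic accepted the intermediate reasoning
    message: str  # structured hint, e.g. "step 2: wrong operands"

def refine(
    problem: str,
    reasoner: Callable[[str, Optional[str]], str],  # (problem, critique) -> reasoning
    critic: Callable[[str, str], Feedback],         # (problem, reasoning) -> feedback
    max_rounds: int = 3,
) -> str:
    """Generate intermediate reasoning, then revise it against the
    critic's structured feedback until accepted or out of budget."""
    critique: Optional[str] = None
    reasoning = ""
    for _ in range(max_rounds):
        reasoning = reasoner(problem, critique)
        fb = critic(problem, reasoning)
        if fb.ok:
            break
        critique = fb.message  # feed the critique back to the reasoner
    return reasoning
```

Because the critic is just a callable here, the abstract's final point, substituting humans at inference time, corresponds to replacing `critic` with a function that collects human judgments.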
Related papers
- Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness.
We show that VLMs trained on short answers do not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z)
- Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning [38.60086807496399]
Large language models (LLMs) have been shown to perform better when asked to reason step-by-step before answering a question.
However, it is unclear to what degree the model's final answer is faithful to the stated reasoning steps.
We introduce FRODO, a framework to tailor small-sized LMs to generate correct reasoning steps and robustly reason over these steps.
arXiv Detail & Related papers (2024-02-21T17:23:59Z)
- SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation [78.23119125463964]
We develop SocREval, a novel approach for prompt design in reference-free reasoning evaluation.
SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics.
arXiv Detail & Related papers (2023-09-29T18:25:46Z)
- Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs small LMs to assimilate high-quality external knowledge, refining the intermediate rationales they produce.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% improvement over GPT-3.5 with CoT on the stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z)
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z)
- Forget-me-not! Contrastive Critics for Mitigating Posterior Collapse [20.258298183228824]
We introduce inference critics that detect and incentivize against posterior collapse by requiring correspondence between latent variables and the observations.
This approach is straightforward to implement and requires significantly less training time than prior methods.
arXiv Detail & Related papers (2022-07-19T20:07:17Z)
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data [52.12342165926226]
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We trained DialogRPT, a set of GPT-2-based models, on 133M pairs of human feedback data.
Our ranker outperforms the conventional dialog perplexity baseline by a large margin on predicting Reddit feedback; a minimal sketch of this pairwise ranking setup appears after this list.
arXiv Detail & Related papers (2020-09-15T10:50:05Z)
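As referenced in the DialogRPT entry above, one standard way to train a ranker on pairs of human feedback is a Bradley-Terry-style pairwise ranking loss. The sketch below is an illustrative assumption rather than DialogRPT's actual training code: the tiny linear scorer, the 32-dimensional dummy features, and the single optimization step are placeholders, while the real models are GPT-2 based and operate on dialogue text.

```python
# Illustrative pairwise feedback-ranking sketch (not DialogRPT's code).
# Given two candidate responses where one received more human feedback,
# train a scorer to rank the preferred candidate higher.
import torch
import torch.nn as nn

class Scorer(nn.Module):
    """Placeholder scorer; DialogRPT uses GPT-2-based models instead."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar score per candidate

def pairwise_ranking_loss(score_pos: torch.Tensor,
                          score_neg: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry / logistic loss: push the preferred candidate's
    # score above the dispreferred one's.
    return -torch.log(torch.sigmoid(score_pos - score_neg)).mean()

scorer = Scorer()
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)
pos, neg = torch.randn(8, 32), torch.randn(8, 32)  # dummy feature pairs
opt.zero_grad()
loss = pairwise_ranking_loss(scorer(pos), scorer(neg))
loss.backward()
opt.step()
```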