Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks
- URL: http://arxiv.org/abs/2503.04378v1
- Date: Thu, 06 Mar 2025 12:30:24 GMT
- Title: Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks
- Authors: Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Daniel Egert, Ellie Evans, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev
- Abstract summary: Inference-Time Scaling has been critical to the success of recent models such as OpenAI o1 and DeepSeek R1. We take inspiration from how humans make first attempts, ask for detailed feedback from others, and make improvements based on such feedback. We show that performance on Arena Hard, a benchmark strongly predictive of Chatbot Arena Elo, can be boosted by scaling the number of initial response drafts, effective feedback, and edited responses.
- Score: 7.686622572497795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inference-Time Scaling has been critical to the success of recent models such as OpenAI o1 and DeepSeek R1. However, many techniques used to train models for inference-time scaling require tasks to have verifiable answers, limiting their application to domains such as math, coding, and logical reasoning. We take inspiration from how humans make first attempts, ask for detailed feedback from others, and make improvements based on such feedback across a wide spectrum of open-ended endeavors. To this end, we collect data for and train dedicated Feedback and Edit Models that are capable of performing inference-time scaling for open-ended general-domain tasks. In our setup, one model generates an initial response, a second model provides feedback on it, and a third model uses that feedback to edit the response. We show that performance on Arena Hard, a benchmark strongly predictive of Chatbot Arena Elo, can be boosted by scaling the number of initial response drafts, effective feedback, and edited responses. When scaled optimally, our setup based on 70B models from the Llama 3 family reaches SoTA performance on Arena Hard at 92.7 as of 5 Mar 2025, surpassing OpenAI o1-preview-2024-09-12 (90.4) and DeepSeek R1 (92.3).
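The abstract describes a three-stage draft, feedback, and edit pipeline whose compute can be scaled along each stage. Below is a minimal sketch of how such a loop might be orchestrated; the wrapper callables (generate_draft, give_feedback, apply_edit) and the final score-based selection step are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a draft -> feedback -> edit inference-time scaling loop.
# The three model wrappers and the scoring function are hypothetical
# placeholders standing in for the paper's generator, Feedback, and Edit models.

from typing import Callable, List

def feedback_edit_scaling(
    prompt: str,
    generate_draft: Callable[[str], str],        # initial-response model
    give_feedback: Callable[[str, str], str],    # feedback model: (prompt, draft) -> feedback
    apply_edit: Callable[[str, str, str], str],  # edit model: (prompt, draft, feedback) -> edited response
    score: Callable[[str, str], float],          # selector (e.g. a reward model) for the final pick
    num_drafts: int = 4,
    num_feedbacks: int = 2,
    num_edits: int = 2,
) -> str:
    """Scale compute by sampling multiple drafts, feedbacks, and edits,
    then return the highest-scoring edited response."""
    candidates: List[str] = []
    for _ in range(num_drafts):
        draft = generate_draft(prompt)
        for _ in range(num_feedbacks):
            feedback = give_feedback(prompt, draft)
            for _ in range(num_edits):
                candidates.append(apply_edit(prompt, draft, feedback))
    # Pick the best edited response according to the scoring function.
    return max(candidates, key=lambda response: score(prompt, response))
```

Increasing num_drafts, num_feedbacks, or num_edits corresponds to the scaling axes the abstract reports as boosting Arena Hard performance.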
Related papers
- Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking [16.441081996257576]
We propose a simple yet effective test-time scaling approach, Multi-round Thinking.
This method iteratively refines model reasoning by leveraging previous answers as prompts for subsequent rounds.
Experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements.
arXiv Detail & Related papers (2025-03-25T17:19:38Z) - R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [70.77691645678804]
We present the first successful replication of emergent characteristics for multimodal reasoning on only a non-SFT 2B model.
Our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by approximately 2%.
In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models.
arXiv Detail & Related papers (2025-03-07T04:21:47Z) - Rank1: Test-Time Compute for Reranking in Information Retrieval [45.356614696154075]
Rank1 is the first reranking model trained to take advantage of test-time compute.
We gather and open-source a dataset of more than 600,000 examples of R1 reasoning traces from queries and passages in MS MARCO.
arXiv Detail & Related papers (2025-02-25T18:14:06Z) - s1: Simple test-time scaling [148.4204982041058]
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance.
We seek the simplest approach to achieve test-time scaling and strong reasoning performance.
arXiv Detail & Related papers (2025-01-31T18:48:08Z) - Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.
Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling.
We present T1 to scale reinforcement learning by encouraging exploration and to improve the understanding of inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z) - Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z) - When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
In order to achieve a better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to achieve accuracy as good as models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z) - Deep Feedback Inverse Problem Solver [141.26041463617963]
We present an efficient, effective, and generic approach towards solving inverse problems.
We leverage the feedback signal provided by the forward process and learn an iterative update model.
Our approach does not have any restrictions on the forward process; it does not require any prior knowledge either.
arXiv Detail & Related papers (2021-01-19T16:49:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.