Process-based Self-Rewarding Language Models
- URL: http://arxiv.org/abs/2503.03746v1
- Date: Wed, 05 Mar 2025 18:58:44 GMT
- Title: Process-based Self-Rewarding Language Models
- Authors: Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, Yeyun Gong,
- Abstract summary: Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios.<n>Human-annotated preference data is used for training to further improve LLMs' performance, which is constrained by the upper limit of human performance.<n>We propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization.
- Score: 47.119444722849025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios. Human-annotated preference data is used for training to further improve LLMs' performance, which is constrained by the upper limit of human performance. Therefore, Self-Rewarding method has been proposed, where LLMs generate training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even lead to a decline in performance. In this work, we propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization within the self-rewarding paradigm. Our new paradigm successfully enhances the performance of LLMs on multiple mathematical reasoning benchmarks through iterative Process-based Self-Rewarding, demonstrating the immense potential of self-rewarding to achieve LLM reasoning that may surpass human capabilities.
Related papers
- Efficient Model Selection for Time Series Forecasting via LLMs [52.31535714387368]
We propose to leverage Large Language Models (LLMs) as a lightweight alternative for model selection.
Our method eliminates the need for explicit performance matrices by utilizing the inherent knowledge and reasoning capabilities of LLMs.
arXiv Detail & Related papers (2025-04-02T20:33:27Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - SLMRec: Distilling Large Language Models into Small for Sequential Recommendation [38.51895517016953]
Sequential Recommendation task involves predicting the next item a user is likely to interact with, given their past interactions.<n>Recent research demonstrates the great impact of LLMs on sequential recommendation systems.<n>Due to the huge size of LLMs, it is inefficient and impractical to apply a LLM-based model in real-world platforms.
arXiv Detail & Related papers (2024-05-28T07:12:06Z) - Can formal argumentative reasoning enhance LLMs performances? [0.3659498819753633]
We present a pipeline (MQArgEng) to evaluate the effect of introducing computational argumentation semantics on the performance of Large Language Models (LLMs)
Exploratory results indicate that MQArgEng provides a moderate performance gain in most of the examined topical categories and, as such, show promise and warrant further research.
arXiv Detail & Related papers (2024-05-16T22:09:31Z) - Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models [0.8133739801185272]
The alignments of reasoning abilities between smaller and larger Language Models are largely conducted via Supervised Fine-Tuning (SFT)
We propose the Self-refine Instruction-tuning method that elicits Smaller Language Models to self-refine their abilities.
Results obtained on commonsense and math reasoning tasks show that this approach significantly outperforms Instruction-tuning in both in-domain and out-domain scenarios.
arXiv Detail & Related papers (2024-05-01T09:10:27Z) - Are Large Language Models Good Prompt Optimizers? [65.48910201816223]
We conduct a study to uncover the actual mechanism of LLM-based Prompt Optimization.
Our findings reveal that the LLMs struggle to identify the true causes of errors during reflection, tending to be biased by their own prior knowledge.
We introduce a new "Automatic Behavior Optimization" paradigm, which directly optimize the target model's behavior in a more controllable manner.
arXiv Detail & Related papers (2024-02-03T09:48:54Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - SELF: Self-Evolution with Language Feedback [68.6673019284853]
'SELF' (Self-Evolution with Language Feedback) is a novel approach to advance large language models.
It enables LLMs to self-improve through self-reflection, akin to human learning processes.
Our experiments in mathematics and general tasks demonstrate that SELF can enhance the capabilities of LLMs without human intervention.
arXiv Detail & Related papers (2023-10-01T00:52:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.