EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation
- URL: http://arxiv.org/abs/2508.06046v1
- Date: Fri, 08 Aug 2025 06:10:47 GMT
- Title: EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation
- Authors: Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Zhibo Yang, Xingsheng Zhang, Luxi Xing, Qiang Zhou, Chen Zhang
- Abstract summary: We propose a Self-Evolving Pairwise Reasoning (EvolvR) framework for story evaluation. The framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. The evaluator trained on the refined data is deployed as a reward model to guide the story generation task.
- Score: 17.37840331449749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although the effectiveness of Large Language Models (LLMs) as judges (LLM-as-a-judge) has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation. Accurate story evaluation is crucial not only for assisting human quality judgment but also for providing key signals to guide story generation. However, existing methods face a dilemma: prompt engineering for closed-source models suffers from poor adaptability, while fine-tuning approaches for open-source models lack the rigorous reasoning capabilities essential for story evaluation. To address this, we propose the Self-Evolving Pairwise Reasoning (EvolvR) framework. Grounded in pairwise comparison, the framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. To ensure data quality, these raw CoTs undergo a self-filtering process, utilizing multi-agents to guarantee their logical rigor and robustness. Finally, the evaluator trained on the refined data is deployed as a reward model to guide the story generation task. Experimental results demonstrate that our framework achieves state-of-the-art (SOTA) performance on three evaluation benchmarks: StoryER, HANNA, and OpenMEVA. Furthermore, when serving as a reward model, it significantly enhances the quality of generated stories, thereby fully validating the superiority of our self-evolving approach.
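The first two stages of the pipeline described above (multi-persona CoT synthesis, then self-filtering against score-aligned labels) can be sketched roughly as follows. All names and the toy pairwise judge are illustrative assumptions, not the paper's implementation; a real system would call an LLM once per persona.

```python
# Rough, hypothetical sketch of EvolvR's data-synthesis stages; the toy
# judge below stands in for an LLM generating a chain-of-thought verdict.

PERSONAS = ["critic", "editor", "casual_reader"]  # multi-persona strategy

def persona_cot(persona, story_a, story_b):
    """Stub for an LLM producing a pairwise chain-of-thought judgment.
    Toy heuristic: prefer the longer (more detailed) story."""
    verdict = "A" if len(story_a) >= len(story_b) else "B"
    rationale = f"[{persona}] compared the level of detail in both stories"
    return {"persona": persona, "rationale": rationale, "verdict": verdict}

def self_filter(cots, gold_verdict):
    """Self-filtering: keep only CoTs whose verdict agrees with the
    score-aligned gold label (a stand-in for the multi-agent checks)."""
    return [c for c in cots if c["verdict"] == gold_verdict]

def synthesize_pair(story_a, story_b, gold_verdict):
    cots = [persona_cot(p, story_a, story_b) for p in PERSONAS]
    return self_filter(cots, gold_verdict)

# Human scores say story A > story B, so the gold pairwise label is "A".
kept = synthesize_pair("a long, detailed story about a voyage", "short tale", "A")
print(f"{len(kept)} CoTs survive filtering")
```

The surviving CoTs would then serve as training data for the evaluator, which in turn acts as the reward model during generation.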
Related papers
- Learning Ordinal Probabilistic Reward from Preferences [25.069054134899744]
We introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM). Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response. Building on OPRM, we propose a data-efficient training strategy called Region Flooding Tuning (RgFT).
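The core idea above, treating reward as a random variable rather than a scalar, can be sketched with a categorical distribution over ordinal quality levels. The level grid and the example distributions are illustrative assumptions, not the paper's formulation.

```python
import itertools

# Hypothetical sketch: a probabilistic reward is a distribution over
# ordinal quality levels 1..5 instead of a single predicted score.
LEVELS = [1, 2, 3, 4, 5]

def mean_reward(dist):
    """Expected reward under the predicted distribution."""
    return sum(p * l for p, l in zip(dist, LEVELS))

def prob_a_beats_b(dist_a, dist_b):
    """P(R_a > R_b), treating the two rewards as independent."""
    return sum(pa * pb for (pa, la), (pb, lb)
               in itertools.product(zip(dist_a, LEVELS), zip(dist_b, LEVELS))
               if la > lb)

# Toy model outputs for two responses: A skews high, B skews low.
dist_a = [0.05, 0.10, 0.20, 0.40, 0.25]
dist_b = [0.30, 0.30, 0.20, 0.15, 0.05]
print(mean_reward(dist_a), mean_reward(dist_b), prob_a_beats_b(dist_a, dist_b))
```

A distributional reward exposes not just a point estimate but also the model's confidence in a pairwise preference, which a scalar reward cannot.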
arXiv Detail & Related papers (2026-02-13T06:43:02Z) - CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria [48.70940362676624]
We propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method. Using only about 5.7K high-quality data curated from the open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks.
arXiv Detail & Related papers (2026-01-28T07:46:13Z) - One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning [54.580646706013965]
Reward models (RMs) play a critical role in aligning large language models with human preferences. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling.
arXiv Detail & Related papers (2025-10-30T06:08:27Z) - OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning [41.49024599460379]
Reward models (RMs) have become essential for aligning large language models (LLMs). We introduce OpenRM, a tool-augmented long-form reward model that judges open-ended responses by invoking external tools to gather relevant evidence. Experiments on three newly-collected datasets and two widely-used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches.
arXiv Detail & Related papers (2025-10-28T17:02:46Z) - A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs [0.0]
This paper presents a framework for benchmarking Large Language Models (LLMs) on the task of classifying complex industrial records. To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool. We quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores.
arXiv Detail & Related papers (2025-09-08T15:48:17Z) - RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z) - FitCF: A Framework for Automatic Feature Importance-guided Counterfactual Example Generation [11.238548725286122]
We introduce ZeroCF, a faithful approach for leveraging important words derived from feature attribution methods to generate counterfactual examples. Second, we present a new framework, FitCF, which further verifies the aforementioned counterfactuals by label-flip verification and then inserts them as demonstrations. We show the effectiveness of LIME and Integrated Gradients as backbone attribution methods for FitCF and find that the number of demonstrations has the largest effect on performance.
arXiv Detail & Related papers (2025-01-01T09:00:10Z) - Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training. We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
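The verifier-based self-improvement described above reduces, in its simplest form, to best-of-N selection: sample several candidates and keep the one the model-as-verifier scores highest. The sketch below is illustrative; both stubs stand in for calls to the same underlying model.

```python
# Hypothetical sketch of "sharpening" as best-of-N selection; the generator
# and verifier stubs are toy stand-ins for two uses of the same model.

def generate(prompt, n):
    """Stub generator: produce n candidate responses."""
    return [f"{prompt}-cand{i}" for i in range(n)]

def verify(response):
    """Stub verifier: score a response. Toy rule: later samples score
    higher, mimicking a verifier that can rank candidates it generated."""
    return int(response.rsplit("cand", 1)[1])

def sharpen(prompt, n=4):
    """Keep the candidate the verifier prefers."""
    candidates = generate(prompt, n)
    return max(candidates, key=verify)

print(sharpen("write a story"))
```

The observation that verification is easier than generation is what makes this loop improve the model: the verifier's preferences, distilled back via SFT or RLHF, sharpen the generator's output distribution.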
arXiv Detail & Related papers (2024-12-02T20:24:17Z) - Language Model Preference Evaluation with Multiple Weak Evaluators [78.53743237977677]
GED (Preference Graph Ensemble and Denoise) is a novel approach that leverages multiple model-based evaluators to construct preference graphs. In particular, our method consists of two primary stages: aggregating evaluations into a unified graph and applying a denoising process. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure.
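The two stages named above can be sketched as follows: pool pairwise judgments from several weak evaluators into one weighted graph, then denoise by keeping only the majority direction per pair and ranking by wins. The majority-vote denoiser is a simple illustrative stand-in, not the paper's actual procedure.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sketch of preference-graph ensembling and denoising.

def aggregate(judgments):
    """Stage 1: pool (winner, loser) votes from all evaluators into
    weighted directed edges."""
    votes = defaultdict(int)
    for winner, loser in judgments:
        votes[(winner, loser)] += 1
    return votes

def denoise_and_rank(votes, items):
    """Stage 2 (toy denoiser): keep only the majority direction for each
    pair, then rank items by their number of majority wins."""
    wins = defaultdict(int)
    for a, b in combinations(items, 2):
        if votes[(a, b)] > votes[(b, a)]:
            wins[a] += 1
        elif votes[(b, a)] > votes[(a, b)]:
            wins[b] += 1
    return sorted(items, key=lambda x: -wins[x])

# Three weak evaluators judging models A, B, C; each pair has one noisy vote.
judgments = [("A", "B"), ("A", "B"), ("B", "A"),
             ("A", "C"), ("A", "C"), ("C", "A"),
             ("B", "C"), ("B", "C"), ("C", "B")]
print(denoise_and_rank(aggregate(judgments), ["A", "B", "C"]))
```

Even though every individual evaluator makes a mistake here, the majority edges recover the consistent ordering A > B > C.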
arXiv Detail & Related papers (2024-10-14T01:57:25Z) - Improving Retrieval Augmented Language Model with Self-Reasoning [20.715106330314605]
We propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs. The framework involves constructing self-reason trajectories with three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process. We have evaluated our framework across four public datasets to demonstrate the superiority of our method.
arXiv Detail & Related papers (2024-07-29T09:05:10Z) - Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.