SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
- URL: http://arxiv.org/abs/2512.22170v1
- Date: Wed, 17 Dec 2025 14:28:23 GMT
- Title: SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
- Authors: Jiesong Lian, Ruizhe Zhong, Zixiang Zhou, Xiaoyue Mi, Yixue Hao, Yuan Zhou, Qinglin Lu, Long Hu, Junchi Yan,
- Abstract summary: Post-training alignment of video generation models with human preferences is a critical goal.<n>Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise.<n>We propose SoliReward, a systematic framework for video RM training.
- Score: 53.19726629537694
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RM is susceptible to reward hacking in post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM's score distribution for positive samples, providing more nuanced preference signals to alleviate over-focus on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements in direct RM evaluation metrics and in the efficacy of post-training on video generation models. Code and benchmark will be publicly available.
Related papers
- Can Recommender Systems Teach Themselves? A Recursive Self-Improving Framework with Fidelity Control [82.30868101940068]
We propose a paradigm in which a model bootstraps its own performance without reliance on external data or teacher models.<n>Our theoretical analysis shows that RSIR acts as a data-driven implicit regularizer, smoothing the optimization landscape.<n>We show that even smaller models benefit, and weak models can generate effective training curricula for stronger ones.
arXiv Detail & Related papers (2026-02-17T15:31:32Z) - Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling [49.41422138354821]
We propose a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry preference model.<n>BNRM represents rewards through a sparse, non-negative latent factor generative process.<n>We show that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
arXiv Detail & Related papers (2026-02-11T08:14:11Z) - Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation [50.22481337087162]
Referring Video Object (RVOS) aims to segment objects in videos based on textual queries.<n>Refer-Agent is a collaborative multi-agent system with alternating reasoning-reflection mechanisms.
arXiv Detail & Related papers (2026-02-03T14:48:12Z) - OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [38.1645520104553]
We introduce OpenRubrics, a large-scale collection of (prompt,explicit) pairs for training rubric-generation and rubric-based reward models.<n>To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Generation (CRG), which derives both hard rules (implicit qualities) and principles (implicit qualities) by contrasting preferred and rejected responses.<n>Our results show that rubrics provide scalable alignment signals that narrow the gap between costly human evaluation and automated reward modeling.
arXiv Detail & Related papers (2025-10-09T03:31:26Z) - RewardDance: Reward Scaling in Visual Generation [28.934614189005856]
We introduce RewardDance, a scalable reward modeling framework that overcomes barriers through a novel generative reward paradigm.<n>By reformulating the reward score as the model's probability of predicting a "yes" token, RewardDance intrinsically aligns reward objectives with Vision-Language Models.<n>We show that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation.
arXiv Detail & Related papers (2025-09-10T17:59:31Z) - STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning [54.28691219536054]
We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities.<n>We develop anchored reinforcement training - a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping.<n>Experiments on MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-26T08:47:58Z) - Multi-Metric Preference Alignment for Generative Speech Restoration [15.696247605348383]
We propose a multi-metric preference alignment strategy for generative models.<n>We observe consistent and significant performance gains across three diverse generative paradigms.<n>Our aligned models can serve as powerful ''data annotators'', generating high-quality pseudo-labels.
arXiv Detail & Related papers (2025-08-24T07:05:10Z) - One Token to Fool LLM-as-a-Judge [52.45386385722788]
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models.<n>We uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking.
arXiv Detail & Related papers (2025-07-11T17:55:22Z) - Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling [27.11560841914813]
We introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses.<n>We evaluate REFORM on two widely used preference datasets Anthropic Helpful Harmless (HH) and PKU Beavertails.
arXiv Detail & Related papers (2025-07-08T21:56:33Z) - Energy-Based Reward Models for Robust Language Model Alignment [9.843359827321194]
We introduce Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework for Reward Models (RMs)<n>EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations.<n> Empirical evaluations demonstrate significant improvements in robustness and generalization, achieving up to a 5.97% improvement in safety-critical alignment tasks.
arXiv Detail & Related papers (2025-04-17T17:47:15Z) - InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling [66.3072381478251]
Reward hacking, also termed reward overoptimization, remains a critical challenge.
We propose a framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective.
We show that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets.
arXiv Detail & Related papers (2024-02-14T17:49:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.