Related papers: Post-edits Are Preferences Too

URL: http://arxiv.org/abs/2410.02320v2
Date: Tue, 8 Oct 2024 08:09:36 GMT
Title: Post-edits Are Preferences Too
Authors: Nathaniel Berger, Stefan Riezler, Miriam Exel, Matthias Huck
Abstract summary: For machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine whether post-edits can serve as a source of reliable human preferences by construction.
Score: 11.351365352611658
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Preference Optimization (PO) techniques are currently among the state-of-the-art techniques for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreutzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences $s_1$ and $s_2$ and asked for a preference judgment, $s_1 > s_2$; in post-editing, editors create $s_1$ and know that it should be better than $s_2$. We attempt to use these implicit preferences for PO and show that doing so helps the model move towards post-edit-like hypotheses and away from machine translation-like hypotheses. Furthermore, we show that the best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit-like hypotheses to the top output ranks.
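As a rough illustration of the idea in the abstract, the sketch below turns post-editing records into implicit preference pairs (post-edit preferred over raw MT output) and scores one pair with a DPO-style loss. The abstract does not name a specific PO objective, so the DPO form here is an assumption. The names `PreferencePair`, `pairs_from_post_edits`, and `dpo_loss`, the `beta` value, and the choice to skip unedited outputs are likewise hypothetical and not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    source: str
    chosen: str    # human post-edit, i.e. s_1
    rejected: str  # original machine translation, i.e. s_2


def pairs_from_post_edits(records):
    """Build implicit preference pairs from (source, mt, post_edit) triples.

    Assumption: records where the editor left the MT output unchanged carry
    no preference signal, so they are skipped.
    """
    return [PreferencePair(src, pe, mt) for src, mt, pe in records if pe != mt]


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss for one pair, given sequence log-probabilities.

    pol_* come from the model being trained, ref_* from a frozen reference
    model. Returns -log(sigmoid(margin)) in a numerically stable form.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


records = [
    ("Das ist gut.", "This is well.", "This is good."),
    ("Hallo.", "Hello.", "Hello."),  # unedited, so it yields no pair
]
pairs = pairs_from_post_edits(records)
```

The loss falls below log 2 once the trained model favours the post-edit over the MT output more strongly than the reference model does, which matches the "move towards post-edit-like hypotheses" behaviour the abstract describes.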

Related papers

Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences? [20.004349891563706]
After pre-training, large language models are aligned with human preferences based on pairwise comparisons. We introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy.
arXiv Detail & Related papers (2025-05-29T17:59:20Z)
Improving LLM General Preference Alignment via Optimistic Online Mirror Descent [57.622821649679786]
Reinforcement learning from human feedback (RLHF) has demonstrated remarkable effectiveness in aligning large language models (LLMs) with human preferences. In this paper, we drop the Bradley-Terry (BT) model assumption and study LLM alignment under general preferences, formulated as a two-player game. We show that our approach achieves an $O(T^{-1})$ bound on the duality gap, improving upon the previous $O(T^{-1/2})$ result.
arXiv Detail & Related papers (2025-02-24T05:24:52Z)
Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models. CaPO incorporates the general preference from multiple reward models without human annotated data. Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z)
SWEPO: Simultaneous Weighted Preference Optimization for Group Contrastive Alignment [16.230186347702737]
We propose Simultaneous Weighted Preference Optimization (SWEPO). SWEPO incorporates multiple responses per query and prioritizes those that deviate most from the average reward. We prove that such multi-preference sampling lowers alignment bias, bounding the expected deviation from the true acceptable-response distribution at a rate of $\mathcal{O}(\tfrac{1}{\sqrt{k}})$.
arXiv Detail & Related papers (2024-12-05T21:50:22Z)
VPO: Leveraging the Number of Votes in Preference Optimization [5.200545764106177]
We introduce a technique that leverages user voting data to better align with diverse subjective preferences. We develop the Vote-based Preference Optimization framework, which incorporates the number of votes on both sides to distinguish between controversial and obvious generation pairs.
arXiv Detail & Related papers (2024-10-30T10:39:34Z)
$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [91.43730624072226]
$f$-PO is a novel framework that generalizes and extends existing approaches. We conduct experiments on state-of-the-art language models using benchmark datasets.
arXiv Detail & Related papers (2024-10-29T02:11:45Z)
ComPO: Community Preferences for Language Model Personalization [122.54846260663922]
ComPO is a method to personalize preference optimization in language models. We collect and release ComPRed, a question answering dataset with community-level preferences from Reddit.
arXiv Detail & Related papers (2024-10-21T14:02:40Z)
Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both [6.102274021710727]
Direct Reward Distillation and policy-Optimization (DRDO) is a supervised knowledge distillation-based preference alignment method. DRDO directly mimics rewards assigned by an oracle while learning human preferences from a novel preference likelihood formulation. Our experimental results on the Ultrafeedback and TL;DR datasets demonstrate that policies trained using DRDO surpass previous methods.
arXiv Detail & Related papers (2024-10-11T02:19:11Z)
General Preference Modeling with Preference Representations for Aligning Language Models [51.14207112118503]
We introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently. We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z)
Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations. DPO is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. We analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences. We propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs. We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
Direct Preference Optimization with an Offset [58.7977683502207]
Direct preference optimization (DPO) is a successful strategy for aligning large language models with human preferences. We propose a generalization of DPO, termed DPO with an offset (ODPO), that does not treat every preference pair equally during fine-tuning.
arXiv Detail & Related papers (2024-02-16T10:55:38Z)
Efficient Machine Translation Corpus Generation [3.441021278275805]
The method is based on online training of a custom MT quality estimation metric on the fly as linguists perform post-edits. The online estimator is used to prioritize worse hypotheses for post-editing and to auto-close the best hypotheses without post-editing.
arXiv Detail & Related papers (2023-06-20T18:46:47Z)
PePe: Personalized Post-editing Model utilizing User-generated Post-edits [28.749742163017544]
We introduce a personalized automatic post-editing framework to address this challenge. We first collect post-editing data that connotes the user preference from a live machine translation system. We then propose a model that combines a discriminator module and user-specific parameters on the APE framework.
arXiv Detail & Related papers (2022-09-21T06:09:58Z)
