g-DPO: Scalable Preference Optimization for Protein Language Models
- URL: http://arxiv.org/abs/2510.19474v1
- Date: Wed, 22 Oct 2025 11:11:42 GMT
- Title: g-DPO: Scalable Preference Optimization for Protein Language Models
- Authors: Constance Ferragu, Jonathan D. Ziegler, Nicolas Deutschmann, Arthur Lindoulsi, Eli Bixby, Cradle ML Team,
- Abstract summary: g-DPO is a framework that prunes redundant pairs while preserving training signal, and (ii) amortizes likelihood computations with group-based approximations.<n>Across three protein engineering tasks, g-DPO maintains in-silico and in-vitro performance that is statistically indistinguishable from standard DPO, while converging 1.8 to 3.7 times faster.
- Score: 0.9236074230806578
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Direct Preference Optimization (DPO) is an effective approach for aligning protein language models with experimental design goals. However, DPO faces a scalability bottleneck: the number of possible training pairs grows quadratically with the number of labeled sequences, leading to prohibitive training times even for modestly sized datasets. We introduce g-DPO, a framework that (i) uses sequence space clustering to prune redundant pairs while preserving training signal, and (ii) amortizes likelihood computations with group-based approximations. Across three protein engineering tasks, g-DPO maintains in-silico and in-vitro performance that is statistically indistinguishable from standard DPO, while converging 1.8 to 3.7 times faster, with greater gains expected as the size of the dataset increases.
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs.<n>We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation.<n>To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
We show how to apply Group Relative Policy Optimization under multi-reward setting without examining its suitability.<n>We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues.<n>GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z) - Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment [61.80228667422234]
VGPO redefines value estimation across both temporal and group dimensions.<n>It transforms the sparse terminal reward into dense, process-aware value estimates.<n>It replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal.
arXiv Detail & Related papers (2025-12-13T16:31:26Z) - S$^2$Drug: Bridging Protein Sequence and 3D Structure in Contrastive Representation Learning for Virtual Screening [72.89086338778098]
We propose a two-stage framework for protein-ligand contrastive representation learning.<n>In the first stage, we perform protein sequence pretraining on ChemBL using an ESM2-based backbone.<n>In the second stage, we fine-tune on PDBBind by fusing sequence and structure information through a residue-level gating module.<n>This auxiliary task guides the model to accurately localize binding residues within the protein sequence and capture their 3D spatial arrangement.
arXiv Detail & Related papers (2025-11-10T11:57:47Z) - Amortized Active Generation of Pareto Sets [48.56811624922571]
A-GPS is a new framework for online discrete black-box multi-objective optimization.<n>Method employs a class probability estimator to predict non-dominance relations.<n>We show that this non-dominance CPE implicitly estimates the probability of hypervolume improvement.
arXiv Detail & Related papers (2025-10-23T23:49:23Z) - Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models [53.339700196282905]
A key challenge in applying reinforcement learning to large language models (dLLMs) is the intractability of their likelihood functions.<n>We propose a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective.<n> Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks.
arXiv Detail & Related papers (2025-10-13T17:47:50Z) - Reinforcing Diffusion Models by Direct Group Preference Optimization [19.195805549362074]
Group Preference Optimization (DGPO) learns directly from group-level preferences, which utilize relative information of samples within groups.<n>Results show that DGPO trains around 20 times faster than existing state-of-the-art methods and superior performance on both in-domain and out-of-domain metrics reward.
arXiv Detail & Related papers (2025-10-09T16:40:43Z) - Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering [5.568436850698628]
Sem-DPO is a variant of DPO that preserves semantic consistency yet retains its simplicity and efficiency.<n>We show that Sem-DPO keeps learned prompts within a provably bounded neighborhood of the original text.<n>On three standard text-to-image prompt-optimization benchmarks and two language models, Sem-DPO achieves 8-12% higher CLIP similarity and 5-9% higher human-preference scores (HPSv2.1, PickScore) than DPO.
arXiv Detail & Related papers (2025-07-27T05:20:13Z) - Protein Inverse Folding From Structure Feedback [78.27854221882572]
We introduce a novel approach to fine-tune an inverse folding model using feedback from a protein folding model.<n>Our results on the CATH 4.2 test set demonstrate that DPO fine-tuning leads to a significant improvement in average TM-Score.
arXiv Detail & Related papers (2025-06-03T16:02:12Z) - Towards Self-Improvement of Diffusion Models via Group Preference Optimization [10.6096255671291]
Group Preference Optimization (GPO) is an effective self-improvement method that enhances performance without requiring external data.<n>GPO improves the accurate counting and text rendering capabilities of the Stable Diffusion 3.5 Medium by 20 percentage points.<n>As a plug-and-play method, no extra overhead is introduced during inference.
arXiv Detail & Related papers (2025-05-16T10:04:57Z) - Preference-Based Alignment of Discrete Diffusion Models [14.874943508610857]
We introduce Discrete Diffusion DPO (D2-DPO), the first adaptation of Direct Preference Optimization (DPO) to discrete diffusion models formulated as continuous-time Markov chains.<n>Our approach derives a novel loss function that directly fine-tunes the generative process using preference data while preserving fidelity to a reference distribution.<n>Our results highlight that D2-DPO enables controlled fine-tuning without requiring explicit reward models, making it a practical alternative to reinforcement learning-based approaches.
arXiv Detail & Related papers (2025-03-11T11:07:35Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.<n>We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.<n>We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.