Related papers: Preference Optimization for Combinatorial Optimization Problems

Preference Optimization for Combinatorial Optimization Problems

URL: http://arxiv.org/abs/2505.08735v1
Date: Tue, 13 May 2025 16:47:00 GMT
Title: Preference Optimization for Combinatorial Optimization Problems
Authors: Mingjun Pan, Guanquan Lin, You-Wei Luo, Bin Zhu, Zhien Dai, Lijun Sun, Chun Yuan,
Abstract summary: Reinforcement Learning (RL) has emerged as a powerful tool for neural optimization, enabling models learns that solve complex problems without requiring expert knowledge.<n>Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast action spaces.<n>We propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling.
Score: 54.87466279363487
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn heuristics that solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast combinatorial action spaces, leading to inefficiency. In this paper, we propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling, emphasizing the superiority among sampled solutions. Methodologically, by reparameterizing the reward function in terms of policy and utilizing preference models, we formulate an entropy-regularized RL objective that aligns the policy directly with preferences while avoiding intractable computations. Furthermore, we integrate local search techniques into the fine-tuning rather than post-processing to generate high-quality preference pairs, helping the policy escape local optima. Empirical results on various benchmarks, such as the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP) and the Flexible Flow Shop Problem (FFSP), demonstrate that our method significantly outperforms existing RL algorithms, achieving superior convergence efficiency and solution quality.

Related papers

VAGPO: Vision-augmented Asymmetric Group Preference Optimization for the Routing Problems [2.150410718150006]
We propose a novel Vision-Augmented Asymmetric Group Preference Optimization (VAGPO) approach for solving the routing problems.<n>By leveraging ResNet-based visual encoding and Transformer-based sequential modeling, VAGPO captures both spatial structure and temporal dependencies.<n> Experimental results show that the proposed VAGPO not only achieves highly competitive solution quality but also exhibits strong generalization to larger instances without re-training.
arXiv Detail & Related papers (2025-08-03T14:19:12Z)
RL-finetuning LLMs from on- and off-policy data with a single algorithm [53.70731390624718]
We introduce a novel reinforcement learning algorithm (AGRO) for fine-tuning large-language models.<n>AGRO leverages the concept of generation consistency, which states that the optimal policy satisfies the notion of consistency across any possible generation of the model.<n>We derive algorithms that find optimal solutions via the sample-based policy gradient and provide theoretical guarantees on their convergence.
arXiv Detail & Related papers (2025-03-25T12:52:38Z)
BOPO: Neural Combinatorial Optimization via Best-anchored and Objective-guided Preference Optimization [17.694852175354555]
Preference Optimization for Combinatorial Optimization (POCO) is a training paradigm that leverages solution preferences via objective values.<n>POCO is architecture-agnostic, enabling integration with existing NCO models, and establishes preference optimization as a principled framework for optimization.
arXiv Detail & Related papers (2025-03-10T17:45:30Z)
Graph-attention-based Casual Discovery with Trust Region-navigated Clipping Policy Optimization [13.75709067982844]
We propose a trust region-navigated clipping policy optimization method for causal discovery.<n>We also propose a refined graph attention encoder called SDGAT to boost the efficient encoding of variables.<n>With these improvements, the proposed method outperforms former RL method in both synthetic and benchmark datasets.
arXiv Detail & Related papers (2024-12-27T10:50:43Z)
Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.<n>The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.<n>We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.<n>To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.<n>Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, textsfPARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning. Our framework addressed these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable. Our empirical results substantiate that the proposed textsfPARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z)
Learning Adaptive Evolutionary Computation for Solving Multi-Objective Optimization Problems [3.3266268089678257]
This paper proposes a framework that integrates MOEAs with adaptive parameter control using Deep Reinforcement Learning (DRL) The DRL policy is trained to adaptively set the values that dictate the intensity and probability of mutation for solutions during optimization. We show the learned policy is transferable, i.e., the policy trained on a simple benchmark problem can be directly applied to solve the complex warehouse optimization problem.
arXiv Detail & Related papers (2022-11-01T22:08:34Z)
Online Control of Adaptive Large Neighborhood Search using Deep Reinforcement Learning [4.374837991804085]
We introduce a Deep Reinforcement Learning based approach called DR-ALNS that selects operators, adjusts parameters, and controls the acceptance criterion throughout the search. We evaluate the proposed method on a problem with orienteering weights and time windows, as presented in an IJCAI competition. The results show that our approach outperforms vanilla ALNS, ALNS tuned with Bayesian optimization, and two state-of-the-art DRL approaches.
arXiv Detail & Related papers (2022-11-01T21:33:46Z)
Sample-Efficient, Exploration-Based Policy Optimisation for Routing Problems [2.6782615615913348]
This paper presents a new reinforcement learning approach that is based on entropy. In addition, we design an off-policy-based reinforcement learning technique that maximises the expected return. We show that our model can generalise to various route problems.
arXiv Detail & Related papers (2022-05-31T09:51:48Z)
Multi-Agent Deep Reinforcement Learning in Vehicular OCC [14.685237010856953]
We introduce a spectral efficiency optimization approach in vehicular OCC. We model the optimization problem as a Markov decision process (MDP) to enable the use of solutions that can be applied online. We verify the performance of our proposed scheme through extensive simulations and compare it with various variants of our approach and a random method.
arXiv Detail & Related papers (2022-05-05T14:25:54Z)
Automatically Learning Compact Quality-aware Surrogates for Optimization Problems [55.94450542785096]
Solving optimization problems with unknown parameters requires learning a predictive model to predict the values of the unknown parameters and then solving the problem using these values. Recent work has shown that including the optimization problem as a layer in a complex training model pipeline results in predictions of iteration of unobserved decision making. We show that we can improve solution quality by learning a low-dimensional surrogate model of a large optimization problem.
arXiv Detail & Related papers (2020-06-18T19:11:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.