Constraint-Aware Generative Auto-bidding via Pareto-Prioritized Regret Optimization
- URL: http://arxiv.org/abs/2602.08261v1
- Date: Mon, 09 Feb 2026 04:41:30 GMT
- Title: Constraint-Aware Generative Auto-bidding via Pareto-Prioritized Regret Optimization
- Authors: Binglin Wu, Yingyi Zhang, Xianneng Li, Ruyue Deng, Chuan Yue, Weiru Zhang, Xiaoyi Zeng
- Abstract summary: PRO-Bid is a constraint-aware generative auto-bidding framework based on two synergistic mechanisms. It achieves superior constraint satisfaction and value acquisition compared to state-of-the-art baselines.
- Score: 8.514099612407062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Auto-bidding systems aim to maximize marketing value while satisfying strict efficiency constraints such as Target Cost-Per-Action (CPA). Although Decision Transformers provide powerful sequence modeling capabilities, applying them to this constrained setting encounters two challenges: 1) standard Return-to-Go conditioning causes state aliasing by neglecting the cost dimension, preventing precise resource pacing; and 2) standard regression forces the policy to mimic average historical behaviors, thereby limiting the capacity to optimize performance toward the constraint boundary. To address these challenges, we propose PRO-Bid, a constraint-aware generative auto-bidding framework based on two synergistic mechanisms: 1) Constraint-Decoupled Pareto Representation (CDPR) decomposes global constraints into recursive cost and value contexts to restore resource perception, while reweighting trajectories based on the Pareto frontier to focus on high-efficiency data; and 2) Counterfactual Regret Optimization (CRO) facilitates active improvement by utilizing a global outcome predictor to identify superior counterfactual actions. By treating these high-utility outcomes as weighted regression targets, the model transcends historical averages to approach the optimal constraint boundary. Extensive experiments on two public benchmarks and online A/B tests demonstrate that PRO-Bid achieves superior constraint satisfaction and value acquisition compared to state-of-the-art baselines.
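The two mechanisms described in the abstract can be sketched in a few lines. The following is an illustrative reconstruction only, not the paper's implementation: all function names, the dominance rule, and the weighting hyper-parameters are assumptions. It shows decoupled value-to-go/cost-to-go conditioning (CDPR's replacement for a single return-to-go), Pareto-frontier trajectory reweighting, and regret-based regression weights in the spirit of CRO.

```python
import numpy as np

def decoupled_contexts(rewards, costs):
    """CDPR-style conditioning (illustrative): replace a single
    return-to-go with separate value-to-go and cost-to-go sequences,
    so the model perceives the remaining budget at every step."""
    rewards = np.asarray(rewards, dtype=float)
    costs = np.asarray(costs, dtype=float)
    value_to_go = np.cumsum(rewards[::-1])[::-1]
    cost_to_go = np.cumsum(costs[::-1])[::-1]
    return value_to_go, cost_to_go

def pareto_weights(values, costs, w_front=2.0, w_other=1.0):
    """Upweight trajectories on the value/cost Pareto frontier: a
    trajectory gets the frontier weight unless some other trajectory
    has strictly higher value AND strictly lower cost."""
    values = np.asarray(values, dtype=float)
    costs = np.asarray(costs, dtype=float)
    weights = np.full(len(values), w_other)
    for i in range(len(values)):
        dominated = np.any((values > values[i]) & (costs < costs[i]))
        if not dominated:
            weights[i] = w_front
    return weights

def cro_weights(utility_taken, utility_counterfactual, beta=1.0):
    """CRO-style regression weights (illustrative): counterfactual
    actions whose predicted utility exceeds the logged action's get
    exponentially larger weight, pulling the policy beyond the
    historical average toward the constraint boundary."""
    regret = np.maximum(0.0,
                        np.asarray(utility_counterfactual, dtype=float)
                        - np.asarray(utility_taken, dtype=float))
    return np.exp(beta * regret)
```

In this sketch the three pieces compose naturally: trajectories are reweighted by `pareto_weights`, each timestep is conditioned on the `decoupled_contexts` pair, and the regression loss is scaled by `cro_weights` from a utility predictor.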
Related papers
- Constrained Group Relative Policy Optimization [18.3888203751956]
We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. We show that a naive multi-component treatment in advantage estimation can break constrained learning. We also evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success.
arXiv Detail & Related papers (2026-02-05T16:44:23Z)
- GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL [21.30558932544297]
Offline Safe Reinforcement Learning (OSRL) aims to learn a policy to achieve high performance in decision-making while satisfying constraints. Recent works, inspired by the strong capabilities of Generative Models (GMs), reformulate decision-making in OSRL as a conditional generative process. We propose Goal-Assisted Stitching (GAS), a novel algorithm designed to enhance stitching capabilities while effectively balancing reward and constraint satisfaction.
arXiv Detail & Related papers (2026-02-05T05:44:48Z)
- Automatic Constraint Policy Optimization based on Continuous Constraint Interpolation Framework for Offline Reinforcement Learning [2.0719232729184145]
Offline Reinforcement Learning (RL) relies on policy constraints to shape performance. Most existing methods commit to a single constraint family. We propose Continuous Constraint Interpolation (CCI), a unified optimization framework.
arXiv Detail & Related papers (2026-01-30T14:21:41Z)
- C2: Cross learning module enhanced decision transformer with Constraint-aware loss for auto-bidding [9.446373834962895]
Decision Transformer (DT) shows promise for generative auto-bidding by capturing temporal dependencies. DT suffers from insufficient cross-correlation modeling among state, action, and return-to-go sequences. We propose C2, a novel framework enhancing DT with two core innovations.
arXiv Detail & Related papers (2026-01-28T05:08:02Z)
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
- TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making [75.29820290660065]
This paper proposes Thought-Centric Preference Optimization (TCPO) for effective embodied decision-making. It emphasizes the alignment of the model's intermediate reasoning process, mitigating the problem of model degradation. Experiments in the ALFWorld environment demonstrate an average success rate of 26.67%, achieving a 6% improvement over RL4VLM.
arXiv Detail & Related papers (2025-09-10T11:16:21Z)
- Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality [53.525547349715595]
We propose a novel primal-only algorithm called Rectified Robust Policy Optimization (RRPO). RRPO operates directly on the primal problem without relying on dual formulations. We show convergence to an approximately optimal feasible policy with complexity matching the best-known lower bound.
arXiv Detail & Related papers (2025-08-24T16:59:38Z)
- A Deep Generative Learning Approach for Two-stage Adaptive Robust Optimization [3.124884279860061]
We introduce AGRO, a solution algorithm that performs adversarial generation for two-stage adaptive robust optimization. AGRO generates high-dimensional contingencies that are simultaneously adversarial and realistic. We show that AGRO outperforms the standard column-and-constraint algorithm by up to 1.8% in production-distribution planning and up to 11.6% in power system expansion.
arXiv Detail & Related papers (2024-09-05T17:42:19Z)
- Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning [132.7040981721302]
We study the Constrained Convex Markov Decision Process (MDP), where the goal is to minimize a convex functional of the visitation measure.
Designing algorithms for a constrained convex MDP faces several challenges, including handling the large state space.
arXiv Detail & Related papers (2024-02-16T16:35:18Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Algorithm for Constrained Markov Decision Process with Linear Convergence [55.41644538483948]
An agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its costs.
A new dual approach is proposed with the integration of two ingredients: entropy regularized policy and Vaidya's dual.
The proposed approach is shown to converge (with linear rate) to the global optimum.
arXiv Detail & Related papers (2022-06-03T16:26:38Z)
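Several of the related papers above (Constrained GRPO, Double Duality, the entropy-regularized CMDP dual approach) rely on Lagrangian relaxation to handle constraints: "maximize reward subject to cost ≤ budget" becomes the unconstrained objective reward − λ·(cost − budget), with the multiplier λ raised whenever the constraint is violated. A minimal sketch on a toy scalar problem follows; the function name, learning rates, and finite-difference primal step are my own assumptions, purely illustrative of the general technique rather than any specific paper's algorithm.

```python
def lagrangian_dual_ascent(reward_fn, cost_fn, budget, theta0,
                           lr_theta=0.1, lr_lam=0.1, steps=500):
    """Alternate primal ascent on the Lagrangian with dual ascent on
    the multiplier, for a scalar decision variable theta."""
    theta, lam = float(theta0), 0.0
    eps = 1e-4  # finite-difference step for the toy primal gradient
    for _ in range(steps):
        # Primal step: ascend L(theta) = reward - lam * (cost - budget).
        def lagrangian(t):
            return reward_fn(t) - lam * (cost_fn(t) - budget)
        grad = (lagrangian(theta + eps) - lagrangian(theta - eps)) / (2 * eps)
        theta += lr_theta * grad
        # Dual step: raise lambda when the constraint is violated,
        # projected back onto lambda >= 0.
        lam = max(0.0, lam + lr_lam * (cost_fn(theta) - budget))
    return theta, lam
```

On the toy problem "maximize theta subject to theta² ≤ 1", the iterates spiral into the saddle point theta ≈ 1, λ ≈ 0.5, matching the KKT conditions.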
This list is automatically generated from the titles and abstracts of the papers in this site.