Direct Profit Estimation Using Uplift Modeling under Clustered Network Interference
- URL: http://arxiv.org/abs/2509.01558v1
- Date: Mon, 01 Sep 2025 15:38:13 GMT
- Title: Direct Profit Estimation Using Uplift Modeling under Clustered Network Interference
- Authors: Bram van den Akker
- Abstract summary: Uplift modeling is a key technique for promotion optimization in recommender systems. Recent developments in interference-aware estimators such as Additive Inverse Propensity Weighting have not found their way into the uplift modeling literature yet.
- Score: 0.33842793760651557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Uplift modeling is a key technique for promotion optimization in recommender systems, but standard methods typically fail to account for interference, where treating one item affects the outcomes of others. This violation of the Stable Unit Treatment Value Assumption (SUTVA) leads to suboptimal policies in real-world marketplaces. Recent developments in interference-aware estimators such as Additive Inverse Propensity Weighting (AddIPW) have not yet found their way into the uplift modeling literature, and optimizing policies using these estimators is not well-established. This paper proposes a practical methodology to bridge this gap. We use the AddIPW estimator as a differentiable learning objective suitable for gradient-based optimization. We demonstrate how this framework can be integrated with proven response transformation techniques to directly optimize for economic outcomes like incremental profit. Through simulations, we show that our approach significantly outperforms interference-naive methods, especially as interference effects grow. Furthermore, we find that adapting profit-centric uplift strategies within our framework can yield superior performance in identifying the highest-impact interventions, offering a practical path toward more profitable incentive personalization.
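To make the optimization idea concrete, here is a minimal sketch of how an AddIPW-style estimator could be used as a differentiable, profit-centric learning objective, as the abstract describes. The additive per-cluster weighting, the clamping constants, and all names (`addipw_profit_objective`, `cluster_profit`, and so on) are illustrative assumptions rather than the paper's exact formulation; `cluster_profit` is assumed to already carry a response transformation (e.g., revenue minus incentive cost) aggregated per cluster.

```python
import torch

def addipw_profit_objective(policy_logits, logged_actions, logging_propensities,
                            cluster_ids, cluster_profit):
    """Differentiable AddIPW-style estimate of (negative) incremental profit.

    policy_logits:        (N,) logits of the learned policy pi_theta(treat | x_i)
    logged_actions:       (N,) 0/1 treatments chosen by the logging policy
    logging_propensities: (N,) P(treat | x_i) under the logging policy
    cluster_ids:          (N,) int64 cluster index of each unit
    cluster_profit:       (C,) response-transformed profit observed per cluster
    """
    pi_treat = torch.sigmoid(policy_logits)
    # Probability the learned policy assigns to the action that was logged.
    pi_logged = torch.where(logged_actions.bool(), pi_treat, 1.0 - pi_treat)
    p_logged = torch.where(logged_actions.bool(), logging_propensities,
                           1.0 - logging_propensities)
    unit_weights = pi_logged / p_logged.clamp(min=1e-3)

    # Additive aggregation: average the unit-level importance weights within
    # each cluster instead of multiplying them. The product form of standard
    # IPW has variance that explodes with cluster size; an additive outcome
    # assumption is what lets AddIPW-style estimators avoid that.
    num_clusters = cluster_profit.shape[0]
    weight_sum = torch.zeros(num_clusters).scatter_add_(0, cluster_ids, unit_weights)
    counts = torch.zeros(num_clusters).scatter_add_(
        0, cluster_ids, torch.ones_like(unit_weights))
    cluster_weight = weight_sum / counts.clamp(min=1.0)

    # Negative estimated profit: minimizing this with SGD/Adam maximizes the
    # estimated incremental profit of the policy.
    return -(cluster_weight * cluster_profit).mean()
```

Because the objective is an ordinary scalar function of the policy parameters, it can be dropped into any gradient-based training loop; per the abstract, the payoff over interference-naive objectives grows with the strength of interference.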
Related papers
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z) - MAPO: Mixed Advantage Policy Optimization [120.96975697212065]
We propose a simple but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that trajectories appear with different certainty and propose the advantage percent deviation for samples with high-certainty trajectories.
arXiv Detail & Related papers (2025-09-23T09:37:16Z) - TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making [75.29820290660065]
This paper proposes Thought-Centric Preference Optimization (TCPO) for effective embodied decision-making. It emphasizes the alignment of the model's intermediate reasoning process, mitigating the problem of model degradation. Experiments in the ALFWorld environment demonstrate an average success rate of 26.67%, achieving a 6% improvement over RL4VLM.
arXiv Detail & Related papers (2025-09-10T11:16:21Z) - Adaptive Preference Optimization with Uncertainty-aware Utility Anchor [33.74005997646761]
Offline preference optimization methods are efficient for aligning large language models (LLMs). We propose a general framework for offline preference optimization methods, which introduces an anchoring function to estimate the uncertainties introduced by preference data annotation. Our method enables training even in scenarios where the data is unpaired, significantly enhancing data utilization efficiency.
arXiv Detail & Related papers (2025-09-03T10:20:08Z) - PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning [5.922794597824468]
We propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
arXiv Detail & Related papers (2025-08-28T09:18:26Z) - Heterogeneous Causal Learning for Optimizing Aggregated Functions in User Growth [0.7100520098029438]
We propose a novel treatment effect optimization methodology to enhance user growth marketing. By leveraging deep learning, our algorithm learns from past experiments to optimize user selection and reward allocation. We experimentally demonstrate that our proposed constrained and direct optimization algorithms significantly outperform state-of-the-art methods by over 20%.
arXiv Detail & Related papers (2025-07-07T22:08:45Z) - Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO [19.5712961932773]
We revisit direct preference optimization (DPO) and demonstrate that its loss theoretically admits a decomposed reformulation. We introduce PRoximalized PReference Optimization (PRO), a unified method to align with diverse feedback types.
arXiv Detail & Related papers (2025-05-29T10:23:22Z) - Preference Optimization for Combinatorial Optimization Problems [54.87466279363487]
Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast action spaces. We propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling.
arXiv Detail & Related papers (2025-05-13T16:47:00Z) - On-the-fly Preference Alignment via Principle-Guided Decoding [27.50204023448716]
We introduce On-the-fly Preference Alignment via Principle-Guided Decoding (OPAD) to align model outputs with human preferences during inference. OPAD achieves competitive or superior performance in both general and personalized alignment tasks.
arXiv Detail & Related papers (2025-02-20T02:23:09Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Metalearners for Ranking Treatment Effects [1.469168639465869]
We show how learning to rank can maximize the area under a policy's incremental profit curve.
arXiv Detail & Related papers (2024-05-03T15:31:18Z) - Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation [46.61909578101735]
Adversarial Policy Optimization (AdvPO) is a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback.
In this paper, we introduce a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model.
arXiv Detail & Related papers (2024-03-08T09:20:12Z) - Model-based Causal Bayesian Optimization [74.78486244786083]
We introduce the first algorithm for Causal Bayesian Optimization with Multiplicative Weights (CBO-MW).
We derive regret bounds for CBO-MW that naturally depend on graph-related quantities.
Our experiments include a realistic demonstration of how CBO-MW can be used to learn users' demand patterns in a shared mobility system.
arXiv Detail & Related papers (2023-07-31T13:02:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.