DARA: Few-shot Budget Allocation in Online Advertising via In-Context Decision Making with RL-Finetuned LLMs
- URL: http://arxiv.org/abs/2601.14711v1
- Date: Wed, 21 Jan 2026 06:58:44 GMT
- Title: DARA: Few-shot Budget Allocation in Online Advertising via In-Context Decision Making with RL-Finetuned LLMs
- Authors: Mingxuan Song, Yusen Huo, Bohan Zhou, Shenglin Yin, Zhen Xiao, Jieyi Long, Zhilin Zhang, Chuan Yu,
- Abstract summary: Large Language Models offer a promising alternative for AIGB, but they lack the numerical precision required for fine-grained optimization. We propose DARA, a novel dual-phase framework that decomposes the decision-making process into two stages. Our approach consistently outperforms existing baselines in terms of cumulative advertiser value under budget constraints.
- Score: 21.30516760599435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optimizing an advertiser's cumulative value of winning impressions under budget constraints poses a complex challenge in online advertising under the paradigm of AI-Generated Bidding (AIGB). Advertisers often have personalized objectives but limited historical interaction data, resulting in few-shot scenarios where traditional reinforcement learning (RL) methods struggle to perform effectively. Large Language Models (LLMs) offer a promising alternative for AIGB by leveraging their in-context learning capabilities to generalize from limited data. However, they lack the numerical precision required for fine-grained optimization. To address this limitation, we introduce GRPO-Adaptive, an efficient LLM post-training strategy that enhances both reasoning and numerical precision by dynamically updating the reference policy during training. Built upon this foundation, we further propose DARA, a novel dual-phase framework that decomposes the decision-making process into two stages: a few-shot reasoner that generates initial plans via in-context prompting, and a fine-grained optimizer that refines these plans using feedback-driven reasoning. This separation allows DARA to combine LLMs' in-context learning strengths with the precise adaptability required by AIGB tasks. Extensive experiments in both real-world and synthetic data environments demonstrate that our approach consistently outperforms existing baselines in terms of cumulative advertiser value under budget constraints.
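The abstract describes GRPO-Adaptive only at a high level, so the following is a minimal sketch of one plausible reading: a GRPO-style update with group-relative advantages, where the KL reference policy is periodically refreshed from the current policy rather than kept frozen. The `logprob` method, `sync_every`, and `beta` are illustrative assumptions, not the paper's API.

```python
import copy
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO's core trick: normalize each sampled response's reward
    against the mean and std of its own sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

class GRPOAdaptiveTrainer:
    """Toy trainer illustrating a *dynamic* reference policy: the KL
    anchor is refreshed from the current policy every `sync_every`
    updates instead of staying frozen at the initial checkpoint."""

    def __init__(self, policy, sync_every=100, beta=0.04):
        self.policy = policy
        self.reference = copy.deepcopy(policy)   # initial KL anchor
        self.sync_every = sync_every
        self.beta = beta                         # penalty weight
        self.steps = 0

    def loss(self, prompt, responses, rewards):
        adv = group_relative_advantages(rewards)
        terms = []
        for resp, a in zip(responses, adv):
            lp = self.policy.logprob(prompt, resp)       # hypothetical API
            lp_ref = self.reference.logprob(prompt, resp)
            # Policy-gradient surrogate plus a simplified pull toward
            # the (periodically refreshed) reference policy.
            terms.append(-a * lp + self.beta * (lp - lp_ref))
        return float(np.mean(terms))

    def after_update(self):
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.reference = copy.deepcopy(self.policy)  # dynamic refresh
```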
Related papers
- AHBid: An Adaptable Hierarchical Bidding Framework for Cross-Channel Advertising [8.53485049764747]
AHBid is an Adaptable Hierarchical Bidding framework that integrates generative planning with real-time control. Experiments conducted on large-scale offline datasets and through online A/B tests demonstrate the effectiveness of AHBid.
arXiv Detail & Related papers (2026-02-26T06:07:28Z)
- RELATE: A Reinforcement Learning-Enhanced LLM Framework for Advertising Text Generation [17.34586562700226]
In online advertising, advertising text plays a critical role in attracting user engagement and driving advertiser value. We propose RELATE, a reinforcement learning-based end-to-end framework that unifies generation and objective alignment within a single model. To better capture ultimate advertiser value beyond click-level signals, we incorporate conversion-oriented metrics into the objective and jointly model them with compliance constraints as multi-dimensional rewards.
arXiv Detail & Related papers (2026-02-12T10:00:55Z)
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
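The summary above leaves the scalarization machinery abstract; as a rough, self-invented illustration of the underlying idea, the snippet below collapses a vector of per-objective rewards into a single scalar using softmax-normalized weights that a small conductor-style network could emit. Names and numbers are illustrative only, not the paper's implementation.

```python
import numpy as np

def scalarize(reward_vec, logits):
    """Collapse per-objective rewards into one scalar using softmax
    weights, so the trade-off weights always lie on the simplex."""
    w = np.exp(logits - np.max(logits))   # numerically stable softmax
    w /= w.sum()
    return float(np.dot(w, reward_vec))

# Three reward heads (e.g. helpfulness, safety, brevity) and logits
# that a lightweight conductor network might emit for this context:
print(scalarize(np.array([0.7, 0.9, 0.2]), np.array([0.5, 1.0, -0.3])))
```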
- Multi-Objective Reward and Preference Optimization: Theory and Algorithms [3.316593788543852]
This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models. ACPO, e-COP, warmPref-PS, PSPL, and MOPO advance RL across average-cost, episodic, and preference-driven paradigms. Collectively, the thesis unifies RL across these paradigms, delivering theoretical advances and practical tools for safe and aligned decision-making.
arXiv Detail & Related papers (2025-12-11T12:51:21Z)
- Data Efficient Adaptation in Large Language Models via Continuous Low-Rank Fine-Tuning [34.343514432589586]
This paper proposes DEAL, a novel framework that integrates Low-Rank Adaptation (LoRA) with a continuous fine-tuning strategy. Experiments on 15 diverse datasets show that DEAL consistently outperforms baseline methods. These findings demonstrate the potential of our approach to advance continual adaptation in Large Language Models.
arXiv Detail & Related papers (2025-09-23T12:55:57Z)
- Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs [57.82819770709032]
Large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models.
arXiv Detail & Related papers (2025-08-13T16:02:55Z)
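As a concrete illustration of the IC-DP idea, a prompt can interleave past context-aided tasks and their realized outcomes before posing the new one. The function and field names below are invented for the example, not taken from the paper.

```python
def build_icdp_prompt(solved_examples, context, history):
    """Assemble an in-context prompt from past context-aided forecasting
    tasks (with known outcomes), followed by the new, unsolved task."""
    blocks = [
        f"Context: {ex['context']}\nHistory: {ex['history']}\nForecast: {ex['forecast']}"
        for ex in solved_examples
    ]
    blocks.append(f"Context: {context}\nHistory: {history}\nForecast:")
    return "\n\n".join(blocks)

example = [{"context": "holiday week, promo running",
            "history": "102, 118, 131, 155",
            "forecast": "170, 182"}]
print(build_icdp_prompt(example, "normal week", "95, 99, 104, 101"))
```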
- RecLLM-R1: A Two-Stage Training Paradigm with Reinforcement Learning and Chain-of-Thought v1 [20.92548890511589]
This paper introduces RecLLM-R1, a novel recommendation framework leveraging Large Language Models (LLMs). RecLLM-R1 significantly surpasses existing baseline methods across a spectrum of evaluation metrics, including accuracy, diversity, and novelty.
arXiv Detail & Related papers (2025-06-24T01:39:34Z)
- Enhancing Decision-Making of Large Language Models via Actor-Critic [28.870961806283425]
Large Language Models (LLMs) have achieved remarkable advancements in natural language processing tasks. Existing methods either rely on short-term auto-regressive action generation or face limitations in accurately simulating rollouts and assessing outcomes. This paper introduces a novel LLM-based Actor-Critic framework, termed LAC, that effectively improves LLM policies with long-term action evaluations.
arXiv Detail & Related papers (2025-06-04T14:58:27Z)
- Adaptive Resource Allocation Optimization Using Large Language Models in Dynamic Wireless Environments [25.866960634041092]
Current solutions rely on domain-specific architectures or techniques, and a general deep learning (DL) approach for constrained optimization remains undeveloped. We propose a large language model for resource allocation (LLM-RAO) to address the complex resource allocation problem while adhering to constraints. LLM-RAO achieves up to a 40% performance enhancement compared to conventional DL methods and up to an 80% improvement over analytical approaches.
arXiv Detail & Related papers (2025-01-23T16:37:44Z)
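The summary does not say how LLM-RAO enforces its constraints; purely as an illustration (not the paper's method), one simple guard around any LLM-proposed allocation is to project infeasible proposals back onto the budget:

```python
def project_to_budget(allocation, capacity):
    """Scale a proposed resource allocation down uniformly if it exceeds
    the available capacity; leave feasible proposals unchanged."""
    total = sum(allocation)
    if total <= capacity or total == 0:
        return list(allocation)
    scale = capacity / total
    return [a * scale for a in allocation]

# e.g. an LLM proposes per-user power levels summing past the cap:
print(project_to_budget([4.0, 3.0, 5.0], capacity=10.0))  # scaled to sum 10
```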
- Large Language Model driven Policy Exploration for Recommender Systems [50.70228564385797]
Offline RL policies trained on static user data are vulnerable to distribution shift when deployed in dynamic online environments. Online RL-based recommender systems (RS) also face challenges in production deployment due to the risks of exposing users to untrained or unstable policies. Large Language Models (LLMs) offer a promising solution to mimic user objectives and preferences for pre-training policies offline. We propose an Interaction-Augmented Learned Policy (iALP) that utilizes user preferences distilled from an LLM.
arXiv Detail & Related papers (2025-01-23T16:37:44Z)
- Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions [59.5243730853157]
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets. This article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate these issues.
arXiv Detail & Related papers (2025-01-08T11:37:06Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning [50.9692060692705]
This paper introduces Language Models for Motion Control (LaMo), a general framework based on Decision Transformers for offline RL. Our framework highlights four crucial components, including (1) initializing Decision Transformers with sequentially pre-trained LMs and (2) employing the LoRA fine-tuning method. In particular, our method demonstrates superior performance in scenarios with limited data samples.
arXiv Detail & Related papers (2023-10-31T16:24:17Z)
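As a self-contained illustration of the LoRA fine-tuning component mentioned above (a generic sketch of the technique, not the LaMo codebase), a frozen pre-trained linear layer can be augmented with a trainable low-rank update:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pre-trained nn.Linear with a trainable low-rank update
    B @ A, keeping the original weights frozen (standard LoRA)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no-op at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))                       # same shape as the base layer
```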
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences of its use.