AdNanny: One Reasoning LLM for All Offline Ads Recommendation Tasks
- URL: http://arxiv.org/abs/2602.01563v1
- Date: Mon, 02 Feb 2026 02:56:11 GMT
- Title: AdNanny: One Reasoning LLM for All Offline Ads Recommendation Tasks
- Authors: Nan Hu, Han Li, Jimeng Sun, Lu Wang, Fangkai Yang, Bo Qiao, Pu Zhao, David Dai, Mengyu Liu, Yuefeng Zhan, Jianjin Zhang, Weihao Han, Allen Sun, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Denvy Deng, Feng Sun, Qi Zhang
- Abstract summary: Large Language Models (LLMs) have shown strong capabilities in Natural Language Understanding and Generation, but deploying them directly in online advertising systems is often impractical due to strict millisecond-level latency constraints. We introduce AdNanny, a unified reasoning-centric LLM that serves as a shared backbone for offline advertising tasks.
- Score: 57.725430699642004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have shown strong capabilities in Natural Language Understanding and Generation, but deploying them directly in online advertising systems is often impractical due to strict millisecond-level latency constraints. This has motivated the use of LLMs offline to improve retrieval, ranking, and recommendation models. Existing solutions typically fine-tune separate LLMs for individual tasks such as query-ad relevance labeling, keyword-based query generation, and user profiling. This results in redundant models, high maintenance cost, and limited performance gains despite substantial overlap in domain knowledge and reasoning patterns. We introduce AdNanny, a unified reasoning-centric LLM that serves as a shared backbone for offline advertising tasks. AdNanny is obtained by fine-tuning a public 671B-parameter DeepSeek-R1 checkpoint using a scalable training system that supports hybrid dense-MoE parallelism. We construct reasoning-augmented corpora that pair structured supervision with step-by-step natural language explanations. A multi-task supervised fine-tuning stage with adaptive reweighting enables AdNanny to handle diverse labeling and generation tasks in a consistent reasoning format. This is followed by reinforcement learning using downstream advertising metrics to align model behavior with online retrieval and ranking objectives. AdNanny is deployed in production within Bing Ads, where it significantly reduces manual labeling effort and improves accuracy across multiple offline tasks. By consolidating many task-specific models into a single reasoning-centric foundation model, AdNanny provides a scalable and cost-effective solution for large-scale advertising systems.
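The abstract names the key training ingredients (multi-task SFT with adaptive reweighting, then RL on downstream metrics) but not their implementation. Below is a minimal, hypothetical sketch of one standard way to realize adaptive task reweighting, scaling each task's loss by the inverse of its exponential moving average so no single task dominates the shared backbone; the class, task names, and loss values are illustrative assumptions, not AdNanny's actual recipe.

```python
# Hedged sketch of adaptive multi-task loss reweighting for SFT.
# NOT the paper's implementation: it shows one common scheme, inverse
# EMA-loss weighting, applied to three offline ad tasks from the abstract.

class AdaptiveReweighter:
    """Tracks a smoothed loss per task and yields inverse-loss weights."""

    def __init__(self, tasks, beta=0.9):
        self.beta = beta                    # EMA smoothing factor
        self.ema = {t: 1.0 for t in tasks}  # running per-task loss estimate

    def update(self, task, loss_value):
        self.ema[task] = self.beta * self.ema[task] + (1.0 - self.beta) * loss_value

    def weights(self):
        inv = {t: 1.0 / max(v, 1e-8) for t, v in self.ema.items()}
        scale = len(inv) / sum(inv.values())   # normalize: weights sum to #tasks
        return {t: w * scale for t, w in inv.items()}


# Illustrative step with assumed per-task batch losses.
tasks = ["relevance_labeling", "query_generation", "user_profiling"]
rw = AdaptiveReweighter(tasks)
per_task_loss = {"relevance_labeling": 0.42,
                 "query_generation": 1.30,
                 "user_profiling": 0.77}
w = rw.weights()
combined_sft_loss = sum(w[t] * per_task_loss[t] for t in tasks)
for t, l in per_task_loss.items():
    rw.update(t, l)
```

Inverse-EMA weighting is only one plausible scheme (uncertainty weighting and gradient-norm balancing are common alternatives); the abstract does not specify which form of adaptive reweighting AdNanny uses.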
Related papers
- From Reasoning LLMs to BERT: A Two-Stage Distillation Framework for Search Relevance [20.096802351171377]
E-commerce search systems face strict latency requirements that prevent the direct application of Large Language Models. We propose a two-stage reasoning distillation framework to transfer reasoning capabilities from a powerful teacher LLM to a lightweight, deployment-friendly student model. Our framework achieves significant improvements across multiple metrics, validating its effectiveness and practical value. (A minimal distillation sketch, under stated assumptions, follows this entry.)
arXiv Detail & Related papers (2025-10-13T06:46:43Z)
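The two-stage recipe above is summarized only at a high level. As a minimal sketch of the general pattern, assuming a mocked teacher call and a `bert-base-uncased` student (illustrative choices, not the paper's actual models or prompts), stage two might look like this:

```python
# Hedged sketch of distilling teacher-LLM relevance labels into a small
# cross-encoder student that meets search latency budgets. The teacher
# call is a trivial stand-in; the real pipeline's prompts and models
# are not given in this digest.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def teacher_label(query: str, doc: str) -> int:
    # Placeholder for an offline reasoning-LLM call that emits a 0/1
    # relevance label after step-by-step reasoning (assumed interface).
    return 1 if any(w in doc.lower() for w in query.lower().split()) else 0

name = "bert-base-uncased"                     # assumed student backbone
tok = AutoTokenizer.from_pretrained(name)
student = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
optim = torch.optim.AdamW(student.parameters(), lr=2e-5)

pairs = [("running shoes", "lightweight trail running shoes on sale"),
         ("running shoes", "ceramic dinner plate set")]
labels = torch.tensor([teacher_label(q, d) for q, d in pairs])

enc = tok([q for q, _ in pairs], [d for _, d in pairs],
          padding=True, truncation=True, return_tensors="pt")
loss = student(**enc, labels=labels).loss      # cross-entropy on teacher labels
loss.backward()
optim.step()
```

In the paper's framing the teacher reasons step-by-step before labeling; here that call is mocked so the snippet stays self-contained.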
- LANTERN: Scalable Distillation of Large Language Models for Job-Person Fit and Explanation [16.960316035628008]
At LinkedIn, the job-person fit task requires analyzing a candidate's public profile against job requirements to produce both a fit assessment and a detailed explanation. We introduce LANTERN, a novel LLM knowledge distillation framework tailored specifically for job-person fit tasks. We show that LANTERN significantly improves task-specific metrics for both job-person fit and explanation.
arXiv Detail & Related papers (2025-10-07T01:10:02Z)
- Multi-task Offline Reinforcement Learning for Online Advertising in Recommender Systems [54.709976343045824]
Current offline reinforcement learning (RL) methods face substantial challenges when applied to sparse advertising scenarios. We propose MTORL, a novel multi-task offline RL model that targets two key objectives. We employ multi-task learning to decode actions and rewards, simultaneously addressing channel recommendation and budget allocation. (An illustrative multi-head sketch follows this entry.)
arXiv Detail & Related papers (2025-06-29T05:05:13Z)
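The MTORL summary mentions jointly decoding actions for channel recommendation and budget allocation, without architectural detail. A shared-encoder, multi-head network is one natural fit; the sketch below is an assumption-laden illustration of that pattern (layer sizes, state dimension, and head semantics are invented for the example):

```python
# Hypothetical shared-encoder, two-head policy: a discrete head for
# channel recommendation and a bounded continuous head for budget
# allocation. All dimensions are illustrative, not from the paper.
import torch
import torch.nn as nn

class MultiTaskAdPolicy(nn.Module):
    def __init__(self, state_dim=32, n_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(          # shared user/state encoder
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.channel_head = nn.Linear(64, n_channels)   # logits over channels
        self.budget_head = nn.Sequential(               # budget fraction in (0, 1)
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, state):
        h = self.encoder(state)
        return self.channel_head(h), self.budget_head(h)

policy = MultiTaskAdPolicy()
state = torch.randn(4, 32)                     # a batch of user states
channel_logits, budget = policy(state)
print(channel_logits.shape, budget.shape)      # torch.Size([4, 8]) torch.Size([4, 1])
```

How MTORL actually couples the two heads, and how it decodes rewards offline, is not recoverable from this summary.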
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
- Offline Reinforcement Learning for LLM Multi-Step Reasoning [15.687002884103537]
OREO (Offline Reasoning Optimization) is an offline reinforcement learning method for enhancing multi-step reasoning. It reduces the need to collect pairwise data and enables better credit assignment. It surpasses existing offline learning methods on multi-step reasoning benchmarks.
arXiv Detail & Related papers (2024-12-20T18:49:45Z)
- Boosting LLM-based Relevance Modeling with Distribution-Aware Robust Learning [14.224921308101624]
We propose a novel Distribution-Aware Robust Learning framework (DaRL) for relevance modeling. DaRL has been deployed online to serve Alipay's insurance product search.
arXiv Detail & Related papers (2024-12-17T03:10:47Z)
- Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback [52.763620660061115]
ONI is a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function. We explore a range of algorithmic choices for reward modeling with varying complexity. Our approach achieves state-of-the-art performance across a range of challenging tasks from the NetHack Learning Environment.
arXiv Detail & Related papers (2024-10-30T13:52:43Z)
- QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tunes a small pretrained language model to generate optimal prompts tailored to the input queries. We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks. Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-08-20T03:06:48Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics. (A hedged sketch of the self-synthesis loop follows this entry.)
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
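SELF-GUIDE's core loop, having the student LLM synthesize its own task-specific input-output pairs and then finetuning on them, can be sketched as follows; the `generate` call and the quality filter are placeholders, and the prompts are assumptions rather than the paper's:

```python
# Hedged sketch of a self-synthetic finetuning loop in the spirit of
# SELF-GUIDE. The student LLM's sampler is mocked; a real pipeline
# would also synthesize new inputs and apply stronger filtering.
import random

def generate(prompt: str) -> str:
    # Placeholder for sampling a completion from the student LLM.
    return random.choice(["positive", "negative"])

def self_synthesize(instruction: str, seed_inputs: list[str], n: int = 100):
    examples = []
    for _ in range(n):
        x = random.choice(seed_inputs)
        y = generate(f"{instruction}\nInput: {x}\nOutput:")
        if y.strip():                          # trivial quality filter
            examples.append({"input": x, "output": y})
    return examples

instruction = "Classify the sentiment of the review as positive or negative."
seeds = ["Great battery life.", "Arrived broken and late."]
sft_data = self_synthesize(instruction, seeds, n=8)
# The pairs would then feed a standard supervised finetuning step on the
# same student model (finetuning code omitted for brevity).
print(len(sft_data), sft_data[0])
```

The reported ~15% / ~18% gains come from the paper's benchmark; this snippet only illustrates the data-synthesis shape, not the evaluation.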