GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents
- URL: http://arxiv.org/abs/2511.00802v1
- Date: Sun, 02 Nov 2025 04:47:17 GMT
- Title: GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents
- Authors: Jie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza, Ali Afoud, Fatemeh Fard,
- Abstract summary: textitGrowthHacker is a benchmark with agent and baseline methods on large-scale real-world datasets.<n>We develop the textittwo_agent framework, which reduces system complexity while preserving optimization effectiveness.<n>Results show the two_agent framework achieves 100% reliability and the highest average improvement of 106.7%.
- Score: 0.32839375042867835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the software industry shifting toward a data-driven culture, online A/B testing is a key tool for evaluating new technologies. However, deploying such experiments requires substantial resources, may negatively impact users, and involves long data collection periods. To address this, \textit{off-policy evaluation (OPE)}, or offline A/B testing, uses logged data to assess technologies and is fundamental in Reinforcement Learning, making it crucial in domains where online testing is costly or risky, such as healthcare, recommender systems, education, dialog systems, and robotics. Despite advances in coding LLMs and agentic AI, little is known about leveraging them to optimize OPE results. We investigate whether LLMs and LLM-based agents can improve OPE performance via code optimization. We propose \textit{GrowthHacker}, a benchmark with agent and baseline methods on large-scale real-world datasets, which iteratively optimizes code, evaluates results, and begins new optimization cycles. We collected datasets, established protocols, implemented baselines for OPE on the Open Bandit Pipeline (OBP)~\cite{saito2021openbanditdatasetpipeline} and Scope-RL~\cite{kiyohara2023scope}, and developed the \textit{two_agent} framework, which reduces system complexity while preserving optimization effectiveness. Results show the two_agent framework achieves 100% reliability and the highest average improvement of 106.7% among positive outcomes. Both two_agent and CrewAI reach 45% success rates, outperforming AutoGen's 34%. These findings demonstrate the feasibility of LLM-based agents as automated "growth hackers" to enhance OPE systems, with implications for scaling data-driven decision-making in production.
Related papers
- FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents [25.60249598832918]
FT-Dojo is an interactive environment comprising 13 tasks across 5 domains.<n>We develop FT-Agent, an autonomous system that mirrors human experts by leveraging evaluation-driven feedback.
arXiv Detail & Related papers (2026-03-02T10:37:11Z) - LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning [12.792070502265616]
Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields.<n>Such data often contains numerous low-quality samples, necessitating effective data processing (DP)
arXiv Detail & Related papers (2026-01-28T08:37:34Z) - ROAD: Reflective Optimization via Automated Debugging for Zero-Shot Agent Alignment [1.6968020497268546]
ROAD is a novel framework that treats optimization as a dynamic debug investigation rather than a search.<n>Road is highly sample-efficient, achieving a 5.6 percent increase in success rate and a 3.8 percent increase in search accuracy.<n>These findings suggest that mimicking the human engineering loop of failure analysis and patching offers a viable, data-efficient alternative to resource-intensive training.
arXiv Detail & Related papers (2025-12-30T07:31:34Z) - Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale [19.60416591361918]
Fine-Grained Optimization (FGO) is a scalable framework that divides large optimization tasks into manageable subsets, performs targeted optimizations, and systematically combines optimized components through progressive merging.<n> evaluation across ALFWorld, LogisticsQA, and GAIA benchmarks demonstrate that FGO outperforms existing approaches by 1.6-8.6% while reducing average prompt token consumption by 56.3%.
arXiv Detail & Related papers (2025-05-06T20:50:27Z) - DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development.<n>We present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents.<n>We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z) - AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science [5.064778712920176]
Large language models (LLMs) are increasingly used to automate data analysis through executable code generation.<n>We present $itAIRepr, an $itA$nalyst - $itI$nspector framework for automatically evaluating and improving the $itRepr$oducibility of LLM-generated data analysis.
arXiv Detail & Related papers (2025-02-23T01:15:50Z) - Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z) - SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning [14.702694298483445]
Tree-Search Enhanced LLM Agents (SELA) is an agent-based system that leverages Monte Carlo Tree Search (MCTS) to optimize the AutoML process.
SELA represents pipeline configurations as trees, enabling agents to conduct experiments intelligently and iteratively refine their strategies.
In an extensive evaluation across 20 machine learning datasets, we compare the performance of traditional and agent-based AutoML methods.
arXiv Detail & Related papers (2024-10-22T17:56:08Z) - Self-Steering Optimization: Autonomous Preference Optimization for Large Language Models [79.84205827056907]
We present Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference data.<n>$SSO$ employs a specialized optimization objective to build a data generator from the policy model itself, which is used to produce accurate and on-policy data.<n>Our evaluation shows that $SSO$ consistently outperforms baselines in human preference alignment and reward optimization.
arXiv Detail & Related papers (2024-10-22T16:04:03Z) - Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System [75.25394449773052]
Large Language Model (LLM) based multi-agent systems (MAS) show remarkable potential in collaborative problem-solving.<n>Yet they still face critical challenges: low communication efficiency, poor scalability, and a lack of effective parameter-updating optimization methods.<n>We present Optima, a novel framework that addresses these issues by significantly enhancing both communication efficiency and task effectiveness.
arXiv Detail & Related papers (2024-10-10T17:00:06Z) - Enhancing Knowledge Retrieval with In-Context Learning and Semantic Search through Generative AI [3.9773527114058855]
We propose a novel methodology that combines the generative capabilities of Large Language Models with the fast and accurate retrieval capabilities of vector databases.
The developed model, Generative Text Retrieval (GTR), is adaptable to both unstructured and structured data with minor refinement.
The refined model, Generative Tabular Text Retrieval (GTR-T), demonstrated its efficiency in large database querying.
arXiv Detail & Related papers (2024-06-13T23:08:06Z) - DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses large language models (LLMs) agent and case-based reasoning (CBR)
In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle.
In the deployment stage, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm, significantly reducing the demand on foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.