Related papers: Agent-in-the-Loop: A Data Flywheel for Continuous Improvement in LLM-based Customer Support

Agent-in-the-Loop: A Data Flywheel for Continuous Improvement in LLM-based Customer Support

URL: http://arxiv.org/abs/2510.06674v2
Date: Thu, 09 Oct 2025 02:01:51 GMT
Title: Agent-in-the-Loop: A Data Flywheel for Continuous Improvement in LLM-based Customer Support
Authors: Cen Mia Zhao, Tiantian Zhang, Hanchen Su, Yufeng Wayne Zhang, Shaowei Su, Mingzhi Xu, Yu Elaine Liu, Wei Han, Jeremy Werner, Claire Na Cheng, Yashar Mehdad,
Abstract summary: We introduce an Agent-in-theLoop framework that implements a continuous data flywheel for iteratively improving an LLM-based customer support system.<n>Unlike standard offline approaches that rely on batch annotations, AITL integrates four key types of annotations directly into live customer operations.
Score: 8.580317550913028
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce an Agent-in-the-Loop (AITL) framework that implements a continuous data flywheel for iteratively improving an LLM-based customer support system. Unlike standard offline approaches that rely on batch annotations, AITL integrates four key types of annotations directly into live customer operations: (1) pairwise response preferences, (2) agent adoption and rationales, (3) knowledge relevance checks, and (4) identification of missing knowledge. These feedback signals seamlessly feed back into models' updates, reducing retraining cycles from months to weeks. Our production pilot involving US-based customer support agents demonstrated significant improvements in retrieval accuracy (+11.7% recall@75, +14.8% precision@8), generation quality (+8.4% helpfulness) and agent adoption rates (+4.5%). These results underscore the effectiveness of embedding human feedback loops directly into operational workflows to continuously refine LLM-based customer support system.

Related papers

RGAlign-Rec: Ranking-Guided Alignment for Latent Query Reasoning in Recommendation Systems [25.34524038198569]
We propose RGAlign-Rec, a closed-loop alignment framework for proactive intent prediction.<n>We also introduce Ranking-Guided Alignment (RGA), a multi-stage training paradigm.<n>Our framework achieves a 0.12% gain in GAUC, leading to a significant 3.52% relative reduction in error rate, and a 0.56% improvement in Recall@3.
arXiv Detail & Related papers (2026-02-13T14:38:02Z)
Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification [71.98473277917962]
Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving.<n>We propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics.<n>We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification.
arXiv Detail & Related papers (2026-01-22T09:47:31Z)
Language Model Agents Under Attack: A Cross Model-Benchmark of Profit-Seeking Behaviors in Customer Service [15.896831937702174]
Cross-domain benchmark of profit-seeking direct prompt injection in customer-service interactions.<n>100 realistic attack scripts grouped into five technique families.<n>Attacks are highly domain-dependent (airline support is most exploitable) and technique-dependent (payload is most consistently effective)
arXiv Detail & Related papers (2025-12-30T18:57:52Z)
Adaptive Data Flywheel: Applying MAPE Control Loops to AI Agent Improvement [8.230420096371407]
We present a practical implementation of a data flywheel in NVInfo AI, NVIDIA's Mixture-of-Experts (MoE) Knowledge Assistant serving over 30,000 employees.<n>We built a closed-loop system that addresses failures in retrieval-augmented generation (RAG) pipelines and enables continuous learning.<n>For routing, we replaced a Llama 3.1B model with a fine-tuned 8B variant, achieving 96% accuracy, a 10x reduction in model size, and 70% latency improvement.
arXiv Detail & Related papers (2025-10-30T23:41:06Z)
Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control [50.316067647636196]
This paper introduces Succeed or Learn Slowly (SoLS), a novel off-policyReinforcement learning algorithm evaluated on mobile app control tasks.<n>SoLS improves sample efficiency when fine-tuning foundation models for user interface navigation via a modified off-policy actor-critic approach.<n>We augment SoLS with Successful Transition Replay (STR), which prioritises learning from successful interactions.
arXiv Detail & Related papers (2025-09-01T18:55:27Z)
Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning [14.254037571895404]
Large Language Models (LLMs) have demonstrated remarkable progress through preference-based fine-tuning.<n>This paper introduces Refine-n-Judge, an automated iterative approach that leverages a single LLM as both a refiner and a judge to enhance dataset quality.<n>We demonstrate the effectiveness of Refine-n-Judge across a range of public datasets spanning five corpora, targeting tasks such as coding, math, and conversation.
arXiv Detail & Related papers (2025-08-03T01:56:03Z)
Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data [51.62162460809116]
We introduce Dynamic Noise Preference Optimization (DNPO) to ensure consistent improvements across iterations.<n>In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6%.<n> DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations.
arXiv Detail & Related papers (2025-02-08T01:20:09Z)
MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification [5.666070277424383]
MAG-V is a framework to generate a dataset of questions that mimic customer queries.<n>Our synthetic data can improve agent performance on actual customer queries.
arXiv Detail & Related papers (2024-11-28T19:36:11Z)
Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets. The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
Self-Boosting Large Language Models with Synthetic Preference Data [97.94185115047999]
We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities. SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.
arXiv Detail & Related papers (2024-10-09T14:57:31Z)
DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses large language models (LLMs) agent and case-based reasoning (CBR) In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle. In the deployment stage, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm, significantly reducing the demand on foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z)
Towards Reliable and Fluent Large Language Models: Incorporating Feedback Learning Loops in QA Systems [10.58737969057445]
We build a dataset to train a critic model capable of evaluating the citation, correctness, and fluency of responses generated by large language models. We propose an automated feedback mechanism that leverages the critic model to offer real-time feedback on heterogeneous aspects of generated text. Experimental results demonstrate the efficacy of our approach, including a 4% precision increase in citation and an approximately 8% enhancement in the MAUVE metric for fluency.
arXiv Detail & Related papers (2023-09-08T09:39:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.