Related papers: CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production

URL: http://arxiv.org/abs/2603.01973v1
Date: Mon, 02 Mar 2026 15:27:31 GMT
Title: CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production
Authors: Yixin Nie, Lin Guan, Zhongyao Ma, Anchit Gupta, Yipin Zhou, Xiao Li, Zhengping Zhou, Raymond Zeng, Gelin Zhou, Shigan Chu, Ajay Thampi, Wancen Mu, Nathan Shuster, Ketong Wang, Lin Chen, Jason Brewer, Derek Hao Hu, Alexander McCauley, Jason Weston, Sem Park, Na Zhang, Kevin Tang,
Abstract summary: CharacterFlywheel is an iterative process for improving large language models (LLMs) in production social chat applications. We refined models across 15 generations using data from both internal and external real-user traffic. We conducted 7-day A/B tests showing consistent engagement improvements.
Score: 52.85500933801205
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This report presents CharacterFlywheel, an iterative flywheel process for improving large language models (LLMs) in production social chat applications across Instagram, WhatsApp, and Messenger. Starting from LLaMA 3.1, we refined models across 15 generations using data from both internal and external real-user traffic. Through continuous deployments from July 2024 to April 2025, we conducted controlled 7-day A/B tests showing consistent engagement improvements: 7 of 8 newly deployed models demonstrated positive lift over the baseline, with the strongest performers achieving up to 8.8% improvement in engagement breadth and 19.4% in engagement depth. We also observed substantial gains in steerability, with instruction following increasing from 59.2% to 84.8% and instruction violations decreasing from 26.6% to 5.8%. We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online evaluation to ensure reliable progress at each optimization step. We also discuss our methods for overfitting prevention and navigating production dynamics at scale. These contributions advance the scientific rigor and understanding of LLMs in social applications serving millions of users.

Related papers

How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs [49.61011897610774]
How2Everything is a framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics. How2Score is an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal.
arXiv Detail & Related papers (2026-02-09T15:47:14Z)
Agent-in-the-Loop: A Data Flywheel for Continuous Improvement in LLM-based Customer Support [8.580317550913028]
We introduce an Agent-in-the-Loop (AITL) framework that implements a continuous data flywheel for iteratively improving an LLM-based customer support system. Unlike standard offline approaches that rely on batch annotations, AITL integrates four key types of annotations directly into live customer operations.
arXiv Detail & Related papers (2025-10-08T05:57:04Z)
A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models [53.31664844941449]
ProActive Self-Refinement (PASR) is a novel method for improving large language models (LLMs). Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context. We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR.
arXiv Detail & Related papers (2025-08-18T13:07:21Z)
Think, Prune, Train, Improve: Scaling Reasoning without Scaling Models [1.96238419451815]
Large language models (LLMs) have demonstrated strong capabilities in programming and mathematical reasoning tasks, but are constrained by limited high-quality training data. We introduce a scalable framework that iteratively fine-tunes models on their own reasoning traces, using ground-truth pruning to ensure high-quality training data. This approach yields improved performance: on GSM8K, Gemma2-2B achieves a Pass@1 of 57.6% (from 41.9%), Gemma2-9B reaches 82%, matching LLaMA-3.1-70B, and LLaMA-3.1-70B attains 91%, even surpassing GPT-4o.
arXiv Detail & Related papers (2025-04-25T06:48:55Z)
Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance. Existing direct preference learning algorithms are originally designed for the single-turn chat task. We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z)
Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation [20.41379322900742]
We introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks. We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning.
arXiv Detail & Related papers (2024-07-15T15:33:45Z)
Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web [69.6913064185993]
Language model agents (LMA) emerged as a promising paradigm on multi-step decision making tasks. Despite the promise, their performance on real-world applications is still underexplored. We show that while existing LMAs achieve 94.0% average success rate on base tasks, their performance degrades to 24.9% success rate on compositional tasks.
arXiv Detail & Related papers (2023-11-30T17:50:47Z)
Democratizing LLMs: An Exploration of Cost-Performance Trade-offs in Self-Refined Open-Source Models [53.859446823312126]
SoTA open source models of varying sizes from 7B to 65B, on average, improve 8.2% from their baseline performance. Strikingly, even models with extremely small memory footprints, such as Vicuna-7B, show an 11.74% improvement overall and up to a 25.39% improvement in high-creativity, open-ended tasks.
arXiv Detail & Related papers (2023-10-11T15:56:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
