Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena
- URL: http://arxiv.org/abs/2407.10627v1
- Date: Mon, 15 Jul 2024 11:26:07 GMT
- Title: Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena
- Authors: Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Qingwei Lin, Jianguang Lou, Shifeng Chen, Yansong Tang, Weizhu Chen
- Abstract summary: We introduce Arena Learning, an innovative offline strategy designed to simulate arena battles using AI-driven annotations.
Arena Learning ensures precise evaluations and maintains consistency between offline simulations and online competitions.
We apply Arena Learning to train our target model, WizardLM-$\beta$, and demonstrate significant performance enhancements.
- Score: 126.70522244144088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Assessing the effectiveness of large language models (LLMs) presents substantial challenges. The method of conducting human-annotated battles in an online Chatbot Arena is a highly effective evaluative technique. However, this approach is limited by the costs and time required for human annotation. In this paper, we introduce Arena Learning, an innovative offline strategy designed to simulate these arena battles using AI-driven annotations to evaluate battle outcomes, thus facilitating the continuous improvement of the target model through both supervised fine-tuning and reinforcement learning. Arena Learning comprises two key elements. First, it ensures precise evaluations and maintains consistency between offline simulations and online competitions via WizardArena, a pipeline developed to accurately predict the Elo rankings of various models using a meticulously designed offline test set. Our results demonstrate that WizardArena's predictions closely align with those from the online Arena. Second, it involves the continuous improvement of training data based on the battle results and the refined model. We establish a data flywheel to iteratively update the training data by highlighting the weaknesses of the target model based on its battle results, enabling it to learn from the strengths of multiple different models. We apply Arena Learning to train our target model, WizardLM-$\beta$, and demonstrate significant performance enhancements across various metrics. This fully automated training and evaluation pipeline sets the stage for continuous advancements in various LLMs via post-training. Notably, Arena Learning plays a pivotal role in the success of WizardLM-2, and this paper serves both as an exploration of its efficacy and a foundational study for future discussions related to WizardLM-2 and its derivatives.
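The abstract does not spell out how the simulated battles are scored or how the data flywheel selects examples. Purely as a rough sketch under assumed details, the snippet below shows how an offline pipeline of this kind could pair two models' responses on a prompt, ask an AI judge for a verdict, update Elo ratings, and collect the prompts the target model loses as candidates for the next fine-tuning round. The `judge` callable, the K-factor of 16, and the toy stub judge are illustrative assumptions, not details from the paper.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a: float, r_b: float, score_a: float, k: float = 16.0):
    """Return updated (r_a, r_b) after one battle; score_a is 1.0, 0.5, or 0.0 for model A."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

def battle(prompt: str, answer_a: str, answer_b: str, judge) -> float:
    """Score one simulated battle with an AI judge (a hypothetical callable returning two scores)."""
    s_a, s_b = judge(prompt, answer_a, answer_b)
    return 1.0 if s_a > s_b else 0.0 if s_a < s_b else 0.5

def run_arena(prompts, target_answers, opponent_answers, judge, r_target=1000.0, r_opp=1000.0):
    """Simulate battles on an offline test set; return ratings and the prompts the target lost."""
    hard_prompts = []                      # feeds the next fine-tuning round (the "data flywheel")
    for p, a, b in zip(prompts, target_answers, opponent_answers):
        outcome = battle(p, a, b, judge)
        r_target, r_opp = update_elo(r_target, r_opp, outcome)
        if outcome < 0.5:                  # the target model lost: mark this prompt as a weakness
            hard_prompts.append(p)
    return r_target, r_opp, hard_prompts

# Toy usage with a stub judge that simply prefers longer answers (placeholder for an LLM judge).
stub_judge = lambda p, a, b: (len(a), len(b))
print(run_arena(["Explain Elo ratings."], ["Short answer."], ["A somewhat longer answer."], stub_judge))
```

In a full pipeline of this kind, the losing prompts would presumably be paired with the winning models' responses to build supervised fine-tuning or preference data; that step is omitted from the sketch.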
Related papers
- Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z)
- InternLM2 Technical Report [159.70692271378581]
This paper introduces InternLM2, an open-source Large Language Model (LLM) that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks.
The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types.
InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-03-26T00:53:24Z)
- PILOT: A Pre-Trained Model-Based Continual Learning Toolbox [71.63186089279218]
This paper introduces a pre-trained model-based continual learning toolbox known as PILOT.
On the one hand, PILOT implements some state-of-the-art class-incremental learning algorithms based on pre-trained models, such as L2P, DualPrompt, and CODA-Prompt.
On the other hand, PILOT fits typical class-incremental learning algorithms within the context of pre-trained models to evaluate their effectiveness.
arXiv Detail & Related papers (2023-09-13T17:55:11Z)
- Improving Online Continual Learning Performance and Stability with Temporal Ensembles [30.869268130955145]
We study the effect of model ensembling as a way to improve performance and stability in online continual learning.
We use a lightweight temporal ensemble that computes an exponential moving average (EMA) of the model weights for use at test time.
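As a loose illustration of the weight-EMA idea mentioned above (the decay value and the plain-dictionary parameters are made-up stand-ins, not details from that paper):

```python
def update_ema(ema_params: dict, online_params: dict, decay: float = 0.999) -> dict:
    """One EMA step over parameters: ema <- decay * ema + (1 - decay) * online."""
    return {name: decay * ema_params[name] + (1.0 - decay) * p
            for name, p in online_params.items()}

# Toy usage: the online model keeps training; the EMA copy is the one evaluated at test time.
online = {"w1": 0.2, "w2": -1.3}
ema = dict(online)                                      # initialise EMA from the online weights
for _ in range(3):
    online = {k: v - 0.05 for k, v in online.items()}   # stand-in for one gradient update
    ema = update_ema(ema, online)
print(ema)
```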
arXiv Detail & Related papers (2023-06-29T09:53:24Z)
- Learning to Modulate pre-trained Models in RL [22.812215561012874]
Fine-tuning a pre-trained model often suffers from catastrophic forgetting.
Our study shows that with most fine-tuning approaches, the performance on pre-training tasks deteriorates significantly.
We propose a novel method, Learning-to-Modulate (L2M), that avoids the degradation of learned skills by modulating the information flow of the frozen pre-trained model.
arXiv Detail & Related papers (2023-06-26T17:53:05Z)
- Model-Based Reinforcement Learning with Multi-Task Offline Pretraining [59.82457030180094]
We present a model-based RL method that learns to transfer potentially useful dynamics and action demonstrations from offline data to a novel task.
The main idea is to use the world models not only as simulators for behavior learning but also as tools to measure the task relevance.
We demonstrate the advantages of our approach compared with the state-of-the-art methods in Meta-World and DeepMind Control Suite.
arXiv Detail & Related papers (2023-06-06T02:24:41Z)
- Perceiving the World: Question-guided Reinforcement Learning for Text-based Games [64.11746320061965]
This paper introduces world-perceiving modules, which automatically decompose tasks and prune actions by answering questions about the environment.
We then propose a two-phase training framework to decouple language learning from reinforcement learning, which further improves the sample efficiency.
arXiv Detail & Related papers (2022-03-20T04:23:57Z)
- Accelerating Reinforcement Learning for Reaching using Continuous Curriculum Learning [6.703429330486276]
We focus on accelerating reinforcement learning (RL) training and improving the performance of multi-goal reaching tasks.
Specifically, we propose a precision-based continuous curriculum learning (PCCL) method in which the precision requirements are gradually adjusted during training (a toy sketch of such a schedule follows this entry).
This approach is tested using a Universal Robots UR5e arm in both simulation and real-world multi-goal reach experiments.
arXiv Detail & Related papers (2020-02-07T10:08:18Z)
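The PCCL entry above describes gradually tightening precision requirements over the course of training. The schedule below is only a guessed illustration of that idea; the geometric decay form and the tolerance values are assumptions, not taken from that paper.

```python
def precision_tolerance(step: int, total_steps: int,
                        start_tol: float = 0.10, end_tol: float = 0.01) -> float:
    """Success tolerance (metres) that shrinks smoothly from start_tol to end_tol over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start_tol * (end_tol / start_tol) ** frac    # geometric interpolation

def reach_success(distance_to_goal: float, step: int, total_steps: int) -> bool:
    """A reach counts as successful if the end-effector is within the current tolerance."""
    return distance_to_goal <= precision_tolerance(step, total_steps)

# Early in training a 7 cm miss still counts as a success; near the end it does not.
print(reach_success(0.07, step=100, total_steps=10_000))    # True
print(reach_success(0.07, step=9_900, total_steps=10_000))  # False
```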
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.