StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
- URL: http://arxiv.org/abs/2406.08747v2
- Date: Thu, 31 Oct 2024 03:16:13 GMT
- Title: StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
- Authors: Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee,
- Abstract summary: Large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment.
We introduce StreamBench, a benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence.
Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios.
- Score: 63.54557575233165
- License:
- Abstract: Recent works have shown that large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment. However, existing benchmarks primarily evaluate their innate capabilities and do not assess their ability to improve over time. To address this gap, we introduce StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment where LLMs receive a continuous flow of feedback stream and iteratively enhance their performance. In addition, we propose several simple yet effective baselines for improving LLMs on StreamBench, and provide a comprehensive analysis to identify critical components that contribute to successful streaming strategies. Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios. Source code: https://github.com/stream-bench/stream-bench. Benchmark website: https://stream-bench.github.io.
Related papers
- Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning [14.156753196673598]
This paper introduces a novel approach to produce high-quality reasoning traces for Large Language Models fine-tuning.
Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions.
We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real-time.
arXiv Detail & Related papers (2024-10-29T17:50:31Z) - Sparse Rewards Can Self-Train Dialogue Agents [22.799506097310008]
We introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback.
We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ.
We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks.
arXiv Detail & Related papers (2024-09-06T21:00:57Z) - One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs)
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z) - Is Your LLM Outdated? Evaluating LLMs at Temporal Generalization [37.58752947129519]
The rapid advancement of Large Language Models (LLMs) highlights the urgent need for evolving evaluation methodologies.
Traditional benchmarks, which are often static, fail to capture the continually changing information landscape.
Our study examines temporal generalization, which includes the ability to understand, predict, and generate text relevant to past, present, and future contexts.
arXiv Detail & Related papers (2024-05-14T09:31:31Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Feedback Loops With Language Models Drive In-Context Reward Hacking [78.9830398771605]
We show that feedback loops can cause in-context reward hacking (ICRH)
We identify and study two processes that lead to ICRH: output-refinement and policy-refinement.
As AI development accelerates, the effects of feedback loops will proliferate.
arXiv Detail & Related papers (2024-02-09T18:59:29Z) - AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z) - La-MAML: Look-ahead Meta Learning for Continual Learning [14.405620521842621]
We propose Look-ahead MAML (La-MAML), a fast optimisation-based meta-learning algorithm for online-continual learning, aided by a small episodic memory.
La-MAML achieves performance superior to other replay-based, prior-based and meta-learning based approaches for continual learning on real-world visual classification benchmarks.
arXiv Detail & Related papers (2020-07-27T23:07:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.