StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
- URL: http://arxiv.org/abs/2406.08747v1
- Date: Thu, 13 Jun 2024 02:08:28 GMT
- Authors: Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee
- Abstract summary: We introduce StreamBench, a benchmark designed to evaluate the continuous improvement of large language model (LLM) agents over an input-feedback sequence.
StreamBench simulates an online learning environment where LLMs receive a continuous stream of feedback and iteratively enhance their performance.
Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent works have shown that large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment. However, existing benchmarks primarily evaluate their innate capabilities and do not assess their ability to improve over time. To address this gap, we introduce StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment where LLMs receive a continuous stream of feedback and iteratively enhance their performance. In addition, we propose several simple yet effective baselines for improving LLMs on StreamBench, and provide a comprehensive analysis to identify critical components that contribute to successful streaming strategies. Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios.
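The input-feedback sequence described in the abstract can be illustrated with a minimal sketch. All names below (`StreamingAgent`, `run_stream`) are hypothetical and are not StreamBench's actual API; the agent here is a toy stand-in for an LLM that remembers binary correctness feedback and avoids repeating answers it was told were wrong.

```python
class StreamingAgent:
    """Toy agent that improves over a stream using only binary feedback.

    Hypothetical illustration of the online setting StreamBench simulates:
    the agent never sees gold labels, only whether its prediction was correct.
    """

    def __init__(self, labels):
        self.labels = list(labels)   # candidate answers for every input
        self.wrong = {}              # input -> set of answers judged wrong
        self.right = {}              # input -> answer judged correct

    def predict(self, x):
        # Reuse a known-correct answer if this input was seen before.
        if x in self.right:
            return self.right[x]
        # Otherwise pick the first candidate not yet ruled out by feedback.
        ruled_out = self.wrong.get(x, set())
        for label in self.labels:
            if label not in ruled_out:
                return label
        return self.labels[0]

    def update(self, x, prediction, correct):
        # Binary feedback: remember successes, rule out failures.
        if correct:
            self.right[x] = prediction
        else:
            self.wrong.setdefault(x, set()).add(prediction)


def run_stream(agent, stream):
    """Feed (input, gold) pairs one by one; return the number correct.

    The gold label is used only to compute the feedback bit, mirroring an
    online environment where the agent receives feedback, not answers.
    """
    correct_count = 0
    for x, gold in stream:
        pred = agent.predict(x)
        is_correct = pred == gold
        agent.update(x, pred, is_correct)
        correct_count += is_correct
    return correct_count
```

On a stream with repeated inputs, accuracy rises over time because feedback from earlier instances constrains later predictions, which is the continuous-improvement behavior the benchmark is designed to measure.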
Related papers
- CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning [17.614980614656407]
We propose Continual Generative training for Incremental prompt-Learning, a novel approach to mitigate forgetting while adapting a VLM.
We demonstrate the effectiveness of our framework in adapting to new tasks while improving zero-shot capabilities.
arXiv Detail & Related papers (2024-07-22T16:51:28Z)
- ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights [38.03704123835915]
Large-scale generative language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following.
We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience insights from sub-optimal demonstrations and human feedback.
Our ICAL agent surpasses the state-of-the-art in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D.
arXiv Detail & Related papers (2024-06-20T17:45:02Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- CoEvol: Constructing Better Responses for Instruction Finetuning through Multi-Agent Cooperation [33.33513657902765]
We propose CoEvol, an LLM-based multi-agent cooperation framework for the improvement of responses to instructions.
Empirically, models equipped with CoEvol outperform competitive baselines evaluated by MT-Bench and AlpacaEval.
arXiv Detail & Related papers (2024-06-11T08:35:37Z)
- One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs).
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z)
- Is Your LLM Outdated? Evaluating LLMs at Temporal Generalization [37.58752947129519]
The rapid advancement of Large Language Models (LLMs) highlights the urgent need for evolving evaluation methodologies.
Traditional benchmarks, which are often static, fail to capture the continually changing information landscape.
Our study examines temporal generalization, which includes the ability to understand, predict, and generate text relevant to past, present, and future contexts.
arXiv Detail & Related papers (2024-05-14T09:31:31Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains an open problem.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning [79.32236399694077]
Low-quality data in the training set are usually detrimental to instruction tuning.
We propose a novel method, termed "reflection-tuning".
This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data.
arXiv Detail & Related papers (2023-10-18T05:13:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.