StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
- URL: http://arxiv.org/abs/2406.08747v2
- Date: Thu, 31 Oct 2024 03:16:13 GMT
- Title: StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
- Authors: Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee,
- Abstract summary: Large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment.
We introduce StreamBench, a benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence.
Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios.
- Score: 63.54557575233165
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent works have shown that large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment. However, existing benchmarks primarily evaluate their innate capabilities and do not assess their ability to improve over time. To address this gap, we introduce StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment where LLMs receive a continuous flow of feedback stream and iteratively enhance their performance. In addition, we propose several simple yet effective baselines for improving LLMs on StreamBench, and provide a comprehensive analysis to identify critical components that contribute to successful streaming strategies. Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios. Source code: https://github.com/stream-bench/stream-bench. Benchmark website: https://stream-bench.github.io.
Related papers
- The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement [49.687224320842105]
Large language models (LLMs) have recently transformed from text-based assistants to autonomous agents capable of planning, reasoning, and iteratively improving their actions.
In this work, we introduce Critique-Guided Improvement (CGI), a novel two-player framework, comprising an actor model that explores an environment and a critic model that generates detailed nature language feedback.
arXiv Detail & Related papers (2025-03-20T10:42:33Z) - SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding [56.78088668917983]
We introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains.
We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs of 1,353 streaming videos.
Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding.
arXiv Detail & Related papers (2025-02-15T14:29:44Z) - Leveraging Metamemory Mechanisms for Enhanced Data-Free Code Generation in LLMs [44.80420740455364]
M2WF is a framework for improving large language models' one-time code generation.
Unlike prior methods, it minimizes dependency on curated data and adapts to various coding scenarios.
The code and framework will be publicly available on GitHub and HuggingFace.
arXiv Detail & Related papers (2025-01-14T07:16:43Z) - OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision.
We propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations.
We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
arXiv Detail & Related papers (2024-12-12T18:55:18Z) - Training Agents with Weakly Supervised Feedback from Large Language Models [19.216542820742607]
This paper introduces a novel training method for LLM-based agents using weakly supervised signals from a critic LLM.
Our agents are trained in iterative manner, where they initially generate trajectories through environmental interaction.
Tests on the API-bank dataset show consistent improvement in our agents' capabilities and comparable performance to GPT-4.
arXiv Detail & Related papers (2024-11-29T08:47:04Z) - Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning [14.156753196673598]
This paper introduces a novel approach to produce high-quality reasoning traces for Large Language Models fine-tuning.
Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions.
We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real-time.
arXiv Detail & Related papers (2024-10-29T17:50:31Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and.
Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting.
LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - Sparse Rewards Can Self-Train Dialogue Agents [22.799506097310008]
We introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback.
We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ.
We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks.
arXiv Detail & Related papers (2024-09-06T21:00:57Z) - Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding [61.45448947483328]
We introduce Lossless Acceleration via Speculative Decoding for LLM-based Recommender Systems (LASER)
LASER features a Customized Retrieval Pool to enhance retrieval efficiency and Relaxed Verification to improve the acceptance rate of draft tokens.
LASER achieves a 3-5x speedup on public datasets and saves about 67% of computational resources during the online A/B test.
arXiv Detail & Related papers (2024-08-11T02:31:13Z) - One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs)
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z) - Is Your LLM Outdated? Evaluating LLMs at Temporal Generalization [37.58752947129519]
The rapid advancement of Large Language Models (LLMs) highlights the urgent need for evolving evaluation methodologies.
Traditional benchmarks, which are often static, fail to capture the continually changing information landscape.
Our study examines temporal generalization, which includes the ability to understand, predict, and generate text relevant to past, present, and future contexts.
arXiv Detail & Related papers (2024-05-14T09:31:31Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z) - La-MAML: Look-ahead Meta Learning for Continual Learning [14.405620521842621]
We propose Look-ahead MAML (La-MAML), a fast optimisation-based meta-learning algorithm for online-continual learning, aided by a small episodic memory.
La-MAML achieves performance superior to other replay-based, prior-based and meta-learning based approaches for continual learning on real-world visual classification benchmarks.
arXiv Detail & Related papers (2020-07-27T23:07:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.