TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models
- URL: http://arxiv.org/abs/2310.06762v1
- Date: Tue, 10 Oct 2023 16:38:49 GMT
- Title: TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models
- Authors: Xiao Wang, Yuansen Zhang, Tianze Chen, Songyang Gao, Senjie Jin,
Xianjun Yang, Zhiheng Xi, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xuanjing
Huang
- Abstract summary: Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
- Score: 52.734140807634624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligned large language models (LLMs) demonstrate exceptional capabilities in
task-solving, following instructions, and ensuring safety. However, the
continual learning aspect of these aligned LLMs has been largely overlooked.
Existing continual learning benchmarks lack sufficient challenge for leading
aligned LLMs, owing to both their simplicity and the models' potential exposure
during instruction tuning. In this paper, we introduce TRACE, a novel benchmark
designed to evaluate continual learning in LLMs. TRACE consists of 8 distinct
datasets spanning challenging tasks including domain-specific tasks,
multilingual capabilities, code generation, and mathematical reasoning. All
datasets are standardized into a unified format, allowing for effortless
automatic evaluation of LLMs. Our experiments show that after training on
TRACE, aligned LLMs exhibit significant declines in both general ability and
instruction-following capabilities. For example, the accuracy of llama2-chat
13B on the gsm8k dataset declined precipitously from 28.8% to 2% after training
on our datasets. This highlights the challenge of finding a suitable tradeoff
between achieving strong performance on specific tasks and preserving the
original prowess of LLMs. Empirical findings suggest that tasks inherently equipped with
reasoning paths contribute significantly to preserving certain capabilities of
LLMs against potential declines. Motivated by this, we introduce the
Reasoning-augmented Continual Learning (RCL) approach. RCL integrates
task-specific cues with meta-rationales, effectively reducing catastrophic
forgetting in LLMs while expediting convergence on novel tasks.
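To make the two key ideas above concrete, here is a minimal Python sketch of (a) a unified-format record of the kind TRACE standardizes its 8 datasets into, and (b) RCL-style training-target construction that combines task-specific cues with a meta-rationale. The field names, prompt wording, and helper are illustrative assumptions, not the authors' exact schema or released code.

```python
# Illustrative sketch only: the record fields, prompt wording, and
# rationale source below are assumptions, not TRACE's actual schema
# or the authors' released code.
from dataclasses import dataclass


@dataclass
class TraceExample:
    """A unified-format record, mirroring how TRACE standardizes
    all 8 datasets for automatic evaluation."""
    task: str         # e.g. "code generation" or "mathematical reasoning"
    instruction: str  # task-specific cue describing what to do
    input: str        # the problem instance
    output: str       # the gold answer


def build_rcl_target(example: TraceExample, rationale: str) -> dict:
    """Combine the task-specific cue with a meta-rationale so the
    model is trained to reason toward the answer rather than to map
    inputs directly to labels."""
    prompt = (
        f"{example.instruction}\n\n{example.input}\n\n"
        "Let's reason step by step."  # hypothetical cue phrasing
    )
    target = f"{rationale}\nAnswer: {example.output}"
    return {"prompt": prompt, "target": target}
```

The intuition, per the abstract, is that training targets carrying an explicit reasoning path both reduce catastrophic forgetting and speed convergence on the new task; where the rationale comes from (the model itself, a stronger LLM, or annotation) is left open in this sketch.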
Related papers
- Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement [51.601916604301685]
Large language models (LLMs) generate content that can undermine trust in online discourse.
Current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-AI collaboration.
To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content.
arXiv Detail & Related papers (2024-10-18T08:14:10Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- LinkGPT: Teaching Large Language Models To Predict Missing Links [23.57145845001286]
Large Language Models (LLMs) have shown promising results on various language and vision tasks.
Recently, there has been growing interest in applying LLMs to graph-based tasks, particularly on Text-Attributed Graphs (TAGs).
arXiv Detail & Related papers (2024-06-07T04:54:36Z)
- Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization [12.885866125783618]
Large Language Models (LLMs) tend to produce inaccurate responses to specific queries.
We construct an adversarial dataset, named ADT (Adversarial Dataset for Tokenizer), to challenge LLMs' tokenization.
Our empirical results reveal that ADT is highly effective at challenging the tokenization of leading LLMs, including GPT-4o, Llama-3, and Qwen2.5-max.
arXiv Detail & Related papers (2024-05-27T11:39:59Z)
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [56.75702900542643]
We introduce AlphaLLM for the self-improvement of Large Language Models.
It integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop.
Our experimental results show that AlphaLLM significantly enhances the performance of LLMs without additional annotations.
arXiv Detail & Related papers (2024-04-18T15:21:34Z)
- Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model Rollouts [10.929547354171723]
This paper introduces Knowledgeable Agents from Language Model Rollouts (KALM).
It extracts knowledge from large language models (LLMs) in the form of imaginary rollouts that can be easily learned by the agent through offline reinforcement learning methods.
It achieves a success rate of 46% in executing tasks with unseen goals, substantially surpassing the 26% success rate achieved by baseline methods.
arXiv Detail & Related papers (2024-04-14T13:19:40Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning [79.32236399694077]
Low-quality data in the training set are usually detrimental to instruction tuning.
We propose a novel method, termed "reflection-tuning".
This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data (a hedged sketch follows this list).
arXiv Detail & Related papers (2023-10-18T05:13:47Z)
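As referenced in the Reflection-Tuning entry above, here is a hedged sketch of what oracle-based data recycling could look like; the prompt wording and the `oracle_complete` helper are hypothetical stand-ins, not the paper's actual procedure or API.

```python
# Hedged sketch of reflection-tuning-style data recycling. The prompt
# wording and the `oracle_complete` helper are hypothetical stand-ins,
# not the paper's actual procedure or API.

def oracle_complete(prompt: str) -> str:
    """Placeholder for a call into an oracle LLM."""
    raise NotImplementedError


def recycle_pair(instruction: str, response: str) -> tuple[str, str]:
    """Ask the oracle to introspect on a training pair and return an
    improved instruction plus a response consistent with it."""
    better_instruction = oracle_complete(
        "Critique the following instruction for clarity, then rewrite it:\n"
        f"{instruction}"
    )
    better_response = oracle_complete(
        "Answer the instruction below as well as possible.\n"
        f"Instruction: {better_instruction}\n"
        f"(Original response, for reference: {response})"
    )
    return better_instruction, better_response
```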