Related papers: Evaluating the Robustness to Instructions of Large Language Models

URL: http://arxiv.org/abs/2308.14306v3
Date: Mon, 27 Nov 2023 17:43:20 GMT
Title: Evaluating the Robustness to Instructions of Large Language Models
Authors: Yuansheng Ni, Sichao Jiang, Xinyu Wu, Hui Shen, Yuli Zhou
Abstract summary: Fine-tuning Large Language Models (LLMs) can boost their zero-shot capabilities on novel tasks. We evaluate six models including Alpaca, Vicuna, WizardLM, and Traditional Task-oriented Models (Flan-T5-XL/XXL, T0++). We find that the robustness of different scales of FLAN-T5 models to RE instruction is worse than the robustness to QA instruction.
Score: 6.947956990248856
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, instruction fine-tuning has risen to prominence as a potential method for enhancing the zero-shot capabilities of Large Language Models (LLMs) on novel tasks. This technique has shown an exceptional ability to boost the performance of moderately sized LLMs, sometimes even reaching performance levels comparable to those of much larger model variants. The focus is on the robustness of instruction-tuned LLMs to seen and unseen tasks. We conducted an exploration of six models including Alpaca, Vicuna, WizardLM, and Traditional Task-oriented Models (Flan-T5-XL/XXL, T0++) using real-world relation extraction datasets as case studies. We carried out a comprehensive evaluation of these instruction-following LLMs which have been tuned based on open-domain instructions and task-oriented instructions. The main discussion is their performance and robustness towards instructions. We have observed that in most cases, the model's performance in dealing with unfamiliar instructions tends to worsen significantly, and the robustness of the model for RE instructions deteriorates compared to QA. Further, we discovered that up until a certain parameter size threshold (3B), the performance of the FLAN-T5 model improves as the parameter count increases. The robustness of different scales of FLAN-T5 models to RE instruction is worse than the robustness to QA instruction.

Related papers

How well do Large Language Models Recognize Instructional Moves? Establishing Baselines for Foundation Models in Educational Discourse [0.15469452301122177]
Large language models (LLMs) are increasingly adopted in educational technologies for a variety of tasks. We compared six LLMs to estimate their baseline performance on a simple but important task: classifying instructional moves in classroom transcripts. We found that while zero-shot performance was moderate, providing comprehensive examples significantly improved performance for state-of-the-art models.
arXiv Detail & Related papers (2025-12-22T22:08:32Z)
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study [8.827173113748701]
We study character- and word-level edits of task-specific instructions, which substantially degrade downstream performance. We find that, on average, self-denoising achieves substantially higher performance gains than alternative strategies.
arXiv Detail & Related papers (2025-04-03T16:17:56Z)
Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks [0.0]
Large Language Models (LLMs) have significantly advanced Natural Language Processing (NLP). This study evaluates the continual fine-tuning of various open-source LLMs on key NLU tasks. Our results indicate that models such as Phi-3.5-mini exhibit minimal forgetting while maintaining strong learning capabilities.
arXiv Detail & Related papers (2025-04-01T23:06:55Z)
The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities [51.594836904623534]
We investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. We show that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve.
arXiv Detail & Related papers (2025-01-15T10:57:55Z)
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z)
Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains. We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models [48.455388608863785]
We introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following tasks. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rules). More recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness.
arXiv Detail & Related papers (2024-06-28T15:34:26Z)
RoCoIns: Enhancing Robustness of Large Language Models through Code-Style Instructions [43.19966425619236]
We utilize instructions in code style, which are more structural and less ambiguous, to replace typically natural language instructions. Under few-shot scenarios, we propose a novel method to compose in-context demonstrations using both clean and adversarial samples. Experiments on eight robustness datasets show that our method consistently outperforms prompting LLMs with natural language instructions.
arXiv Detail & Related papers (2024-02-26T09:30:55Z)
Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately [2.1715455600756646]
Large Language Models (LLMs) generate responses to questions. Their effectiveness is often hindered by sub-optimal quality of answers and occasional failures to provide accurate responses to questions. To address these challenges, a fine-tuning process is employed, involving feedback and examples to refine models.
arXiv Detail & Related papers (2024-01-27T00:18:07Z)
BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS). We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting. Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
Tuna: Instruction Tuning using Feedback from Large Language Models [74.04950416204551]
We propose finetuning an instruction-tuned large language model using our novel probabilistic ranking and contextual ranking approaches. Probabilistic ranking enables the instruction-tuned model to inherit the relative rankings of high-quality and low-quality responses from the teacher LLM. On the other hand, learning with contextual ranking allows the model to refine its own response distribution using the contextual understanding ability of stronger LLMs.
arXiv Detail & Related papers (2023-10-20T09:55:06Z)
Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization. We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences.
arXiv Detail & Related papers (2023-08-23T12:36:57Z)
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models. Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models [32.95155349925248]
We propose a modular paradigm ReWOO that detaches the reasoning process from external observations, thus significantly reducing token consumption. We show that ReWOO achieves 5x token efficiency and 4% accuracy improvement on HotpotQA, a multi-step reasoning benchmark. Our illustrative work offloads reasoning ability from 175B GPT3.5 into 7B LLaMA, demonstrating the significant potential for truly efficient and scalable ALM systems.
arXiv Detail & Related papers (2023-05-23T00:16:48Z)
Aligning Instruction Tasks Unlocks Large Language Models as Zero-Shot Relation Extractors [11.28397947587596]
Fine-tuning large language models (LLMs) on large-scale instruction-following datasets substantially improves their performance on a wide range of NLP tasks. However, even advanced instruction-tuned LLMs still fail to outperform small LMs on relation extraction (RE). We propose QA4RE, a framework that aligns RE with question answering (QA), a predominant task in instruction-tuning datasets.
arXiv Detail & Related papers (2023-05-18T17:48:03Z)
Scaling Instruction-Finetuned Language Models [126.4789306516927]
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance. We find that instruction finetuning dramatically improves performance on a variety of model classes.
arXiv Detail & Related papers (2022-10-20T16:58:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.