When does In-context Learning Fall Short and Why? A Study on
Specification-Heavy Tasks
- URL: http://arxiv.org/abs/2311.08993v1
- Date: Wed, 15 Nov 2023 14:26:30 GMT
- Title: When does In-context Learning Fall Short and Why? A Study on
Specification-Heavy Tasks
- Authors: Hao Peng, Xiaozhi Wang, Jianhui Chen, Weikai Li, Yunjia Qi, Zimu Wang,
Zhili Wu, Kaisheng Zeng, Bin Xu, Lei Hou, Juanzi Li
- Abstract summary: In-context learning (ICL) has become the default method for using large language models (LLMs).
We find that ICL falls short of handling specification-heavy tasks, which are tasks with complicated and extensive task specifications.
We identify three primary reasons: inability to specifically understand context, misalignment in task schema comprehension with humans, and inadequate long-text understanding ability.
- Score: 54.71034943526973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context learning (ICL) has become the default method for using large
language models (LLMs), making the exploration of its limitations and
understanding the underlying causes crucial. In this paper, we find that ICL
falls short of handling specification-heavy tasks, which are tasks with
complicated and extensive task specifications, requiring several hours for
ordinary humans to master, such as traditional information extraction tasks.
The performance of ICL on these tasks mostly cannot reach half of the
state-of-the-art results. To explore the reasons behind this failure, we
conduct comprehensive experiments on 18 specification-heavy tasks with various
LLMs and identify three primary reasons: inability to specifically understand
context, misalignment in task schema comprehension with humans, and inadequate
long-text understanding ability. Furthermore, we demonstrate that through
fine-tuning, LLMs can achieve decent performance on these tasks, indicating
that the failure of ICL is not an inherent flaw of LLMs, but rather a drawback
of existing alignment methods that renders LLMs incapable of handling
complicated specification-heavy tasks via ICL. To substantiate this, we perform
dedicated instruction tuning on LLMs for these tasks and observe a notable
improvement. We hope the analyses in this paper could facilitate advancements
in alignment methods enabling LLMs to meet more sophisticated human demands.
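To make concrete what "specification-heavy" means in practice, the sketch below assembles an ICL prompt for an event-extraction task from a long annotation guideline plus demonstrations. This is a minimal illustration under assumptions, not the paper's actual benchmarks or prompts: the guideline text is invented, and query_llm is a hypothetical stand-in for any chat-completion API.

```python
# Minimal sketch of a specification-heavy ICL prompt for information
# extraction. The guideline and demonstration are illustrative only;
# real annotation guidelines run to dozens of pages of such rules.

GUIDELINE = """You are an event-extraction annotator.
Event type 'Attack': a violent physical act causing harm or damage.
  - Trigger: the single word that most clearly expresses the act.
  - Arguments: Attacker, Target, Instrument, Place.
  - Do not mark planned or hypothetical attacks as events.
... (many more event types and edge-case rules) ...
"""

DEMONSTRATIONS = [
    ("Rebels bombed the convoy near Kabul.",
     "Attack | trigger=bombed | Attacker=Rebels | "
     "Target=the convoy | Place=near Kabul"),
]

def build_prompt(text: str) -> str:
    """Concatenate the full specification, demonstrations, and the query."""
    demos = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in DEMONSTRATIONS)
    return f"{GUIDELINE}\n{demos}\n\nInput: {text}\nOutput:"

# `query_llm` is a hypothetical stand-in for any chat-completion API:
# prediction = query_llm(build_prompt("Protesters torched a police van."))
print(build_prompt("Protesters torched a police van."))
```

Per the abstract, prompts of this shape mostly stay below half of state-of-the-art performance across the 18 tasks studied, while fine-tuning on the same tasks recovers decent performance.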
Related papers
- The LLM Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems [28.72485319617863]
LLMs struggle with some basic tasks that humans find trivial to handle, e.g., counting the number of 'r' characters in the word 'strawberry' (a minimal sketch of this task appears after this list).
We measure transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks.
Compared with strategies such as finetuning and in-context learning, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks.
arXiv Detail & Related papers (2024-10-18T04:17:16Z)
- Supervised Fine-Tuning Achieve Rapid Task Adaption Via Alternating Attention Head Activation Patterns [47.57912649802414]
We study how SFT adapts LLMs to downstream tasks through the lens of attention patterns.
We find that (1) LLMs selectively activate task-specific attention heads during SFT; (2) activation patterns for complex tasks are combinations of basic task patterns; and (3) changes in a few parameters can significantly impact activation patterns after SFT on a small number of samples.
arXiv Detail & Related papers (2024-09-24T07:34:50Z)
- Defining Boundaries: A Spectrum of Task Feasibility for Large Language Models [6.008311204104302]
Large language models (LLMs) have shown remarkable performance in various tasks but often fail to handle queries that exceed their knowledge and capabilities.
This paper addresses the need for LLMs to recognize and refuse infeasible tasks whose required skills surpass their capabilities.
arXiv Detail & Related papers (2024-08-11T22:58:23Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens, designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with large language models (LLMs).
We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions.
Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Multi-Task Instruction Tuning of LLaMa for Specific Scenarios: A Preliminary Study on Writing Assistance [60.40541387785977]
Small foundational models can display remarkable proficiency in tackling diverse tasks when fine-tuned using instruction-driven data.
In this work, we investigate a practical problem setting where the primary focus is on one or a few particular tasks rather than general-purpose instruction following.
Experimental results show that fine-tuning LLaMA on writing instruction data significantly improves its ability on writing tasks.
arXiv Detail & Related papers (2023-05-22T16:56:44Z)
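As a concrete footnote to the word-based counting entry above (The LLM Genius Paradox), the ground truth for such tasks is one line of Python; the reasoning-style prompt below is an invented illustration of the strategy that paper reports as most robust, not a prompt taken from it.

```python
# Ground truth for the counting task is trivial to compute in code.
word, letter = "strawberry", "r"
ground_truth = word.count(letter)  # 3

# An illustrative reasoning-style prompt: ask the model to spell the
# word out and keep a running tally rather than answer in one shot.
prompt = (
    f"Spell '{word}' one letter at a time. After each letter, say whether "
    f"it is '{letter}' and keep a running count. Finally state the count."
)
print(f"expected answer: {ground_truth}")
print(prompt)
```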