Related papers: Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models

Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models

URL: http://arxiv.org/abs/2410.13343v1
Date: Thu, 17 Oct 2024 08:52:52 GMT
Title: Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models
Authors: Yu Yuan, Lili Zhao, Kai Zhang, Guangting Zheng, Qi Liu,
Abstract summary: Large Language Models (LLMs) have shown remarkable capabilities in various natural language processing tasks. This paper presents Shortcut Suite, a test suite designed to evaluate the impact of shortcuts on LLMs' performance.
Score: 9.854718405054589
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in various natural language processing tasks. However, LLMs may rely on dataset biases as shortcuts for prediction, which can significantly impair their robustness and generalization capabilities. This paper presents Shortcut Suite, a comprehensive test suite designed to evaluate the impact of shortcuts on LLMs' performance, incorporating six shortcut types, five evaluation metrics, and four prompting strategies. Our extensive experiments yield several key findings: 1) LLMs demonstrate varying reliance on shortcuts for downstream tasks, significantly impairing their performance. 2) Larger LLMs are more likely to utilize shortcuts under zero-shot and few-shot in-context learning prompts. 3) Chain-of-thought prompting notably reduces shortcut reliance and outperforms other prompting strategies, while few-shot prompts generally underperform compared to zero-shot prompts. 4) LLMs often exhibit overconfidence in their predictions, especially when dealing with datasets that contain shortcuts. 5) LLMs generally have a lower explanation quality in shortcut-laden datasets, with errors falling into three types: distraction, disguised comprehension, and logical fallacy. Our findings offer new insights for evaluating robustness and generalization in LLMs and suggest potential directions for mitigating the reliance on shortcuts. The code is available at \url {https://github.com/yyhappier/ShortcutSuite.git}.

Related papers

SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks. We show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SPECTOOL to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems [28.72485319617863]
LLMs struggle with some basic tasks that humans find trivial to handle, e.g., counting the number of character r's in the wordstrawberry. We measure transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks. Compared with strategies such as finetuning and in-context learning, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks.
arXiv Detail & Related papers (2024-10-18T04:17:16Z)
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach [0.0]
Large Language Models (LLMs) produce inaccurate outputs, also known as hallucinations. This paper introduces a supervised learning approach employing only four numerical features derived from tokens and vocabulary probabilities obtained from other evaluators. The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks.
arXiv Detail & Related papers (2024-05-30T03:00:47Z)
Reasoning on Efficient Knowledge Paths:Knowledge Graph Guides Large Language Model for Domain Question Answering [18.94220625114711]
Large language models (LLMs) perform surprisingly well and outperform human experts on many tasks. This paper integrates and optimized a pipeline for selecting reasoning paths from KG based on LLM. We also propose a simple and effective subgraph retrieval method based on chain of thought (CoT) and page rank.
arXiv Detail & Related papers (2024-04-16T08:28:16Z)
Guiding LLM to Fool Itself: Automatically Manipulating Machine Reading Comprehension Shortcut Triggers [76.77077447576679]
Shortcuts, mechanisms triggered by features spuriously correlated to the true label, has emerged as a potential threat to Machine Reading (MRC) systems. We introduce a framework that guides an editor to add potential shortcuts-triggers to samples. Using GPT4 as the editor, we find it can successfully edit trigger shortcut in samples that fool LLMs.
arXiv Detail & Related papers (2023-10-24T12:37:06Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
Large Language Models Can be Lazy Learners: Analyze Shortcuts in In-Context Learning [28.162661418161466]
Large language models (LLMs) have recently shown great potential for in-context learning. This paper investigates the reliance of LLMs on shortcuts or spurious correlations within prompts. We uncover a surprising finding that larger models are more likely to utilize shortcuts in prompts during inference.
arXiv Detail & Related papers (2023-05-26T20:56:30Z)
Shortcut Learning of Large Language Models in Natural Language Understanding [119.45683008451698]
Large language models (LLMs) have achieved state-of-the-art performance on a series of natural language understanding tasks. They might rely on dataset bias and artifacts as shortcuts for prediction. This has significantly affected their generalizability and adversarial robustness.
arXiv Detail & Related papers (2022-08-25T03:51:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.