Noisy Exemplars Make Large Language Models More Robust: A
Domain-Agnostic Behavioral Analysis
- URL: http://arxiv.org/abs/2311.00258v1
- Date: Wed, 1 Nov 2023 03:15:05 GMT
- Title: Noisy Exemplars Make Large Language Models More Robust: A
Domain-Agnostic Behavioral Analysis
- Authors: Hongyi Zheng, Abulhair Saparov
- Abstract summary: We introduce a systematic approach to test the robustness of large language models (LLMs) in multi-hop reasoning tasks via domain-agnostic perturbations.
We find that models are more sensitive to certain perturbations such as replacing words with their synonyms.
We also demonstrate that increasing the proportion of perturbed exemplars in the prompts improves the robustness of few-shot prompting methods.
- Score: 10.06218778776515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in prompt engineering enable large language models (LLMs) to
solve multi-hop logical reasoning problems with impressive accuracy. However,
there is little existing work investigating the robustness of LLMs with
few-shot prompting techniques. Therefore, we introduce a systematic approach to
test the robustness of LLMs in multi-hop reasoning tasks via domain-agnostic
perturbations. We include perturbations at multiple levels of abstraction
(e.g., lexical perturbations such as typos, and semantic perturbations such as
the inclusion of intermediate reasoning steps in the questions) to conduct
behavioral analysis on the LLMs. Throughout our experiments, we find that
models are more sensitive to certain perturbations such as replacing words with
their synonyms. We also demonstrate that increasing the proportion of perturbed
exemplars in the prompts improves the robustness of few-shot prompting methods.
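
As a rough, illustrative sketch of this setup (the helper names and the toy synonym table below are ours, not the authors' code), the snippet applies simple domain-agnostic perturbations to a chosen fraction of few-shot exemplars before assembling the prompt:

```python
# Minimal sketch (not the authors' implementation) of perturbing a fraction of
# few-shot exemplars before building the prompt.
import random

def add_typo(text: str, rng: random.Random) -> str:
    """Lexical perturbation: swap two adjacent characters."""
    if len(text) < 3:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

# Toy synonym table for illustration; a real study would use a proper lexicon.
SYNONYMS = {"big": "large", "fast": "quick", "smart": "clever"}

def replace_synonyms(text: str) -> str:
    """Lexical perturbation: swap known words for a synonym."""
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

def build_prompt(exemplars, question, perturb_ratio=0.5, seed=0):
    """Perturb `perturb_ratio` of the exemplars, then format a few-shot prompt."""
    rng = random.Random(seed)
    n_perturb = int(round(perturb_ratio * len(exemplars)))
    perturbed_ids = set(rng.sample(range(len(exemplars)), n_perturb))
    blocks = []
    for i, (q, a) in enumerate(exemplars):
        if i in perturbed_ids:
            q = add_typo(replace_synonyms(q), rng)
        blocks.append(f"Q: {q}\nA: {a}")
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

if __name__ == "__main__":
    shots = [
        ("If Alice is fast and Bob is smart, who is smart?", "Bob"),
        ("Every cat is an animal. Tom is a cat. Is Tom an animal?", "Yes"),
    ]
    print(build_prompt(shots, "Every dog barks. Rex is a dog. Does Rex bark?"))
```

Sweeping perturb_ratio from 0 to 1 is the kind of experiment the abstract refers to when it reports that a higher proportion of perturbed exemplars improves robustness.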
Related papers
- Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting [40.78026627009521]
Reinforcement learning (RL) is a promising approach for aligning the knowledge of large language models (LLMs) with sequential decision-making tasks.
We propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment.
arXiv Detail & Related papers (2024-10-25T18:25:35Z)
- Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning [52.83539473110143]
We introduce a novel structure-oriented analysis method to help Large Language Models (LLMs) better understand a question.
To further improve reliability in complex question-answering tasks, we propose a multi-agent reasoning system, Structure-oriented Autonomous Reasoning Agents (SARA).
Extensive experiments verify the effectiveness of the proposed reasoning system. Surprisingly, in some cases, the system even surpasses few-shot methods.
arXiv Detail & Related papers (2024-10-18T05:30:33Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models [12.112914393948415]
We present RUPBench, a benchmark designed to evaluate large language models (LLMs) across diverse reasoning tasks.
Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning.
By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns.
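
A benchmark of this kind ultimately compares accuracy before and after perturbation; the toy helper below (hypothetical, not RUPBench's evaluation code) shows the basic robustness measure:

```python
# Toy robustness measure: accuracy drop between original and perturbed inputs.
def accuracy(predictions, references):
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def robustness_drop(preds_original, preds_perturbed, references):
    """Absolute accuracy lost under perturbation (larger = less robust)."""
    return accuracy(preds_original, references) - accuracy(preds_perturbed, references)

# Example: the model misses one extra item once the questions are perturbed.
print(robustness_drop(["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"]))  # ~0.33
```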
arXiv Detail & Related papers (2024-06-16T17:26:44Z)
- Resilience of Large Language Models for Noisy Instructions [38.25524275497566]
Large language models (LLMs) have emerged as powerful tools for interpreting human commands and generating text across various tasks.
This study investigates the resilience of LLMs against five common types of disruptions including ASR (Automatic Speech Recognition) errors, OCR (Optical Character Recognition) errors, grammatical mistakes, and distractive content.
Our findings reveal that while some LLMs show a degree of resistance to certain types of noise, their overall performance significantly suffers.
arXiv Detail & Related papers (2024-04-15T12:55:08Z)
- A Thorough Examination of Decoding Methods in the Era of LLMs [72.65956436513241]
Decoding methods play an indispensable role in converting language models from next-token predictors into practical task solvers.
This paper provides a comprehensive and multifaceted analysis of various decoding methods within the context of large language models.
Our findings reveal that decoding method performance is notably task-dependent and influenced by factors such as alignment, model size, and quantization.
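
As a concrete reference for what "decoding methods" means in practice, the snippet below compares a few standard strategies with the Hugging Face transformers API on a small model; it is a generic sketch, not the paper's evaluation protocol:

```python
# Compare decoding strategies on one prompt (generic illustration only).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The capital of France is", return_tensors="pt")

configs = {
    "greedy":       dict(do_sample=False),
    "beam_search":  dict(do_sample=False, num_beams=4),
    "nucleus_p0.9": dict(do_sample=True, top_p=0.9, temperature=0.7),
}
for name, kwargs in configs.items():
    out = model.generate(**inputs, max_new_tokens=20, **kwargs)
    print(name, "->", tokenizer.decode(out[0], skip_special_tokens=True))
```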
arXiv Detail & Related papers (2024-02-10T11:14:53Z)
- Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning [10.86233584217013]
Previous methods fail to address reasoning errors in intermediate steps, leading to accumulative errors.
We propose Deductive Beam Search (DBS), which seamlessly integrates chain-of-thought reasoning with step-wise beam search for Large Language Models.
Our approach deploys a verifier, verifying the deducibility of a reasoning step and its premises, thus alleviating the error accumulation.
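
Schematically, verifier-guided step-wise beam search looks like the sketch below; propose_steps and verifier_score are hypothetical stand-ins for an LLM that drafts candidate reasoning steps and a verifier that scores their deducibility, not the paper's implementation:

```python
# Schematic verifier-guided step-wise beam search over reasoning chains.
def deductive_beam_search(question, propose_steps, verifier_score,
                          beam_width=3, max_steps=5):
    beams = [([], 0.0)]  # (partial chain of steps, cumulative verifier score)
    for _ in range(max_steps):
        candidates = []
        for chain, score in beams:
            for step in propose_steps(question, chain):
                candidates.append((chain + [step],
                                   score + verifier_score(question, chain, step)))
        if not candidates:
            break
        # Keep only the top-scoring partial chains; likely non-deducible steps
        # are pruned, limiting error accumulation across steps.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # best-scoring reasoning chain
```

The point of the sketch is that the verifier score, rather than raw model likelihood, decides which partial chains survive each expansion, which is how accumulated errors are kept in check.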
arXiv Detail & Related papers (2024-01-31T09:16:35Z)
- MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models [64.70153487607172]
Language Models (LMs) have shown impressive performance in various natural language tasks.
When it comes to natural language reasoning, LMs still face challenges such as hallucination, generating incorrect intermediate reasoning steps, and making mathematical errors.
Recent research has focused on enhancing LMs through self-improvement using feedback.
In this work, we propose Multi-Aspect Feedback, an iterative refinement framework that integrates multiple feedback modules, including frozen LMs and external tools, each focusing on a specific error category.
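
At its core, this kind of multi-aspect refinement is a critique-and-revise loop in which each feedback module checks one error category; the loop below is our simplification, not the MAF implementation:

```python
# Simplified critique-and-revise loop with per-aspect feedback modules.
def refine(draft, feedback_modules, reviser, max_rounds=3):
    """feedback_modules: callables draft -> critique string, or None if no issue.
    reviser: callable (draft, critiques) -> revised draft, e.g. another LLM call."""
    for _ in range(max_rounds):
        critiques = [fb(draft) for fb in feedback_modules]
        critiques = [c for c in critiques if c]   # keep only actual complaints
        if not critiques:                         # every aspect checks out
            break
        draft = reviser(draft, critiques)         # rewrite using the feedback
    return draft
```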
arXiv Detail & Related papers (2023-10-19T02:32:39Z)
- Diversity of Thought Improves Reasoning Abilities of LLMs [26.149914503910235]
Large language models (LLMs) are documented to struggle in settings that require complex reasoning.
We discuss how one can create and leverage variations of the input prompt as a means of diversity of thought.
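
One simple reading of this idea, sketched below with a hypothetical ask_llm callable, is to rephrase the same question with different reasoning styles and aggregate the answers by majority vote; the paper's actual method is more elaborate:

```python
# Prompt-variation + majority-vote sketch of "diversity of thought".
from collections import Counter

def diverse_prompts(question):
    styles = ["Think step by step.",
              "Solve this by working backwards from the answer.",
              "List the relevant facts first, then conclude."]
    return [f"{question}\n{style}" for style in styles]

def answer_with_diversity(question, ask_llm):
    """ask_llm: hypothetical callable mapping a prompt to a short answer string."""
    answers = [ask_llm(p) for p in diverse_prompts(question)]
    return Counter(answers).most_common(1)[0][0]  # majority answer
```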
arXiv Detail & Related papers (2023-10-11T00:01:41Z)
- Faith and Fate: Limits of Transformers on Compositionality [109.79516190693415]
We investigate the limits of transformer large language models across three representative compositional tasks.
These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer.
Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching.
arXiv Detail & Related papers (2023-05-29T23:24:14Z)
- Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models [81.01397924280612]
Large language models (LLMs) can achieve highly effective performance on various reasoning tasks by incorporating step-by-step chain-of-thought (CoT) prompting as demonstrations.
We introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts Prompting), an iterative bootstrapping approach for selecting exemplars and generating reasoning chains.
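
At a high level, bootstrapping exemplars means repeatedly generating chains, keeping those that reach the gold answer, and regenerating with the improved demonstration set; the loop below is a loose sketch with hypothetical callables, not the Iter-CoT procedure:

```python
# Loose sketch of iterative exemplar bootstrapping for CoT prompting.
def bootstrap_exemplars(pool, generate_chain, reaches_gold, rounds=2):
    """pool: list of (question, gold_answer) pairs.
    generate_chain: callable (question, exemplars) -> reasoning chain string.
    reaches_gold: callable (chain, gold_answer) -> bool."""
    exemplars = []
    for _ in range(rounds):
        kept = []
        for question, gold in pool:
            chain = generate_chain(question, exemplars)  # condition on current exemplars
            if reaches_gold(chain, gold):                # keep chains ending in the gold answer
                kept.append((question, chain))
        exemplars = kept  # later rounds regenerate with the improved exemplar set
    return exemplars
```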
arXiv Detail & Related papers (2023-04-23T13:54:39Z)