Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models
- URL: http://arxiv.org/abs/2508.01554v1
- Date: Sun, 03 Aug 2025 02:46:30 GMT
- Title: Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models
- Authors: Yujia Zheng, Tianhao Li, Haotian Huang, Tianyu Zeng, Jingyu Lu, Chuangxin Chu, Yuekai Huang, Ziyou Jiang, Qian Xiong, Yuyao Ge, Mingyang Li
- Abstract summary: We introduce PromptAnatomy, an automated framework that dissects prompts into functional components. We generate adversarial examples by selectively perturbing each component using our proposed method, ComPerturb. As a complementary resource, we annotate four public instruction-tuning datasets using the PromptAnatomy framework.
- Score: 11.625319498017733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompt-based adversarial attacks have become an effective means to assess the robustness of large language models (LLMs). However, existing approaches often treat prompts as monolithic text, overlooking their structural heterogeneity: different prompt components contribute unequally to adversarial robustness. Prior works like PromptRobust assume prompts are value-neutral, but our analysis reveals that complex, domain-specific prompts with rich structures have components with differing vulnerabilities. To address this gap, we introduce PromptAnatomy, an automated framework that dissects prompts into functional components and generates diverse, interpretable adversarial examples by selectively perturbing each component using our proposed method, ComPerturb. To ensure linguistic plausibility and mitigate distribution shifts, we further incorporate a perplexity (PPL)-based filtering mechanism. As a complementary resource, we annotate four public instruction-tuning datasets using the PromptAnatomy framework, verified through human review. Extensive experiments across these datasets and five advanced LLMs demonstrate that ComPerturb achieves state-of-the-art attack success rates. Ablation studies validate the complementary benefits of prompt dissection and PPL filtering. Our results underscore the importance of prompt structure awareness and controlled perturbation for reliable adversarial robustness evaluation in LLMs. Code and data are available at https://github.com/Yujiaaaaa/PACP.
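The abstract's perplexity (PPL) filter lends itself to a short illustration. Below is a minimal sketch of such a filter, assuming a GPT-2 scorer from Hugging Face transformers; the model choice and the `max_ppl` threshold are illustrative assumptions, not the authors' configuration.

```python
# Hypothetical PPL filter: keep only adversarial candidates whose
# perplexity under a reference LM stays below a threshold.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference LM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels supplied, the model returns the mean token NLL as `loss`.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def ppl_filter(candidates: list[str], max_ppl: float = 200.0) -> list[str]:
    """Discard perturbed prompts that no longer read as plausible text."""
    return [c for c in candidates if perplexity(c) <= max_ppl]
```

In this reading, a perturbation that degrades a prompt into gibberish inflates its PPL and is filtered out, which is what keeps the surviving adversarial examples close to the natural-language distribution.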
Related papers
- Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations [33.04242471060053]
Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. No comprehensive study has yet established whether a model's probing accuracy reliably predicts its downstream syntactic performance.
arXiv Detail & Related papers (2025-06-20T01:46:50Z) - Helping Large Language Models Protect Themselves: An Enhanced Filtering and Summarization System [2.0257616108612373]
Large Language Models are vulnerable to adversarial attacks, manipulative prompts, and encoded malicious inputs. This study presents a unique defense paradigm that allows LLMs to recognize, filter, and defend against adversarial or malicious inputs on their own.
arXiv Detail & Related papers (2025-05-02T14:42:26Z) - Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations [65.11348389219887]
We introduce Dialectic-RAG (DRAG), a modular approach that evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives. We show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models.
arXiv Detail & Related papers (2025-04-07T06:55:15Z) - Do LLMs suffer from Multi-Party Hangover? A Diagnostic Approach to Addressee Recognition and Response Selection in Conversations [11.566214724241798]
We propose a methodological pipeline to investigate model performance across specific structural attributes of conversations.
We focus on Response Selection and Addressee Recognition tasks, to diagnose model weaknesses.
Results show that response selection relies more on the textual content of conversations, while addressee recognition requires capturing their structural dimension.
arXiv Detail & Related papers (2024-09-27T10:07:33Z) - Enhancing LLM's Cognition via Structurization [41.13997892843677]
Large language models (LLMs) process input contexts from a causal, sequential perspective.
This paper presents a novel concept of context structurization.
Specifically, we transform the plain, unordered contextual sentences into well-ordered and hierarchically structurized elements.
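As a purely illustrative reading of structurization, the re-ordered context could be represented as a small hierarchy like the one below; the scope/aspect/description field names are hypothetical, not the paper's schema.

```python
# Hypothetical structurized-context record: plain sentences grouped into
# a scope, a set of aspects, and per-aspect descriptions.
from dataclasses import dataclass, field

@dataclass
class Aspect:
    name: str
    descriptions: list[str] = field(default_factory=list)

@dataclass
class StructurizedContext:
    scope: str                       # one-line gist of the whole context
    aspects: list[Aspect] = field(default_factory=list)

    def render(self) -> str:
        """Linearize the hierarchy back into well-ordered prompt text."""
        lines = [f"Scope: {self.scope}"]
        for a in self.aspects:
            lines.append(f"- {a.name}")
            lines.extend(f"  * {d}" for d in a.descriptions)
        return "\n".join(lines)
```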
arXiv Detail & Related papers (2024-07-23T12:33:58Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
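A schematic of attribute-wise evaluation in this spirit appears below; the attribute names and the keyword-matching extractor are invented placeholders (in practice an LLM prompt would extract each attribute), not the AS implementation.

```python
# Hypothetical attribute-structured evaluation: extract each clinical
# attribute from reference and candidate summaries, then score them
# attribute by attribute instead of as one holistic comparison.
from typing import Callable

ATTRIBUTES = ["diagnosis", "medications", "follow up"]   # illustrative names

def extract(summary: str, attribute: str) -> str:
    """Placeholder extractor: keep sentences mentioning the attribute."""
    return ". ".join(s for s in summary.split(".") if attribute in s.lower())

def structured_score(reference: str, candidate: str,
                     sim: Callable[[str, str], float]) -> float:
    """Average attribute-wise similarity instead of a single global score."""
    return sum(sim(extract(reference, a), extract(candidate, a))
               for a in ATTRIBUTES) / len(ATTRIBUTES)
```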
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
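One way to make the token-level idea concrete, sketched here with a GPT-2 scorer: flag tokens whose negative log-likelihood spikes far above the prompt's average. The z-score rule and threshold are assumptions for illustration, not the paper's detector.

```python
# Hypothetical token-level detector: score each token's NLL under a
# reference LM and flag outliers as suspect (e.g., injected gibberish).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def flag_suspect_tokens(prompt: str, z_thresh: float = 2.0) -> list[str]:
    enc = tokenizer(prompt, return_tensors="pt")
    ids = enc["input_ids"]
    with torch.no_grad():
        logits = model(**enc).logits
    # NLL of each token given its left context (positions shifted by one).
    nll = torch.nn.functional.cross_entropy(
        logits[0, :-1], ids[0, 1:], reduction="none"
    )
    z = (nll - nll.mean()) / (nll.std() + 1e-6)
    suspects = ids[0, 1:][z > z_thresh]
    return tokenizer.convert_ids_to_tokens(suspects.tolist())
```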
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts [76.18347405302728]
This study uses a broad suite of adversarial textual attacks targeting prompts across multiple levels: character, word, sentence, and semantic.
The adversarial prompts are then employed in diverse tasks including sentiment analysis, natural language inference, reading comprehension, machine translation, and math problem-solving.
Our findings demonstrate that contemporary Large Language Models are not robust to adversarial prompts.
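For concreteness, the character- and word-level ends of that attack spectrum can be as simple as the toy perturbations below; these are illustrative stand-ins, not PromptRobust's actual attack implementations.

```python
# Toy character- and word-level prompt perturbations (illustrative only).
import random

def char_perturb(prompt: str, rate: float = 0.05) -> str:
    """Randomly drop characters, mimicking typo-style attacks."""
    return "".join(c for c in prompt if random.random() > rate)

def word_perturb(prompt: str, swaps: dict[str, str]) -> str:
    """Replace words with attacker-chosen substitutes where a mapping exists."""
    return " ".join(swaps.get(w, w) for w in prompt.split())

print(word_perturb("summarize the following passage",
                   {"passage": "excerpt"}))
```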
arXiv Detail & Related papers (2023-06-07T15:37:00Z) - Improving the Robustness of Summarization Systems with Dual Augmentation [68.53139002203118]
A robust summarization system should be able to capture the gist of the document, regardless of the specific word choices or noise in the input.
We first explore the summarization models' robustness against perturbations including word-level synonym substitution and noise.
We propose SummAttacker, an efficient approach to generating adversarial samples based on language models.
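A hedged sketch of LM-driven word substitution in this spirit: mask one word and take a masked LM's top alternatives as candidate swaps. This is a generic reconstruction under that assumption, not the SummAttacker algorithm itself.

```python
# Generic masked-LM word substitution: mask one word, let BERT propose
# replacements, and return the resulting candidate variants.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def substitutions(sentence: str, target: str, k: int = 5) -> list[str]:
    masked = sentence.replace(target, tok.mask_token, 1)
    enc = tok(masked, return_tensors="pt")
    pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = mlm(**enc).logits
    top_ids = logits[0, pos].topk(k).indices.tolist()
    return [sentence.replace(target, tok.convert_ids_to_tokens(i), 1)
            for i in top_ids]
```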
arXiv Detail & Related papers (2023-06-01T19:04:17Z) - Enriching Relation Extraction with OpenIE [70.52564277675056]
Relation extraction (RE) is a sub-discipline of information extraction (IE).
In this work, we explore how recent approaches for open information extraction (OpenIE) may help to improve the task of RE.
Our experiments over two annotated corpora, KnowledgeNet and FewRel, demonstrate the improved accuracy of our enriched models.
arXiv Detail & Related papers (2022-12-19T11:26:23Z) - Estimating the Adversarial Robustness of Attributions in Text with Transformers [44.745873282080346]
We establish a novel definition of attribution robustness (AR) in text classification, based on Lipschitz continuity.
We then propose our novel TransformerExplanationAttack (TEA), a strong adversary that provides a tight estimate of attribution robustness in text classification.
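Read as Lipschitz continuity, attribution robustness can be estimated empirically as the worst-case ratio of attribution change to input change over a set of nearby perturbations. The sketch below is one plausible formalization with the distance functions left abstract, not the paper's exact definition.

```python
# Empirical Lipschitz-style estimate of attribution robustness (AR):
# max over perturbed inputs x' of d_attr(attr(x), attr(x')) / d_text(x, x').
from typing import Callable, Sequence

def attribution_robustness(
    x: str,
    perturbations: Sequence[str],
    attr: Callable[[str], list[float]],      # attribution method under test
    d_attr: Callable[[list[float], list[float]], float],
    d_text: Callable[[str, str], float],
) -> float:
    a_x = attr(x)
    return max(
        d_attr(a_x, attr(xp)) / max(d_text(x, xp), 1e-9)
        for xp in perturbations
    )
```

A large value means a small, meaning-preserving edit can drastically reorder the attribution, which is the brittleness such an adversary is built to expose.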
arXiv Detail & Related papers (2022-12-18T20:18:59Z) - BERT-ASC: Auxiliary-Sentence Construction for Implicit Aspect Learning in Sentiment Analysis [4.522719296659495]
This paper proposes a unified framework to address aspect categorization and aspect-based sentiment subtasks.
We introduce a mechanism to construct an auxiliary sentence for the implicit aspect using the corpus's semantic information.
We then encourage BERT to learn aspect-specific representations in response to this auxiliary sentence, rather than to the aspect itself.
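Mechanically, pairing the review with the constructed auxiliary sentence is standard BERT sentence-pair encoding, as in the sketch below; the review and auxiliary text are invented for illustration.

```python
# Encode (review, auxiliary sentence) as a BERT sentence pair so the
# [CLS] representation can be trained for aspect-specific sentiment.
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

review = "The staff forgot us, but the pasta arrived hot."
aux = "food quality of the pasta"              # hypothetical auxiliary sentence

enc = tok(review, aux, return_tensors="pt")    # [CLS] review [SEP] aux [SEP]
cls = bert(**enc).last_hidden_state[:, 0]      # feed this to a sentiment head
```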
arXiv Detail & Related papers (2022-03-22T13:12:27Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)