A Closer Look at System Prompt Robustness
- URL: http://arxiv.org/abs/2502.12197v1
- Date: Sat, 15 Feb 2025 18:10:45 GMT
- Title: A Closer Look at System Prompt Robustness
- Authors: Norman Mu, Jonathan Lu, Michael Lavery, David Wagner
- Abstract summary: Developers depend on system prompts to specify important context, output format, personalities, guardrails, content policies, and safety countermeasures.
In practice, models often forget to consider relevant guardrails or fail to resolve conflicting demands between the system and the user.
We create realistic new evaluation and fine-tuning datasets based on prompts collected from OpenAI's GPT Store and HuggingFace's HuggingChat.
- Score: 2.5525497052179995
- Abstract: System prompts have emerged as a critical control surface for specifying the behavior of LLMs in chat and agent settings. Developers depend on system prompts to specify important context, output format, personalities, guardrails, content policies, and safety countermeasures, all of which require models to robustly adhere to the system prompt, especially when facing conflicting or adversarial user inputs. In practice, models often forget to consider relevant guardrails or fail to resolve conflicting demands between the system and the user. In this work, we study various methods for improving system prompt robustness by creating realistic new evaluation and fine-tuning datasets based on prompts collected from OpenAI's GPT Store and HuggingFace's HuggingChat. Our experiments assessing models with a panel of new and existing benchmarks show that performance can be considerably improved with realistic fine-tuning data, as well as inference-time interventions such as classifier-free guidance. Finally, we analyze the results of recently released reasoning models from OpenAI and DeepSeek, which show exciting but uneven improvements on the benchmarks we study. Overall, current techniques fall short of ensuring system prompt robustness and further study is warranted.
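The abstract cites classifier-free guidance (CFG) as an inference-time intervention. Below is a minimal sketch of how CFG can be applied to a causal LM's next-token logits by contrasting runs with and without the system prompt; the model choice, the `gamma` scale, and the plain-text prompt concatenation are illustrative assumptions, not the paper's implementation.

```python
# Sketch: classifier-free guidance (CFG) applied to system prompts.
# Assumes a HuggingFace-style causal LM; `gamma` and all helper names
# are illustrative choices, not the paper's actual implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def cfg_next_token(system: str, user: str, gamma: float = 1.5) -> int:
    """Return the next token id under CFG-guided logits."""
    cond = tok(system + "\n" + user, return_tensors="pt").input_ids
    uncond = tok(user, return_tensors="pt").input_ids
    with torch.no_grad():
        logit_c = model(cond).logits[0, -1]    # with system prompt
        logit_u = model(uncond).logits[0, -1]  # without system prompt
    # Push the distribution toward behavior implied by the system prompt.
    guided = logit_u + gamma * (logit_c - logit_u)
    return int(guided.argmax())

print(tok.decode(cfg_next_token("Always answer in French.", "Hello!")))
```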
Related papers
- Has My System Prompt Been Used? Large Language Model Prompt Membership Inference [56.20586932251531]
We develop Prompt Detective, a statistical method to reliably determine whether a given system prompt was used by a third-party language model.
Our work reveals that even minor changes in system prompts manifest in distinct response distributions, enabling us to verify prompt usage with statistical significance (a toy permutation test of this kind is sketched after this list).
arXiv Detail & Related papers (2025-02-14T08:00:42Z) - Resilience to the Flowing Unknown: an Open Set Recognition Framework for Data Streams [6.7236795813629]
This work investigates the application of an Open Set Recognition framework that combines classification and clustering to address the "over-occupied space" problem in streaming scenarios.
arXiv Detail & Related papers (2024-10-31T11:06:54Z) - Hierarchical Reinforcement Learning for Temporal Abstraction of Listwise Recommendation [51.06031200728449]
We propose a novel framework called mccHRL that provides different levels of temporal abstraction for listwise recommendation.
Within the hierarchical framework, the high-level agent studies the evolution of user perception, while the low-level agent produces the item selection policy.
Results show significant performance improvements from our method compared with several well-known baselines.
arXiv Detail & Related papers (2024-09-11T17:01:06Z) - Analyzing Adversarial Inputs in Deep Reinforcement Learning [53.3760591018817]
We present a comprehensive characterization of adversarial inputs through the lens of formal verification.
We introduce a novel metric, the Adversarial Rate, to classify models based on their susceptibility to such perturbations (a toy version of such a metric is sketched after this list).
Our analysis empirically demonstrates how adversarial inputs can compromise the safety of a given DRL system.
arXiv Detail & Related papers (2024-02-07T21:58:40Z) - Improving the Robustness of Knowledge-Grounded Dialogue via Contrastive Learning [71.8876256714229]
We propose an entity-based contrastive learning framework for improving the robustness of knowledge-grounded dialogue systems (a generic contrastive loss of this kind is sketched after this list).
Our method achieves new state-of-the-art performance in terms of automatic evaluation scores.
arXiv Detail & Related papers (2024-01-09T05:16:52Z) - Robustness and Generalization Performance of Deep Learning Models on Cyber-Physical Systems: A Comparative Study [71.84852429039881]
The investigation focuses on the models' ability to handle a range of perturbations, such as sensor faults and noise.
We test the generalization and transfer learning capabilities of these models by exposing them to out-of-distribution (OOD) samples.
arXiv Detail & Related papers (2023-06-13T12:43:59Z) - Reliable and Interpretable Drift Detection in Streams of Short Texts [2.4603302139672008]
Data drift is one of the key factors behind the performance degradation of machine learning models over time.
We propose an end-to-end framework for reliable, model-agnostic change-point detection and interpretation in large task-oriented dialog systems (a minimal change-point check is sketched after this list).
arXiv Detail & Related papers (2023-05-28T15:14:54Z) - Interactive System-wise Anomaly Detection [66.3766756452743]
Anomaly detection plays a fundamental role in various applications.
It is challenging for existing methods to handle the scenarios where the instances are systems whose characteristics are not readily observed as data.
We develop an end-to-end approach which includes an encoder-decoder module that learns system embeddings.
arXiv Detail & Related papers (2023-04-21T02:20:24Z) - Model-Agnostic Few-Shot Open-Set Recognition [36.97433312193586]
We tackle the Few-Shot Open-Set Recognition (FSOSR) problem.
We focus on developing model-agnostic inference methods that can be plugged into any existing model.
We introduce an Open Set Transductive Information Maximization method (OSTIM).
arXiv Detail & Related papers (2022-06-18T16:27:59Z) - WSLRec: Weakly Supervised Learning for Neural Sequential Recommendation Models [24.455665093145818]
We propose a novel model-agnostic training approach called WSLRec, which adopts a three-stage framework: pre-training, top-$k$ mining, and fine-tuning.
WSLRec resolves the incompleteness problem by pre-training models on extra weak supervision from model-free methods like BR and ItemCF, while resolving the inaccuracy problem by leveraging top-$k$ mining to screen out reliable user-item relevance from the weak supervision for fine-tuning (the mining step is sketched after this list).
arXiv Detail & Related papers (2022-02-28T08:55:12Z) - Modeling Online Behavior in Recommender Systems: The Importance of Temporal Context [30.894950420437926]
We show how omitting temporal context when evaluating recommender system performance leads to false confidence.
We propose a training procedure to further embed the temporal context in existing models.
Results show that including our temporal objective can improve recall@20 by up to 20% (a recall@k computation under a temporal split is sketched after this list).
arXiv Detail & Related papers (2020-09-19T19:36:43Z)
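For the Prompt Detective entry above: a toy permutation test over response embeddings, illustrating how differing response distributions can be verified with statistical significance. The statistic, the synthetic embeddings, and all names here are assumptions rather than the paper's actual procedure.

```python
# Toy sketch of a Prompt-Detective-style check: do two sets of model
# responses come from the same distribution? All names are illustrative;
# the paper's actual statistic and features may differ.
import numpy as np

def permutation_test(a: np.ndarray, b: np.ndarray, n_perm: int = 2000,
                     seed: int = 0) -> float:
    """p-value for 'a and b share a distribution', via mean-embedding distance."""
    rng = np.random.default_rng(seed)
    stat = np.linalg.norm(a.mean(0) - b.mean(0))
    pooled = np.vstack([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        pa, pb = pooled[perm[:len(a)]], pooled[perm[len(a):]]
        count += np.linalg.norm(pa.mean(0) - pb.mean(0)) >= stat
    return count / n_perm

# Example with synthetic "response embeddings" (replace with real ones).
rng = np.random.default_rng(1)
emb_prompt_a = rng.normal(0.0, 1.0, size=(50, 16))
emb_prompt_b = rng.normal(0.2, 1.0, size=(50, 16))  # slightly shifted
print(permutation_test(emb_prompt_a, emb_prompt_b))  # small p => different prompts
```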
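For the Adversarial Rate entry above: one plausible reading of such a metric is the fraction of bounded input perturbations that flip a policy's chosen action. The definition and the linear policy stub below are illustrative, not the paper's formal metric.

```python
# Hedged sketch of an "Adversarial Rate"-style metric: the fraction of
# small input perturbations that change a policy's action. This is a
# plausible reading of the abstract, not the paper's formal definition.
import numpy as np

def adversarial_rate(policy, state: np.ndarray, eps: float = 0.05,
                     n_samples: int = 1000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    base_action = policy(state)
    flips = 0
    for _ in range(n_samples):
        delta = rng.uniform(-eps, eps, size=state.shape)  # bounded noise
        flips += policy(state + delta) != base_action
    return flips / n_samples

# Hypothetical linear policy over a 4-dim state, for illustration only.
W = np.array([[1.0, -0.5, 0.2, 0.0], [-0.3, 0.8, 0.0, 0.1]])
policy = lambda s: int(np.argmax(W @ s))
print(adversarial_rate(policy, np.array([0.1, 0.2, -0.1, 0.4])))
```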
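For the entity-based contrastive learning entry above: a generic InfoNCE loss, the building block behind many contrastive frameworks; the paper's entity-level construction of positives and negatives is not reproduced here.

```python
# Generic InfoNCE contrastive loss with in-batch negatives; the
# entity-based positive/negative construction from the paper is assumed
# to happen upstream of this function.
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """anchor, positive: [batch, dim]; other rows act as negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature   # [batch, batch] cosine similarities
    labels = torch.arange(len(a))    # matching pairs sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 64), torch.randn(8, 64))
print(float(loss))
```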
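For the drift detection entry above: a minimal, model-agnostic change-point check that compares a rolling window of per-item scores against a reference window; the z-score rule is an assumed stand-in for the paper's method.

```python
# Minimal change-point sketch in the spirit of drift detection over text
# streams: compare a recent window of per-item scores (e.g., embedding
# distance to a centroid) against a reference window. The thresholding
# rule here is an assumption, not the paper's method.
import numpy as np

def detect_drift(scores, ref_size=100, win_size=50, z_thresh=3.0):
    """Yield indices where the rolling window mean departs from the reference."""
    scores = np.asarray(scores, dtype=float)
    ref = scores[:ref_size]
    mu, sigma = ref.mean(), ref.std() + 1e-9
    for t in range(ref_size + win_size, len(scores)):
        win_mean = scores[t - win_size:t].mean()
        z = (win_mean - mu) / (sigma / np.sqrt(win_size))
        if abs(z) > z_thresh:
            yield t

stream = np.r_[np.random.default_rng(0).normal(0, 1, 300),
               np.random.default_rng(1).normal(1.5, 1, 300)]  # drift at t=300
print(next(detect_drift(stream)))  # first detected change point
```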
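For the WSLRec entry above: a sketch of the top-$k$ mining step, keeping only the most confident user-item pairs from a weak labeler as fine-tuning signal; the score source and names are illustrative.

```python
# Sketch of "top-k mining": keep only the k highest-confidence user-item
# pairs from a weak labeler (e.g., ItemCF scores) as training signal.
# Variable names and the score source are illustrative assumptions.
import numpy as np

def topk_mine(weak_scores: np.ndarray, k: int) -> np.ndarray:
    """weak_scores: [users, items] from a model-free method; returns
    per-user item indices deemed reliable enough for fine-tuning."""
    # Sort each user's scores descending, truncate to the k most confident.
    return np.argsort(-weak_scores, axis=1)[:, :k]

scores = np.random.default_rng(0).random((4, 10))  # fake ItemCF scores
print(topk_mine(scores, k=3))
```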
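For the temporal-context entry above: recall@k computed under a temporal split, where training uses interactions before a cutoff and evaluation uses those after it; the data layout is assumed for illustration.

```python
# Sketch of recall@k under a temporal split: train on interactions before
# a cutoff, evaluate on interactions after it, so the model never sees
# the future. The data layout here is an assumption for illustration.
import numpy as np

def recall_at_k(ranked_items: np.ndarray, held_out: set, k: int = 20) -> float:
    """Fraction of a user's future (held-out) items found in the top-k ranking."""
    hits = sum(1 for item in ranked_items[:k] if item in held_out)
    return hits / max(len(held_out), 1)

# Hypothetical example: model ranks items 0..99; user's future items below.
ranking = np.random.default_rng(0).permutation(100)
future_interactions = {3, 17, 42}
print(recall_at_k(ranking, future_interactions, k=20))
```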