Related papers: Evaluating Language Model Reasoning about Confidential Information

Evaluating Language Model Reasoning about Confidential Information

URL: http://arxiv.org/abs/2508.19980v1
Date: Wed, 27 Aug 2025 15:39:46 GMT
Title: Evaluating Language Model Reasoning about Confidential Information
Authors: Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter,
Abstract summary: We study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications.<n>We develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized.<n>We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance.
Score: 95.64687778185703
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As language models are increasingly deployed as autonomous agents in high-stakes settings, ensuring that they reliably follow user-defined rules has become a critical safety concern. To this end, we study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications. For this analysis, we develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized (i.e., with a correct password). We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance. In fact, we find that reasoning traces frequently leak confidential information, which calls into question whether reasoning traces should be exposed to users in such applications. We also scale the difficulty of our evaluation along multiple axes: (i) by adding adversarial user pressure through various jailbreaking strategies, and (ii) through longer multi-turn conversations where password verification is more challenging. Overall, our results suggest that current frontier models are not well-suited to handling confidential information, and that reasoning capabilities may need to be trained in a different manner to make them safer for release in high-stakes settings.

Related papers

From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior.<n>We demonstrate how uncertainty is leveraged as an active control signal across three frontiers.<n>This survey argues that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
arXiv Detail & Related papers (2026-01-22T06:21:31Z)
Towards eliciting latent knowledge from LLMs with mechanistic interpretability [1.3286418032136589]
This work aims to explore the ability of current techniques to elicit hidden knowledge from language models.<n>We train a Taboo model: a language model that describes a specific secret word without explicitly stating it.<n>We develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders.
arXiv Detail & Related papers (2025-05-20T13:36:37Z)
Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning [84.94709351266557]
We focus on the trustworthiness of language models with respect to retrieval augmentation. We deem that retrieval-augmented language models have the inherent capabilities of supplying response according to both contextual and parametric knowledge. Inspired by aligning language models with human preference, we take the first step towards aligning retrieval-augmented language models to a status where it responds relying merely on the external evidence.
arXiv Detail & Related papers (2024-10-22T09:25:21Z)
Exploring the Privacy Protection Capabilities of Chinese Large Language Models [19.12726985060863]
We have devised a three-tiered progressive framework for evaluating privacy in language systems. Our primary objective is to comprehensively evaluate the sensitivity of large language models to private information. Our observations indicate that existing Chinese large language models universally show privacy protection shortcomings.
arXiv Detail & Related papers (2024-03-27T02:31:54Z)
Fortifying Ethical Boundaries in AI: Advanced Strategies for Enhancing Security in Large Language Models [3.9490749767170636]
Large language models (LLMs) have revolutionized text generation, translation, and question-answering tasks. Despite their widespread use, LLMs present challenges such as ethical dilemmas when models are compelled to respond inappropriately. This paper addresses these challenges by introducing a multi-pronged approach that includes: 1) filtering sensitive vocabulary from user input to prevent unethical responses; 2) detecting role-playing to halt interactions that could lead to 'prison break' scenarios; and 4) extending these methodologies to various LLM derivatives like Multi-Model Large Language Models (MLLMs)
arXiv Detail & Related papers (2024-01-27T08:09:33Z)
PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind) Our work offers a theoretical analysis for model design and benchmarks various techniques. In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
User-Centered Security in Natural Language Processing [0.7106986689736825]
dissertation proposes a framework of user-centered security in Natural Language Processing (NLP) It focuses on two security domains within NLP with great public interest.
arXiv Detail & Related papers (2023-01-10T22:34:19Z)
LaMDA: Language Models for Dialog Applications [75.75051929981933]
LaMDA is a family of Transformer-based neural language models specialized for dialog. Fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements.
arXiv Detail & Related papers (2022-01-20T15:44:37Z)
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
Why Should I Trust a Model is Private? Using Shifts in Model Explanation for Evaluating Privacy-Preserving Emotion Recognition Model [35.016050900061]
We focus on using interpretable methods to evaluate a model's efficacy to preserve privacy with respect to sensitive variables. We show how certain commonly-used methods that seek to preserve privacy might not align with human perception of privacy preservation. We conduct crowdsourcing experiments to evaluate the inclination of the evaluators to choose a particular model for a given task.
arXiv Detail & Related papers (2021-04-18T09:56:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.