NeuroFilter: Privacy Guardrails for Conversational LLM Agents
- URL: http://arxiv.org/abs/2601.14660v1
- Date: Wed, 21 Jan 2026 05:16:50 GMT
- Title: NeuroFilter: Privacy Guardrails for Conversational LLM Agents
- Authors: Saswat Das, Ferdinando Fioretto
- Abstract summary: This work addresses the computational challenge of enforcing privacy for agentic Large Language Models (LLMs). NeuroFilter is a guardrail framework that operationalizes contextual integrity by mapping norm violations to simple directions in the model's activation space. A comprehensive evaluation across over 150,000 interactions, covering models from 7B to 70B parameters, illustrates the strong performance of NeuroFilter.
- Score: 50.75206727081996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work addresses the computational challenge of enforcing privacy for agentic Large Language Models (LLMs), where privacy is governed by the contextual integrity framework. Existing defenses rely on LLM-mediated checking stages that add substantial latency and cost, and that can be undermined in multi-turn interactions through manipulation or benign-looking conversational scaffolding. Against this background, this paper makes a key observation: internal representations associated with privacy-violating intent can be separated from benign requests using linear structure. Building on this insight, the paper proposes NeuroFilter, a guardrail framework that operationalizes contextual integrity by mapping norm violations to simple directions in the model's activation space, enabling detection even when semantic filters are bypassed. The filter is further extended to capture threats that arise over long conversations via activation velocity, which measures cumulative drift in internal representations across turns. A comprehensive evaluation spanning over 150,000 interactions and models from 7B to 70B parameters demonstrates that NeuroFilter detects privacy attacks while maintaining zero false positives on benign prompts, all while reducing inference cost by several orders of magnitude compared to LLM-based agentic privacy defenses.
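A minimal sketch of the abstract's two core ideas, assuming hidden-state vectors have already been extracted from a chosen transformer layer. The difference-of-means probe below is one common way to obtain such a linear direction and is an illustrative stand-in, not the paper's actual fitting procedure:

```python
import numpy as np

def fit_privacy_direction(violating_acts, benign_acts):
    """Difference-of-means direction separating privacy-violating from
    benign activations: a simple linear probe in activation space."""
    d = violating_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def is_violation(act, direction, threshold=0.0):
    """Flag a prompt whose activation projects past the threshold."""
    return float(act @ direction) > threshold

def activation_velocity(turn_acts):
    """Cumulative drift across conversation turns: the summed displacement
    between successive per-turn activations, aimed at slow multi-turn
    manipulation that no single turn reveals."""
    diffs = np.diff(np.stack(turn_acts), axis=0)
    return float(np.linalg.norm(diffs, axis=1).sum())
```

Because scoring reduces to one dot product per prompt plus a running sum over turns, the check adds negligible cost compared to an LLM-mediated filter, which is the efficiency argument the abstract makes.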
Related papers
- SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought [8.165127822088499]
Large Language Models (LLMs) evolve into personal assistants with access to sensitive user data. Recent findings reveal that LLMs often leak private information through their internal reasoning processes, violating contextual privacy expectations. We introduce Steering Activations towards Leakage-free Thinking (SALT), a lightweight test-time intervention that mitigates privacy leakage in the model's Chain of Thought.
arXiv Detail & Related papers (2025-11-11T02:45:48Z)
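A minimal sketch of projection-based activation steering, one common form such a test-time intervention takes (not necessarily SALT's exact procedure); `leak_dir` is a hypothetical precomputed "leakage" direction:

```python
import torch

def steer_away(hidden, leak_dir, strength=1.0):
    """Ablate the component of each hidden state along a leakage direction,
    applied at inference (e.g., via a forward hook on a chosen layer).
    hidden: (seq_len, d_model); leak_dir: (d_model,)."""
    d = leak_dir / leak_dir.norm()
    proj = (hidden @ d).unsqueeze(-1) * d  # per-token projection onto d
    return hidden - strength * proj
```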
- DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage. We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps. We also introduce SLIM, the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z)
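A schematic of attention-map intervention: suppressing and renormalizing attention entries between token pairs identified as leaking. This is a generic version of the idea, with the pair-identification step (DeLeaker's dynamic part) left out:

```python
import numpy as np

def reweight_attention(attn, leak_pairs, alpha=0.1):
    """Scale down attention entries linking 'leaking' token pairs, then
    renormalize each row so it remains a distribution.
    attn: (queries, keys) non-negative matrix with rows summing to 1."""
    attn = attn.copy()
    for q, k in leak_pairs:
        attn[q, k] *= alpha
    return attn / attn.sum(axis=-1, keepdims=True)
```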
- The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration [72.33801123508145]
Large language models (LLMs) are integral to multi-agent systems. Privacy risks emerge that extend beyond memorization, direct inference, or single-turn evaluations. In particular, seemingly innocuous responses, when composed across interactions, can cumulatively enable adversaries to recover sensitive information.
arXiv Detail & Related papers (2025-09-16T16:57:25Z)
- Searching for Privacy Risks in LLM Agents via Simulation [61.229785851581504]
We present a search-based framework that alternates between improving attack and defense strategies through the simulation of privacy-critical agent interactions. We find that attack strategies escalate from direct requests to sophisticated tactics, such as impersonation and consent forgery. The discovered attacks and defenses transfer across diverse scenarios and backbone models, demonstrating strong practical utility for building privacy-aware agents.
arXiv Detail & Related papers (2025-08-14T17:49:09Z)
- Quantifying Conversation Drift in MCP via Latent Polytope [12.004235167472238]
The Model Context Protocol (MCP) enhances large language models (LLMs) by integrating external tools. However, adversarially crafted content can induce tool poisoning or indirect prompt injection, leading to conversation hijacking, misinformation propagation, or data exfiltration. We propose SecMCP, a framework that detects and quantifies conversation drift, i.e., deviations in latent space trajectories induced by adversarial external knowledge.
arXiv Detail & Related papers (2025-08-08T16:05:27Z)
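One simple way to quantify such drift, shown here as a hedged sketch rather than SecMCP's latent-polytope construction: score each new turn by its distance from the centroid of recent turns, normalized by their spread:

```python
import numpy as np

def drift_score(turn_embs, window=4):
    """Deviation of the latest turn's latent embedding from a recent
    baseline. turn_embs: list of per-turn vectors, oldest first;
    requires at least window + 1 turns."""
    ref_block = np.stack(turn_embs[-window - 1:-1])
    centroid = ref_block.mean(axis=0)
    spread = np.linalg.norm(ref_block - centroid, axis=1).mean() + 1e-8
    return float(np.linalg.norm(turn_embs[-1] - centroid) / spread)
```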
- Convergent Privacy Framework with Contractive GNN Layers for Multi-hop Aggregations [9.399260063250635]
Differential privacy (DP) has been integrated into graph neural networks (GNNs) to protect sensitive structural information. We propose a simple yet effective Contractive Graph Layer (CGL) that ensures the contractiveness required for theoretical guarantees. Our framework, CARIBOU, supports both training and inference, equipped with a contractive aggregation module, a privacy allocation module, and a privacy auditing module.
arXiv Detail & Related papers (2025-06-28T02:17:53Z)
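A minimal sketch of the contractiveness requirement, not CGL's specific construction: rescaling a weight matrix by its spectral norm makes the induced linear map a contraction, which is the property that bounds how far sensitive inputs can propagate through multi-hop aggregation:

```python
import torch

def make_contractive(W, margin=0.99):
    """Rescale W so that ||W x|| <= margin * ||x|| for all x, i.e. its
    spectral norm is at most `margin` (< 1). Never scales W up."""
    sigma = torch.linalg.matrix_norm(W, ord=2)  # largest singular value
    scale = min(1.0, margin / sigma.item())
    return W * scale
```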
- MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation [54.410825977390274]
Existing benchmarks for evaluating contextual privacy in LLM agents primarily assess single-turn, low-complexity tasks. We first present MAGPIE, a benchmark comprising 158 real-life high-stakes scenarios across 15 domains. We then evaluate current state-of-the-art LLMs on their understanding of contextually private data and their ability to collaborate without violating user privacy.
arXiv Detail & Related papers (2025-06-25T18:04:25Z)
- Beyond Jailbreaking: Auditing Contextual Privacy in LLM Agents [43.303548143175256]
This study proposes an auditing framework for conversational privacy that quantifies an agent's susceptibility to risks. The proposed Conversational Manipulation for Privacy Leakage (CMPL) framework is designed to stress-test agents that enforce strict privacy directives.
arXiv Detail & Related papers (2025-06-11T20:47:37Z)
- Urania: Differentially Private Insights into AI Use [102.27238986985698]
Urania provides end-to-end privacy protection by leveraging DP tools such as clustering, partition selection, and histogram-based summarization. Results show the framework's ability to extract meaningful conversational insights while maintaining stringent user privacy.
arXiv Detail & Related papers (2025-06-05T07:00:31Z)
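Of the DP tools named, histogram-based summarization is the simplest to illustrate. A minimal Laplace-mechanism sketch, assuming each user contributes to at most one bin so the L1 sensitivity is 1 under add/remove neighboring datasets; this is the textbook mechanism, not Urania's full pipeline:

```python
import numpy as np

def dp_histogram(counts, epsilon=1.0, rng=None):
    """Release epsilon-DP bin counts via Laplace(1/epsilon) noise."""
    rng = rng or np.random.default_rng()
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))
    return np.clip(np.round(noisy), 0, None)  # post-processing is DP-free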
- Noisy Neighbors: Efficient membership inference attacks against LLMs [2.666596421430287]
This paper introduces an efficient methodology that generates "noisy neighbors" for a target sample by adding noise in the embedding space. Our findings demonstrate that this approach closely matches the effectiveness of employing shadow models, showing its usability in practical privacy auditing scenarios.
arXiv Detail & Related papers (2024-06-24T12:02:20Z)
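A hedged sketch of the noisy-neighbor signal, assuming a Hugging Face-style causal LM that accepts `inputs_embeds` and `labels`; the intuition is that a training member sits in a sharper loss basin, so its loss gap to embedding-space neighbors is larger:

```python
import torch

@torch.no_grad()
def noisy_neighbor_score(model, input_ids, sigma=0.01, k=8):
    """Membership signal: mean loss over k noisy neighbors minus the
    target's own loss. Neighbors are Gaussian perturbations of the
    target's input embeddings. input_ids: (1, seq_len)."""
    embeds = model.get_input_embeddings()(input_ids)
    def nll(e):
        return model(inputs_embeds=e, labels=input_ids).loss.item()
    target = nll(embeds)
    neighbors = [nll(embeds + sigma * torch.randn_like(embeds))
                 for _ in range(k)]
    return sum(neighbors) / k - target  # larger gap -> likely a member
```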
- Over-the-Air Federated Learning with Privacy Protection via Correlated Additive Perturbations [57.20885629270732]
We consider privacy aspects of wireless federated learning with Over-the-Air (OtA) transmission of gradient updates from multiple users/agents to an edge server.
Traditional perturbation-based methods provide privacy protection while sacrificing training accuracy. In this work, we aim to minimize both privacy leakage to the adversary and the degradation of model accuracy at the edge server.
arXiv Detail & Related papers (2022-10-05T13:13:35Z)
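A minimal sketch of the correlated-perturbation idea, under the simplifying assumption that users can coordinate zero-sum noise: each user's individual transmission is masked, yet the perturbations cancel in the over-the-air sum, leaving the server-side aggregate (and hence training accuracy) untouched. Names below are illustrative:

```python
import numpy as np

def zero_sum_perturbations(n_users, dim, scale=1.0, rng=None):
    """Correlated additive noise whose per-user rows hide individual
    updates but whose column sums are exactly zero."""
    rng = rng or np.random.default_rng()
    noise = rng.normal(scale=scale, size=(n_users, dim))
    return noise - noise.mean(axis=0, keepdims=True)

# Each user i transmits grads[i] + noise[i]; the OtA channel sums the
# transmissions, and the perturbations cancel in the aggregate.
grads = np.random.default_rng(0).normal(size=(5, 3))
noise = zero_sum_perturbations(5, 3, rng=np.random.default_rng(1))
received = (grads + noise).sum(axis=0)
assert np.allclose(received, grads.sum(axis=0))
```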