MIRROR: Modular Internal Processing for Personalized Safety in LLM Dialogue
- URL: http://arxiv.org/abs/2506.00430v2
- Date: Fri, 03 Oct 2025 17:42:59 GMT
- Title: MIRROR: Modular Internal Processing for Personalized Safety in LLM Dialogue
- Authors: Nicole Hsing,
- Abstract summary: Large language models generate harmful recommendations in personal multi-turn dialogue by ignoring user-specific safety context. We introduce MIRROR, a modular production-focused architecture that prevents these failures through a persistent, bounded internal state.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models frequently generate harmful recommendations in personal multi-turn dialogue by ignoring user-specific safety context, exhibiting sycophantic agreement, and compromising user safety for larger group preferences. We introduce MIRROR, a modular production-focused architecture that prevents these failures through a persistent, bounded internal state that preserves personal conversational information across conversational turns. Our dual-component design, inspired by Dual Process Theory, separates immediate response generation (Talker) from asynchronous deliberative processing (Thinker), which synthesizes parallel reasoning threads between turns with marginal latency. On the CuRaTe personalized safety benchmark, MIRROR-augmented models achieve a 21% relative improvement (69% to 84%) across seven diverse frontier models, with open-source Llama 4 and Mistral 3 variants surpassing both GPT-4o and Claude 3.7 Sonnet at only $0.0028 to $0.0172 additional cost per turn, narrowing the gap between affordable open-source models and frontier systems in the safety space. The modular architecture enables flexible deployment: full internal processing for affordable models or single-component configurations for expensive systems, democratizing access to safer, personalized AI.
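The Talker/Thinker split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class and function names (`BoundedState`, `talker`, `thinker`, `Dialogue`), the keyword check, and the use of a thread pool are all assumptions made for illustration, and the model calls are replaced by string stubs. It only shows the general shape of the idea: reply immediately from the current state, then update a size-limited state in the background before the next turn.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class BoundedState:
    """Hypothetical bounded internal state: keeps only the most recent notes."""

    def __init__(self, max_notes: int = 3):
        self.notes = deque(maxlen=max_notes)


def thinker(state: BoundedState, user_msg: str) -> None:
    # Stand-in for deliberative processing (assumption: a keyword check
    # replaces the model call). Records safety-relevant user context.
    if "allergic" in user_msg.lower():
        state.notes.append(f"safety: {user_msg}")


def talker(state: BoundedState, user_msg: str) -> str:
    # Stand-in for immediate response generation, conditioned on the state.
    context = "; ".join(state.notes) or "none"
    return f"reply to {user_msg!r} (context: {context})"


class Dialogue:
    def __init__(self) -> None:
        self.state = BoundedState()
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.pending = None

    def turn(self, user_msg: str) -> str:
        # Make sure the previous turn's background update has finished.
        if self.pending is not None:
            self.pending.result()
        reply = talker(self.state, user_msg)
        # Schedule the state update to run after the reply is produced.
        self.pending = self.pool.submit(thinker, self.state, user_msg)
        return reply


dialogue = Dialogue()
first = dialogue.turn("I am allergic to peanuts")
second = dialogue.turn("Suggest a snack")
print(first)
print(second)
```

In this sketch the first reply sees no stored context, while the second reply sees the note recorded between turns. The `deque(maxlen=...)` keeps the state bounded by discarding the oldest notes.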
Related papers
- MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models [17.848889547838173]
MUSE (Multimodal Unified Safety Evaluation) is an open-source, run-centric platform that integrates automatic cross-modal payload generation.
A dual-metric framework distinguishes hard Attack Success Rate (Compliance only) from soft ASR (including Partial Compliance).
Experiments show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal.
arXiv Detail & Related papers (2026-03-03T00:10:23Z)
- Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems [51.95643874494937]
Malicious models have a severe impact on multi-LLM systems, especially in the reasoning and safety domains.
We propose mitigation strategies to alleviate the impact of malicious components by employing external supervisors.
arXiv Detail & Related papers (2026-02-05T01:15:06Z)
- AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs [30.026306656765314]
We present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples.
We propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization.
Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show more than 10% decrease in Attack Success Rate.
arXiv Detail & Related papers (2026-01-08T08:57:05Z)
- OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs [36.57820295876294]
We introduce OpenRT, a unified, modular, and high-throughput red-teaming framework for MLLM safety evaluation.
At its core, OpenRT architects a paradigm shift in automated red-teaming by introducing an adversarial kernel that enables modular separation across five dimensions.
Our framework integrates 37 diverse attack methodologies, spanning white-box gradients, multi-modal perturbations, and sophisticated multi-agent evolutionary strategies.
arXiv Detail & Related papers (2026-01-04T16:41:33Z)
- GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs [24.327693899810615]
We present GateBreaker, the first training-free, lightweight, and architecture-agnostic attack framework.
GateBreaker compromises the safety alignment of modern MoE LLMs at inference time.
Our study shows that MoE safety concentrates within a small subset of neurons coordinated by sparse routing.
arXiv Detail & Related papers (2025-12-24T07:13:24Z)
- One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework [51.50565654314582]
Large language models can follow users' instructions throughout a dialogue spanning multiple topics.
Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user's interactive experience.
We propose a framework for assessing multi-turn instruction-following ability.
arXiv Detail & Related papers (2025-11-05T14:39:59Z)
- DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models [59.45605332033458]
Safety mechanisms can backfire, causing over-refusal, where models decline benign requests out of excessive caution.
No existing benchmark has systematically addressed over-refusal in the visual modality.
This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content.
arXiv Detail & Related papers (2025-10-12T23:21:34Z)
- Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model.
AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
arXiv Detail & Related papers (2025-07-17T16:04:55Z)
- Enhancing Goal-oriented Proactive Dialogue Systems via Consistency Reflection and Correction [14.520176577205754]
We introduce a model-agnostic two-stage Consistency Reflection and Correction framework.
In the consistency reflection stage, the model is prompted to reflect on the discrepancies between generated responses and dialogue contexts.
In the consistency correction stage, the model generates responses that are more consistent with the dialogue context.
arXiv Detail & Related papers (2025-06-16T11:15:21Z)
- DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMs [54.4857963044859]
We propose DialogueReason, a reasoning paradigm that uncovers the lost roles in monologue-style reasoning models.
Our work consists of an analysis of monologue reasoning patterns and the development of a dialogue-based reasoning approach.
arXiv Detail & Related papers (2025-05-11T16:39:58Z)
- From Guessing to Asking: An Approach to Resolving the Persona Knowledge Gap in LLMs during Multi-Turn Conversations [11.958380211411386]
This study introduces the persona knowledge gap, the discrepancy between a model's internal understanding and the knowledge required for coherent, personalized conversations.
We propose Conversation Preference Elicitation and Recommendation (CPER), a novel framework that dynamically detects and resolves persona knowledge gaps.
CPER consists of three key modules: a Contextual Understanding Module for preference extraction, a Dynamic Feedback Module for measuring uncertainty and refining persona alignment, and a Persona-Driven Response Generation module for adapting responses based on accumulated user context.
arXiv Detail & Related papers (2025-03-16T15:55:29Z)
- In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents [70.12342024019044]
Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information limits their effectiveness.
We propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections.
RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
arXiv Detail & Related papers (2025-03-11T04:15:52Z)
- LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint [42.98847958315427]
LED-Merging resolves safety-utility conflicts and provides a lightweight, training-free paradigm for constructing reliable multi-task LLMs.
Locates task-specific neurons via gradient-based attribution.
Elects critical neurons through multi-model importance fusion.
Disjoints conflicting updates through parameter isolation.
arXiv Detail & Related papers (2025-02-24T01:19:43Z)
- REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation [51.97224538045096]
We introduce REALTALK, a 21-day corpus of authentic messaging app dialogues.
We compare EI attributes and persona consistency to understand the challenges posed by real-world dialogues.
Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation.
arXiv Detail & Related papers (2025-02-18T20:29:01Z)
- Decoding the Flow: CauseMotion for Emotional Causality Analysis in Long-form Conversations [22.000288488609733]
CauseMotion is a long-sequence emotional causal reasoning framework grounded in Retrieval-Augmented Generation (RAG) and multimodal fusion.
By integrating RAG with a sliding window mechanism, it effectively retrieves and leverages contextually relevant dialogue segments.
A GLM-4 integrated with CauseMotion achieves an 8.7% improvement in causal accuracy over the original model and surpasses GPT-4o by 1.2%.
On the publicly available DiaASQ dataset, CauseMotion-GLM-4 achieves state-of-the-art results in accuracy, F1 score, and causal reasoning accuracy.
arXiv Detail & Related papers (2025-01-01T09:10:32Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- Reasoning in Conversation: Solving Subjective Tasks through Dialogue Simulation for Large Language Models [56.93074140619464]
We propose RiC (Reasoning in Conversation), a method that focuses on solving subjective tasks through dialogue simulation.
The motivation of RiC is to mine useful contextual information by simulating dialogues instead of supplying chain-of-thought style rationales.
We evaluate both API-based and open-source LLMs including GPT-4, ChatGPT, and OpenChat across twelve tasks.
arXiv Detail & Related papers (2024-02-27T05:37:10Z)
- MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation [62.44907105496227]
MindDial is a novel conversational framework that can generate situated free-form responses with theory-of-mind modeling.
We introduce an explicit mind module that can track the speaker's belief and the speaker's prediction of the listener's belief.
Our framework is applied to both prompting and fine-tuning-based models, and is evaluated across scenarios involving both common ground alignment and negotiation.
arXiv Detail & Related papers (2023-06-27T07:24:32Z)
- DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning [89.92601337474954]
Pragmatic reasoning plays a pivotal role in deciphering implicit meanings that frequently arise in real-life conversations.
We introduce a novel challenge, DiPlomat, aiming at benchmarking machines' capabilities on pragmatic reasoning and situated conversational understanding.
arXiv Detail & Related papers (2023-06-15T10:41:23Z)
- Coreference-aware Double-channel Attention Network for Multi-party Dialogue Reading Comprehension [7.353227696624305]
We tackle Multi-party Dialogue Reading Comprehension (abbr., MDRC).
MDRC stands for an extractive reading comprehension task grounded on a batch of dialogues among multiple interlocutors.
We propose a coreference-aware attention modeling method to strengthen the reasoning ability.
arXiv Detail & Related papers (2023-05-15T05:01:29Z)
- Federated Nearest Neighbor Machine Translation [66.8765098651988]
In this paper, we propose a novel federated nearest neighbor (FedNN) machine translation framework.
FedNN leverages one-round memorization-based interaction to share knowledge across different clients.
Experiments show that FedNN significantly reduces computational and communication costs compared with FedAvg.
arXiv Detail & Related papers (2023-02-23T18:04:07Z)
- Dial2vec: Self-Guided Contrastive Learning of Unsupervised Dialogue Embeddings [41.79937481022846]
We introduce the task of learning unsupervised dialogue embeddings.
Trivial approaches such as combining pre-trained word or sentence embeddings and encoding through pre-trained language models have been shown to be feasible.
We propose a self-guided contrastive learning approach named dial2vec.
arXiv Detail & Related papers (2022-10-27T11:14:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.