From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
- URL: http://arxiv.org/abs/2506.09996v3
- Date: Mon, 22 Sep 2025 11:37:26 GMT
- Title: From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
- Authors: Yang Li, Qiang Sheng, Yehan Yang, Xueyao Zhang, Juan Cao
- Abstract summary: Existing moderators mainly practice conventional full detection, which determines harmfulness based on the complete LLM output. Recent works pay more attention to partial detection, where moderators oversee the generation midway and stop the output early if harmfulness is detected. We propose the streaming content monitor (SCM), which is trained with dual supervision of response- and token-level labels.
- Score: 12.505882642773829
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy subsequent moderation as an external safety guardrail in real-world products. Existing moderators mainly practice conventional full detection, which determines harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection, where moderators oversee the generation midway and stop the output early if harmfulness is detected, but they directly apply moderators trained under the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset of 29K prompt-response pairs with fine-grained annotations that provide reasonable supervision for token-level training. We then propose the streaming content monitor (SCM), which is trained with dual supervision of response- and token-level labels and follows the output stream of the LLM to make a timely judgment of harmfulness. Experiments show that SCM achieves a macro F1 score of 0.95+, comparable to full detection, while seeing only the first 18% of the tokens in responses on average. Moreover, SCM can serve as a pseudo-harmfulness annotator for improving safety alignment, leading to a higher harmlessness score than DPO.
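The partial-detection setting described in the abstract can be sketched as a loop that re-judges the partial response after every streamed token and halts generation once the monitor's harmfulness judgment crosses a threshold. This is a minimal illustration only: the token scorer below is a toy stand-in, not the paper's trained SCM, and all names here are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def stream_with_monitor(
    tokens: Iterable[str],
    harm_score: Callable[[List[str]], float],
    threshold: float = 0.9,
) -> Tuple[List[str], bool]:
    """Emit tokens until the monitor's running judgment exceeds `threshold`.

    Returns (emitted_tokens, stopped_early). In the streaming setting the
    monitor judges each incomplete prefix, rather than waiting for the full
    response as in conventional full detection.
    """
    emitted: List[str] = []
    for tok in tokens:
        emitted.append(tok)
        # Re-judge the partial response after every token.
        if harm_score(emitted) >= threshold:
            return emitted, True  # early stop: the rest is never generated
    return emitted, False


# Toy scorer (assumption for illustration): flags any prefix containing a
# blocked word. A real monitor would be a trained token-level classifier.
def toy_scorer(prefix: List[str]) -> float:
    return 1.0 if "BLOCKED" in prefix else 0.0


out, stopped = stream_with_monitor(
    ["Sure", ",", " here", "BLOCKED", " rest"], toy_scorer
)
```

Under this sketch, generation halts at the first flagged token, so the downstream user never sees the remainder of a harmful response; latency savings come from judging prefixes instead of complete outputs.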
Related papers
- Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection [52.5174167737992]
Video anomaly detection (VAD) aims to identify abnormal events in videos. We propose SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our method achieves state-of-the-art performance among tuning-free approaches while requiring only 1% of the training data.
arXiv Detail & Related papers (2026-02-27T13:48:50Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - Detecting Adversarial Fine-tuning with Auditing Agents [38.964973163076586]
We introduce the concept of a fine-tuning auditing agent and show it can detect harmful fine-tuning prior to model deployment. We evaluate our detection approach on a diverse set of eight strong fine-tuning attacks from the literature, along with five benign fine-tuned models. Most promisingly, the auditor is able to detect covert cipher attacks that evade safety evaluations and content moderation of the dataset.
arXiv Detail & Related papers (2025-10-17T23:01:16Z) - Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models [15.90085929279269]
Self-Sanitize is a novel LLM-driven mitigation framework inspired by cognitive psychology. It emulates human self-monitoring and self-repair behaviors during conversations. It achieves superior mitigation performance with minimal overhead and without degrading the utility of LLMs.
arXiv Detail & Related papers (2025-09-29T08:59:44Z) - Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing [12.835224376066769]
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their deployment is frequently undermined by undesirable behaviors. We introduce a novel and efficient framework that diagnoses a range of undesirable LLM behaviors by analyzing representations and their gradients. We systematically evaluate our method on tasks that include tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination.
arXiv Detail & Related papers (2025-09-26T12:07:47Z) - SMA: Who Said That? Auditing Membership Leakage in Semi-Black-box RAG Controlling [50.66950115630554]
Retrieval-Augmented Generation (RAG) and its multimodal extension (MRAG) significantly improve the knowledge coverage and contextual understanding of Large Language Models (LLMs). However, retrieval and multimodal fusion obscure content provenance, rendering existing membership inference methods unable to reliably attribute generated outputs to pre-training, external retrieval, or user input, thus undermining privacy leakage accountability. We propose the first Source-aware Membership Audit (SMA), which enables fine-grained source attribution of generated content in a semi-black-box setting with retrieval control capabilities.
arXiv Detail & Related papers (2025-08-12T17:32:24Z) - Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification [17.67273082468732]
Verifiers -- functions assigning rewards to agent behavior -- have been key to AI progress in domains like math and board games. We evaluate Multimodal Large Language Models (MLLMs) as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation. We propose Self-Grounded Verification (SGV), a lightweight method that enables more effective use of MLLMs' knowledge and reasoning.
arXiv Detail & Related papers (2025-07-15T18:50:29Z) - GPT, But Backwards: Exactly Inverting Language Model Outputs [10.759904571495845]
We formalize exact input reconstruction as a discrete problem with a unique global minimum. We introduce SODA, an efficient gradient-based algorithm that operates on a continuous relaxation of the input search space. We succeed in fully recovering 79.5% of shorter out-of-distribution inputs from next-token logits, without a single false positive. This suggests that standard deployment practices may currently provide adequate protection against malicious use of our method.
arXiv Detail & Related papers (2025-07-02T13:20:30Z) - Beyond Next Token Probabilities: Learnable, Fast Detection of Hallucinations and Data Contamination on LLM Output Distributions [60.43398881149664]
We introduce LOS-Net, a lightweight attention-based architecture trained on an efficient encoding of the LLM Output Signature. It achieves superior performance across diverse benchmarks and LLMs, while maintaining extremely low detection latency.
arXiv Detail & Related papers (2025-03-18T09:04:37Z) - The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance the reasoning efficiency of large language models. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z) - END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences, which degrades output quality. We introduce Early Noise Dropping (END), a novel approach that mitigates this issue without requiring fine-tuning of the LLMs.
arXiv Detail & Related papers (2025-02-26T08:07:17Z) - Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences [18.36319991890607]
We present Wildflare GuardRail, a guardrail pipeline designed to enhance the safety and reliability of Large Language Model (LLM) inferences. Wildflare GuardRail integrates several core functional modules, including a Safety Detector that identifies unsafe inputs and detects hallucinations in model outputs. Its lightweight wrappers can address malicious URLs in model outputs in 1.06 s per query with 100% accuracy, without costly model calls.
arXiv Detail & Related papers (2025-02-12T05:48:57Z) - SOUL: A Semi-supervised Open-world continUal Learning method for Network Intrusion Detection [2.8148957592979427]
In this work, we focus on label scarcity and open-world learning (OWL) settings to improve attack class detection for continual learning-based network intrusion detection (NID). We formulate OWL for NID as a semi-supervised continual learning method, dubbed SOUL, that achieves classifier performance on par with fully supervised models while using limited annotated data. The proposed method is evaluated on four standard network intrusion detection datasets.
arXiv Detail & Related papers (2024-12-01T17:57:34Z) - Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a lightweight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG). We compute a factuality score that can be thresholded to yield a binary decision. Our experiments show a high area under the ROC curve (AUC) across a wide range of relevant open-source datasets.
arXiv Detail & Related papers (2024-11-01T20:44:59Z) - Get my drift? Catching LLM Task Drift with Activation Deltas [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users. We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set. We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z) - Unsupervised Continual Anomaly Detection with Contrastively-learned Prompt [80.43623986759691]
We introduce a novel unsupervised continual anomaly detection framework called UCAD.
The framework equips unsupervised anomaly detection (UAD) with continual learning capability through contrastively learned prompts.
We conduct comprehensive experiments and set the benchmark on unsupervised continual anomaly detection and segmentation.
arXiv Detail & Related papers (2024-01-02T03:37:11Z) - Weakly Supervised Detection of Hallucinations in LLM Activations [4.017261947780098]
We propose an auditing method to identify whether a large language model encodes hallucinations in its internal states.
We introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns.
Our results confirm prior findings of BERT's limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally.
arXiv Detail & Related papers (2023-12-05T14:35:11Z) - WSSOD: A New Pipeline for Weakly- and Semi-Supervised Object Detection [75.80075054706079]
We propose a weakly- and semi-supervised object detection framework (WSSOD).
An agent detector is first trained on a joint dataset and then used to predict pseudo bounding boxes on weakly annotated images.
The proposed framework demonstrates remarkable performance on the PASCAL-VOC and MSCOCO benchmarks, comparable to that obtained in fully supervised settings.
arXiv Detail & Related papers (2021-05-21T11:58:50Z) - Self-supervised Video Object Segmentation [76.83567326586162]
The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking).
We make the following contributions: (i) we propose to improve the existing self-supervised approach with a simple yet more effective memory mechanism for long-term correspondence matching; (ii) by augmenting the self-supervised approach with an online adaptation module, our method successfully alleviates tracker drift caused by spatio-temporal discontinuity; and (iii) we demonstrate state-of-the-art results among self-supervised approaches on DAVIS-2017 and YouTube.
arXiv Detail & Related papers (2020-06-22T17:55:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.