Related papers: Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input

URL: http://arxiv.org/abs/2510.05864v1
Date: Tue, 07 Oct 2025 12:33:21 GMT
Title: Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input
Authors: Faeze Ghorbanpour, Alexander Fraser
Abstract summary: Large language models (LLMs) increasingly support applications that rely on extended context, from document processing to retrieval-augmented generation. We evaluate LLMs' sensitivity to harmful content under extended context, varying type (explicit vs. implicit), position (beginning, middle, end), prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens). We observe similar patterns: performance peaks at moderate harmful prevalence (0.25) but declines when content is very sparse or dominant; recall decreases with increasing context length; harmful sentences at the beginning are generally detected more reliably; and explicit content is more consistently recognized than implicit.
Score: 53.19281984086319
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) increasingly support applications that rely on extended context, from document processing to retrieval-augmented generation. While their long-context capabilities are well studied for reasoning and retrieval, little is known about their behavior in safety-critical scenarios. We evaluate LLMs' sensitivity to harmful content under extended context, varying type (explicit vs. implicit), position (beginning, middle, end), prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens). Across harmful content categories such as toxic, offensive, and hate speech, with LLaMA-3, Qwen-2.5, and Mistral, we observe similar patterns: performance peaks at moderate harmful prevalence (0.25) but declines when content is very sparse or dominant; recall decreases with increasing context length; harmful sentences at the beginning are generally detected more reliably; and explicit content is more consistently recognized than implicit. These findings provide the first systematic view of how LLMs prioritize and calibrate harmful content in long contexts, highlighting both their emerging strengths and the challenges that remain for safety-critical use.
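
The experimental factors named in the abstract (position, prevalence, and context length) can be illustrated with a small sketch. This is not the authors' code: the function names `build_long_input` and `recall` are hypothetical, and it assumes prevalence and length are counted in sentences, whereas the paper reports context length in tokens.

```python
import random


def build_long_input(benign, harmful, total_sentences, prevalence, position, seed=0):
    """Mix harmful sentences into a benign context (illustrative only).

    Assumption: prevalence is the fraction of *sentences* that are harmful,
    and the harmful sentences form one contiguous block placed at the
    beginning, middle, or end of the input.
    """
    if position not in {"beginning", "middle", "end"}:
        raise ValueError(f"unknown position: {position!r}")
    rng = random.Random(seed)
    # Always keep at least one harmful sentence, even at very low prevalence.
    n_harmful = max(1, round(total_sentences * prevalence))
    n_benign = total_sentences - n_harmful
    harmful_part = [rng.choice(harmful) for _ in range(n_harmful)]
    benign_part = [rng.choice(benign) for _ in range(n_benign)]
    if position == "beginning":
        insert_at = 0
    elif position == "end":
        insert_at = n_benign
    else:
        insert_at = n_benign // 2
    sentences = benign_part[:insert_at] + harmful_part + benign_part[insert_at:]
    # Gold labels: True marks a harmful sentence.
    labels = (
        [False] * insert_at
        + [True] * n_harmful
        + [False] * (n_benign - insert_at)
    )
    return sentences, labels


def recall(labels, predictions):
    """Fraction of harmful sentences that a detector flagged."""
    positives = sum(labels)
    if positives == 0:
        return 0.0
    hits = sum(1 for gold, pred in zip(labels, predictions) if gold and pred)
    return hits / positives
```

Sweeping `prevalence`, `position`, and `total_sentences` over a grid and computing `recall` on a model's per-sentence predictions would give one rough way to reproduce the kind of comparison the abstract describes.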

Related papers

MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models [60.01080454274115]
We introduce MMLongCite, a benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts.
arXiv Detail & Related papers (2025-10-15T08:22:03Z)
Positional Biases Shift as Inputs Approach Context Window Limits [57.00239097102958]
The lost-in-the-middle (LiM) effect is strongest when inputs occupy up to 50% of a model's context window. We observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input.
arXiv Detail & Related papers (2025-08-10T20:40:24Z)
What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content [1.6492989697868894]
This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language.
arXiv Detail & Related papers (2025-07-31T08:02:04Z)
Probing Association Biases in LLM Moderation Over-Sensitivity [42.191744175730726]
Large Language Models are widely used for content moderation but often misclassify benign comments as toxic. We introduce Topic Association Analysis, a semantic-level approach to quantify how LLMs associate certain topics with toxicity. More advanced models (e.g., GPT-4 Turbo) demonstrate stronger topic stereotypes despite lower overall false positive rates.
arXiv Detail & Related papers (2025-05-29T18:07:48Z)
What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs [19.604065692511416]
We investigate long-context vulnerabilities in Large Language Models (LLMs) through Many-Shot Jailbreaking (MSJ). Our experiments use context lengths of up to 128K tokens. We find that successful attacks do not require carefully crafted harmful content.
arXiv Detail & Related papers (2025-05-26T09:57:25Z)
END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. They are often distracted by irrelevant or noisy context in input sequences that degrades output quality. We introduce Early Noise Dropping (END), a novel approach to mitigate this issue without requiring fine-tuning the LLMs.
arXiv Detail & Related papers (2025-02-26T08:07:17Z)
Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation [15.355814393928707]
We put forward a unified dataset tailored for social media content moderation across six sensitive categories. These include conflictual language, profanity, sexually explicit material, drug-related content, self-harm, and spam. Fine-tuning large language models on this novel dataset yields significant improvements in detection performance compared to open off-the-shelf models.
arXiv Detail & Related papers (2024-11-29T16:44:02Z)
FABLES: Evaluating faithfulness and content selection in book-length summarization [55.50680057160788]
In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on book-length documents. We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD. An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate.
arXiv Detail & Related papers (2024-04-01T17:33:38Z)
The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context Learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with Large Language Models (LLMs). We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions. Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z)
Don't Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection [29.138463029748547]
This paper explores the capability of Large Language Models to detect implicit hate speech and express confidence in their responses. Our findings highlight that LLMs exhibit two extremes: (1) LLMs display excessive sensitivity towards groups or topics that may cause fairness issues, resulting in misclassifying benign statements as hate speech.
arXiv Detail & Related papers (2024-02-18T00:04:40Z)
