CASTLE: A Comprehensive Benchmark for Evaluating Student-Tailored Personalized Safety in Large Language Models
- URL: http://arxiv.org/abs/2602.05633v1
- Date: Thu, 05 Feb 2026 13:13:19 GMT
- Title: CASTLE: A Comprehensive Benchmark for Evaluating Student-Tailored Personalized Safety in Large Language Models
- Authors: Rui Jia, Ruiyi Lan, Fengrui Liu, Zhongxiang Dai, Bo Jiang, Jing Shao, Jingyuan Chen, Guandong Xu, Fei Wu, Min Zhang
- Abstract summary: We propose the concept of Student-Tailored Personalized Safety and construct CASTLE based on educational theories. This benchmark covers 15 educational safety risks and 14 student attributes, comprising 92,908 bilingual scenarios.
- Score: 55.0103764229311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have advanced the development of personalized learning in education. However, their inherent generation mechanisms often produce homogeneous responses to identical prompts. This one-size-fits-all behavior overlooks the substantial heterogeneity in students' cognitive and psychological characteristics, thereby posing potential safety risks to vulnerable groups. Existing safety evaluations primarily rely on context-independent metrics such as factual accuracy, bias, or toxicity, which fail to capture the divergent harms that the same response might cause across different student attributes. To address this gap, we propose the concept of Student-Tailored Personalized Safety and construct CASTLE based on educational theories. This benchmark covers 15 educational safety risks and 14 student attributes, comprising 92,908 bilingual scenarios. We further design three evaluation metrics: Risk Sensitivity, measuring a model's ability to detect risks; Emotional Empathy, evaluating a model's capacity to recognize student states; and Student Alignment, assessing the match between model responses and student attributes. Experiments on 18 SOTA LLMs demonstrate that CASTLE poses a significant challenge: all models scored below an average safety rating of 2.3 out of 5, indicating substantial deficiencies in personalized safety assurance.
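The abstract describes three metrics (Risk Sensitivity, Emotional Empathy, Student Alignment) and reports an average safety rating out of 5. The sketch below is only an illustration of how per-scenario scores on a 1-5 scale could be aggregated into such an average; the field names, the unweighted averaging, and the judge scale are assumptions for illustration, not the paper's actual scoring procedure.

```python
# Minimal sketch (not from the paper): aggregating per-scenario judge scores
# for three CASTLE-style metrics into an average safety rating on a 1-5 scale.
# Field names and the unweighted mean are assumptions, not the authors' method.
from dataclasses import dataclass
from statistics import mean

@dataclass
class ScenarioScores:
    risk_sensitivity: float   # 1-5: did the model detect the safety risk?
    emotional_empathy: float  # 1-5: did it recognize the student's state?
    student_alignment: float  # 1-5: does the response match the student's attributes?

def scenario_rating(s: ScenarioScores) -> float:
    """Unweighted mean of the three metric scores for one scenario."""
    return mean([s.risk_sensitivity, s.emotional_empathy, s.student_alignment])

def average_safety_rating(scores: list[ScenarioScores]) -> float:
    """Average safety rating across all evaluated scenarios (1-5 scale)."""
    return mean(scenario_rating(s) for s in scores)

if __name__ == "__main__":
    demo = [
        ScenarioScores(2.0, 3.0, 2.5),
        ScenarioScores(1.5, 2.0, 2.0),
    ]
    print(f"average safety rating: {average_safety_rating(demo):.2f} / 5")
```

Under this reading, a reported score of 2.3 out of 5 would simply be the mean of such per-scenario ratings over the benchmark; whether the paper weights the three metrics equally is not stated in the abstract.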
Related papers
- SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond [134.43113804188195]
We introduce SafeSci, a comprehensive framework for safety evaluation and enhancement in scientific contexts. SafeSci comprises SafeSciBench, a multi-disciplinary benchmark with 0.25M samples, and SafeSciTrain, a large-scale dataset containing 1.5M samples for safety enhancement.
arXiv Detail & Related papers (2026-03-02T08:16:04Z) - SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models [60.8821834954637]
We present SafeRBench, the first benchmark that assesses LRM safety end-to-end. We pioneer the incorporation of risk categories and levels into input design. We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units.
arXiv Detail & Related papers (2025-11-19T06:46:33Z) - EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers [8.123835490773095]
Large Language Models for Simulating Professions (SP-LLMs) are pivotal for personalized education. EduGuardBench assesses professional fidelity using a Role-playing Fidelity Score (RFS). It also probes safety vulnerabilities using persona-based adversarial prompts targeting both general harms and, particularly, academic misconduct.
arXiv Detail & Related papers (2025-11-10T09:42:24Z) - Prompt Injection as an Emerging Threat: Evaluating the Resilience of Large Language Models [0.0]
This study proposes a unified framework for evaluating how resistant Large Language Models (LLMs) are to prompt injection attacks. The framework defines three metrics, the Resilience Degradation Index (RDI), the Safety Compliance Coefficient (SCC), and the Instructional Integrity Metric (IIM), to jointly measure robustness, safety, and semantic stability.
arXiv Detail & Related papers (2025-11-03T14:43:56Z) - What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift [33.83306492023009]
ConceptLens is a generic framework that leverages pre-trained multimodal models to identify integrity threats. It uncovers vulnerabilities to bias injection, such as the generation of covert advertisements through malicious concept shifts. It also uncovers sociological biases in generative content, revealing disparities across sociological contexts.
arXiv Detail & Related papers (2025-04-28T13:30:48Z) - The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [70.94607997570729]
We present a comprehensive safety assessment of OpenAI-o3 and DeepSeek-R1 reasoning models. We investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
arXiv Detail & Related papers (2025-02-18T09:06:07Z) - Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and text generation. LLMs can inadvertently generate unsafe or biased responses when prompted with problematic inputs. This research addresses the critical challenge of developing language models that generate both helpful and harmless content.
arXiv Detail & Related papers (2024-11-26T06:52:22Z) - ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information with truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert scales.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z) - ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios, and error rates of up to 19% absolute in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)