Related papers: Evaluating the Effectiveness of OpenAI's Parental Control System

Evaluating the Effectiveness of OpenAI's Parental Control System

URL: http://arxiv.org/abs/2601.23062v1
Date: Fri, 30 Jan 2026 15:15:24 GMT
Title: Evaluating the Effectiveness of OpenAI's Parental Control System
Authors: Kerem Ersoz, Saleh Afroogh, David Atkinson, Junfeng Jiao,
Abstract summary: We evaluate how effectively platform-level parental controls moderate a mainstream conversational assistant used by minors.<n>We focus on seven risk areas -- physical harm, pornography, privacy violence, health consultation, fraud, hate speech, and malware.<n>We find that notifications are selective rather than comprehensive.
Score: 1.6961535626222226
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We evaluate how effectively platform-level parental controls moderate a mainstream conversational assistant used by minors. Our two-phase protocol first builds a category-balanced conversation corpus via PAIR-style iterative prompt refinement over API, then has trained human agents replay/refine those prompts in the consumer UI using a designated child account while monitoring the linked parent inbox for alerts. We focus on seven risk areas -- physical harm, pornography, privacy violence, health consultation, fraud, hate speech, and malware and quantify four outcomes: Notification Rate (NR), Leak-Through (LR), Overblocking (OBR), and UI Intervention Rate (UIR). Using an automated judge (with targeted human audit) and comparing the current backend to legacy variants (GPT-4.1/4o), we find that notifications are selective rather than comprehensive: privacy violence, fraud, hate speech, and malware triggered no parental alerts in our runs, whereas physical harm (highest), pornography, and some health queries produced intermittent alerts. The current backend shows lower leak-through than legacy models, yet overblocking of benign, educational queries near sensitive topics remains common and is not surfaced to parents, revealing a policy-product gap between on-screen safeguards and parent-facing telemetry. We propose actionable fixes: broaden/configure the notification taxonomy, couple visible safeguards to privacy-preserving parent summaries, and prefer calibrated, age-appropriate safe rewrites over blanket refusals.

Related papers

Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs [61.15237978606501]
Large language models can infer private user attributes from user-generated text.<n>Existing anonymization-based defenses are coarse-grained, lacking word-level precision in anonymizing privacy-leaking elements.<n>We propose a unified defense framework that combines fine-grained anonymization (TRACE) with inference-preventing optimization (RPS)
arXiv Detail & Related papers (2026-02-12T03:37:50Z)
Anonymization-Enhanced Privacy Protection for Mobile GUI Agents: Available but Invisible [12.742325129012576]
Mobile Graphical User Interface (GUI) agents have demonstrated strong capabilities in automating complex smartphone tasks.<n>We propose anonymization-based privacy protection framework that enforces the principle of available-but-invisible access to sensitive data.<n>Our system detects sensitive UI content using a PII-aware recognition model and replaces it with deterministic, type-preserving placeholders.
arXiv Detail & Related papers (2026-02-08T15:50:04Z)
The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs.<n>We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z)
In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b [0.0]
We study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior.<n>We test several harm domains including ZIP-bomb construction (cyber threat)<n>We find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader.
arXiv Detail & Related papers (2025-09-25T07:00:12Z)
VoxGuard: Evaluating User and Attribute Privacy in Speech via Membership Inference Attacks [51.68795949691009]
We introduce VoxGuard, a framework grounded in differential privacy and membership inference.<n>For attributes, we show that simple transparent attacks recover gender and accent with near-perfect accuracy even after anonymization.<n>Our results demonstrate that EER substantially underestimates leakage, highlighting the need for low-FPR evaluation.
arXiv Detail & Related papers (2025-09-22T20:57:48Z)
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents [60.78202583483591]
We introduce OS-Harm, a new benchmark for measuring safety of computer use agents.<n> OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior.<n>We evaluate computer use agents based on a range of frontier models and provide insights into their safety.
arXiv Detail & Related papers (2025-06-17T17:59:31Z)
Predictive Response Optimization: Using Reinforcement Learning to Fight Online Social Network Abuse [8.156427899556252]
We argue that detection as described in previous work is not the goal of those who are fighting OSN abuse.<n>Rather, we believe the goal to be selecting actions that optimize a tradeoff between harm caused by abuse and impact on benign users.
arXiv Detail & Related papers (2025-02-24T22:30:14Z)
Towards Copyright Protection for Knowledge Bases of Retrieval-augmented Language Models via Reasoning [58.57194301645823]
Large language models (LLMs) are increasingly integrated into real-world personalized applications.<n>The valuable and often proprietary nature of the knowledge bases used in RAG introduces the risk of unauthorized usage by adversaries.<n>Existing methods that can be generalized as watermarking techniques to protect these knowledge bases typically involve poisoning or backdoor attacks.<n>We propose name for harmless' copyright protection of knowledge bases.
arXiv Detail & Related papers (2025-02-10T09:15:56Z)
Facilitating NSFW Text Detection in Open-Domain Dialogue Systems via Knowledge Distillation [26.443929802292807]
CensorChat is a dialogue monitoring dataset aimed at NSFW dialogue detection. This dataset offers a cost-effective means of constructing NSFW content detectors. The proposed approach not only advances NSFW content detection but also aligns with evolving user protection needs in AI-driven dialogues.
arXiv Detail & Related papers (2023-09-18T13:24:44Z)
Certifying LLM Safety against Adversarial Prompting [70.96868018621167]
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt.<n>We introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees.
arXiv Detail & Related papers (2023-09-06T04:37:20Z)
SPAct: Self-supervised Privacy Preservation for Action Recognition [73.79886509500409]
Existing approaches for mitigating privacy leakage in action recognition require privacy labels along with the action labels from the video dataset. Recent developments of self-supervised learning (SSL) have unleashed the untapped potential of the unlabeled data. We present a novel training framework which removes privacy information from input video in a self-supervised manner without requiring privacy labels.
arXiv Detail & Related papers (2022-03-29T02:56:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.