Related papers: OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

URL: http://arxiv.org/abs/2510.01266v1
Date: Fri, 26 Sep 2025 20:14:54 GMT
Title: OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language
Authors: Isa Inuwa-Dutse,
Abstract summary: We present a summary of a set of vulnerabilities uncovered in OpenAI's GPT-OSS-20b model.<n>The core motivation for our work is to question the model's reliability for users from underrepresented communities.<n>Using Hausa, a major African language, we uncover biases, inaccuracies, and cultural insensitivities in the model's behaviour.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In response to the recent safety probing for OpenAI's GPT-OSS-20b model, we present a summary of a set of vulnerabilities uncovered in the model, focusing on its performance and safety alignment in a low-resource language setting. The core motivation for our work is to question the model's reliability for users from underrepresented communities. Using Hausa, a major African language, we uncover biases, inaccuracies, and cultural insensitivities in the model's behaviour. With a minimal prompting, our red-teaming efforts reveal that the model can be induced to generate harmful, culturally insensitive, and factually inaccurate content in the language. As a form of reward hacking, we note how the model's safety protocols appear to relax when prompted with polite or grateful language, leading to outputs that could facilitate misinformation and amplify hate speech. For instance, the model operates on the false assumption that common insecticide locally known as Fiya-Fiya (Cyphermethrin) and rodenticide like Shinkafar Bera (a form of Aluminium Phosphide) are safe for human consumption. To contextualise the severity of this error and popularity of the substances, we conducted a survey (n=61) in which 98% of participants identified them as toxic. Additional failures include an inability to distinguish between raw and processed foods and the incorporation of demeaning cultural proverbs to build inaccurate arguments. We surmise that these issues manifest through a form of linguistic reward hacking, where the model prioritises fluent, plausible-sounding output in the target language over safety and truthfulness. We attribute the uncovered flaws primarily to insufficient safety tuning in low-resource linguistic contexts. By concentrating on a low-resource setting, our approach highlights a significant gap in current red-teaming effort and offer some recommendations.

Related papers

Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages [8.667909336164465]
Large language models (LLMs) are being deployed across the Global South.<n> Everyday use involves low-resource languages, code-mixing, and culturally specific norms.<n>Our aim is to make multilingual safety a core requirement-not an add-on-for equitable AI in underrepresented regions.
arXiv Detail & Related papers (2026-02-14T19:56:40Z)
Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages [57.059267233093465]
Large Language Models (LLMs) have transformed natural language processing, but their safety mechanisms remain under-explored in low-resource, multilingual settings.<n>We introduce textsfSGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context.<n>We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails.
arXiv Detail & Related papers (2025-09-18T08:14:34Z)
Evaluating Language Model Reasoning about Confidential Information [95.64687778185703]
We study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications.<n>We develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized.<n>We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance.
arXiv Detail & Related papers (2025-08-27T15:39:46Z)
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities [54.152681077418805]
Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalizations of model capabilities.<n>We propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities.<n>Our approach improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting.
arXiv Detail & Related papers (2025-05-29T05:25:27Z)
Language Models That Walk the Talk: A Framework for Formal Fairness Certificates [6.5301153208275675]
This work presents a holistic verification framework to certify the robustness of transformer-based language models.<n>We focus on ensuring gender fairness and consistent outputs across different gender-related terms.<n>We extend this methodology to toxicity detection, offering formal guarantees that adversarially manipulated toxic inputs are consistently detected and appropriately censored.
arXiv Detail & Related papers (2025-05-19T06:46:17Z)
A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens [26.119521867045616]
We propose augmenting the model's vocabulary with a special red flag token.<n>We train the model to insert this token whenever harmful content is generated or imminent.<n>This approach is complementary to existing safety technique.
arXiv Detail & Related papers (2025-02-22T21:48:48Z)
Compromising Honesty and Harmlessness in Language Models via Deception Attacks [0.04499833362998487]
Large language models (LLMs) can understand and employ deceptive behavior, even without explicit prompting.<n>We introduce "deception attacks" that undermine these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences.<n>We show that such targeted deception is effective even in high-stakes domains or ideologically charged subjects.
arXiv Detail & Related papers (2025-02-12T11:02:59Z)
Red-Teaming for Inducing Societal Bias in Large Language Models [16.289297654694607]
We propose two bias-specific red-teaming methods to evaluate how standard safety measures for harmful content affect bias.<n>We use these attacking strategies to induce biased responses from several open- and closed-source language models.<n>We find our method increases bias in all models, even those trained with safety guardrails.
arXiv Detail & Related papers (2024-05-08T01:51:29Z)
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [65.06450319194454]
Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans. This paper introduces a training-free attack method capable of reversing safety alignment. We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward.
arXiv Detail & Related papers (2024-02-19T18:16:51Z)
Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries. Our findings reveal the presence of shortcuts within models, leading to an over-attention of harmful words like 'kill' and prompts emphasizing safety will exacerbate overkill. We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
A Keyword Based Approach to Understanding the Overpenalization of Marginalized Groups by English Marginal Abuse Models on Twitter [2.9604738405097333]
Harmful content detection models tend to have higher false positive rates for content from marginalized groups. We propose a principled approach to detecting and measuring the severity of potential harms associated with a text-based model. We apply our methodology to audit Twitter's English marginal abuse model, which is used for removing amplification eligibility of marginally abusive content.
arXiv Detail & Related papers (2022-10-07T20:28:00Z)
LaMDA: Language Models for Dialog Applications [75.75051929981933]
LaMDA is a family of Transformer-based neural language models specialized for dialog. Fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements.
arXiv Detail & Related papers (2022-01-20T15:44:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.