Related papers: o3-mini vs DeepSeek-R1: Which One is Safer?

URL: http://arxiv.org/abs/2501.18438v2
Date: Fri, 31 Jan 2025 15:39:00 GMT
Title: o3-mini vs DeepSeek-R1: Which One is Safer?
Authors: Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura
Abstract summary: DeepSeek-R1 constitutes a turning point for the AI industry. OpenAI's o3-mini model is expected to set high standards in terms of performance, safety and cost. We make use of our recently released automated safety testing tool, named ASTRAL.
Score: 6.105030666773317
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The irruption of DeepSeek-R1 constitutes a turning point for the AI industry in general and the LLMs in particular. Its capabilities have demonstrated outstanding performance in several tasks, including creative thinking, code generation, maths and automated program repair, at apparently lower execution cost. However, LLMs must adhere to an important qualitative property, i.e., their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI's o3-mini model, which is expected to set high standards in terms of performance, safety and cost. In this technical report, we systematically assess the safety level of both DeepSeek-R1 (70b version) and OpenAI's o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, named ASTRAL. By leveraging this tool, we automatically and systematically generated and executed 1,260 test inputs on both models. After conducting a semi-automated assessment of the outcomes provided by both LLMs, the results indicate that DeepSeek-R1 produces significantly more unsafe responses (12%) than OpenAI's o3-mini (1.2%).

Related papers

ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants [21.35387344588118]
ASTRA is an automated system designed to uncover safety flaws in AI-driven code generation and security guidance systems. ASTRA finds 11-66% more issues than existing techniques and produces test cases that lead to 17% more effective alignment training.
arXiv Detail & Related papers (2025-08-05T21:57:52Z)
Reasoning Models Can be Easily Hacked by Fake Reasoning Bias [59.79548223686273]
We introduce THEATER, a comprehensive benchmark to evaluate Reasoning Theater Bias (RTB). We investigate six bias types including Simple Cues and Fake Chain-of-Thought. We identify 'shallow reasoning' (plausible but flawed arguments) as the most potent form of RTB.
arXiv Detail & Related papers (2025-07-18T09:06:10Z)
RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability [29.437113221903715]
We introduce RealSafe-R1 as safety-aligned versions of DeepSeek-R1 models. Our method preserves the models' reasoning capabilities by maintaining the training data within the original distribution of generation.
arXiv Detail & Related papers (2025-04-14T10:26:37Z)
START: Self-taught Reasoner with Tools [51.38785489790888]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long Chain-of-thought (CoT) reasoning LLM. START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
arXiv Detail & Related papers (2025-03-06T17:11:51Z)
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation [6.105030666773317]
Large Language Models (LLMs) impose certain risks, including those that can harm individuals' privacy, perpetuate biases and spread misinformation. This paper reports the external safety testing experience conducted by researchers from Mondragon University and University of Seville on OpenAI's new o3-mini LLM.
arXiv Detail & Related papers (2025-01-29T16:36:53Z)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [147.16121855209246]
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero is trained via large-scale reinforcement learning. DeepSeek-R1 incorporates multi-stage training and cold-start data before RL.
arXiv Detail & Related papers (2025-01-22T15:19:35Z)
The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility. This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance. We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z)
OpenAI o1 System Card [274.83891368890977]
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
arXiv Detail & Related papers (2024-12-21T18:04:31Z)
AI Cyber Risk Benchmark: Automated Exploitation Capabilities [0.24578723416255752]
We introduce a new benchmark for assessing AI models' capabilities and risks in automated software exploitation. We evaluate several leading language models, including OpenAI's o1-preview and o1-mini, Anthropic's Claude-3.5-sonnet-20241022 and Claude-3.5-sonnet-20240620.
arXiv Detail & Related papers (2024-10-29T10:57:11Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets. We also provide a detailed analysis on several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z)
AI Sandbagging: Language Models can Strategically Underperform on Evaluations [1.0485739694839669]
Trustworthy capability evaluations are crucial for ensuring the safety of AI systems. Developers of AI systems may have incentives for sandbagging evaluations. We show that capability evaluations are vulnerable to sandbagging.
arXiv Detail & Related papers (2024-06-11T15:26:57Z)
Work-in-Progress: Crash Course: Can (Under Attack) Autonomous Driving Beat Human Drivers? [60.51287814584477]
This paper evaluates the inherent risks in autonomous driving by examining the current landscape of AVs. We develop specific claims highlighting the delicate balance between the advantages of AVs and potential security challenges in real-world scenarios.
arXiv Detail & Related papers (2024-05-14T09:42:21Z)
Introducing v0.5 of the AI Safety Benchmark from MLCommons [101.98401637778638]
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models.
arXiv Detail & Related papers (2024-04-18T15:01:00Z)
Software Vulnerability and Functionality Assessment using LLMs [0.8057006406834466]
We investigate whether Large Language Models (LLMs) can aid with code reviews. Our investigation focuses on two tasks that we argue are fundamental to good reviews.
arXiv Detail & Related papers (2024-03-13T11:29:13Z)
Evaluating Model-free Reinforcement Learning toward Safety-critical Tasks [70.76757529955577]
This paper revisits prior work in this scope from the perspective of state-wise safe RL. We propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection. To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit.
arXiv Detail & Related papers (2022-12-12T06:30:17Z)
AIBench Training: Balanced Industry-Standard AI Training Benchmarking [26.820244556465333]
Earlier-stage evaluations of a new AI architecture/system need affordable benchmarks. We use real-world benchmarks to cover the factors space that impacts the learning dynamics. We contribute by far the most comprehensive AI training benchmark suite.
arXiv Detail & Related papers (2020-04-30T11:08:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.