Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
- URL: http://arxiv.org/abs/2503.15092v1
- Date: Wed, 19 Mar 2025 10:44:37 GMT
- Title: Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
- Authors: Zonghao Ying, Guangyi Zheng, Yongxin Huang, Deyue Zhang, Wenxin Zhang, Quanchen Zou, Aishan Liu, Xianglong Liu, Dacheng Tao
- Abstract summary: This study presents the first comprehensive safety evaluation of the DeepSeek models. Our evaluation encompasses DeepSeek's latest generation of large language models, multimodal large language models, and text-to-image models.
- Score: 51.65890794988425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study presents the first comprehensive safety evaluation of the DeepSeek models, focusing on evaluating the safety risks associated with their generated content. Our evaluation encompasses DeepSeek's latest generation of large language models, multimodal large language models, and text-to-image models, systematically examining their performance regarding unsafe content generation. Notably, we developed a bilingual (Chinese-English) safety evaluation dataset tailored to Chinese sociocultural contexts, enabling a more thorough evaluation of the safety capabilities of Chinese-developed models. Experimental results indicate that despite their strong general capabilities, DeepSeek models exhibit significant safety vulnerabilities across multiple risk dimensions, including algorithmic discrimination and sexual content. These findings provide crucial insights for understanding and improving the safety of large foundation models. Our code is available at https://github.com/NY1024/DeepSeek-Safety-Eval.
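To make the evaluation protocol above concrete, the following minimal sketch runs bilingual harmful prompts through a model and reports a per-category attack success rate (the share of harmful prompts that are answered rather than refused). It is an illustration under stated assumptions: the `Prompt` fields, the `query_model` callable, and the keyword-based refusal check are placeholders, not the pipeline released in the DeepSeek-Safety-Eval repository.

```python
# Minimal sketch of a bilingual unsafe-content evaluation loop (illustration only).
# `query_model`, the Prompt fields, and the keyword refusal check are assumptions,
# not the released DeepSeek-Safety-Eval pipeline.
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Prompt:
    text: str
    language: str       # "zh" or "en"
    risk_category: str  # e.g. "algorithmic_discrimination", "sexual_content"


REFUSAL_MARKERS = ("i cannot", "i can't", "无法", "不能")


def is_refusal(response: str) -> bool:
    """Crude keyword heuristic; a real evaluation would use a trained safety judge."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def attack_success_rate(prompts: Iterable[Prompt],
                        query_model: Callable[[str], str]) -> Dict[str, float]:
    """Per-category rate at which harmful prompts are answered rather than refused."""
    totals: Dict[str, int] = {}
    successes: Dict[str, int] = {}
    for p in prompts:
        totals[p.risk_category] = totals.get(p.risk_category, 0) + 1
        if not is_refusal(query_model(p.text)):
            successes[p.risk_category] = successes.get(p.risk_category, 0) + 1
    return {cat: successes.get(cat, 0) / n for cat, n in totals.items()}


if __name__ == "__main__":
    demo = [Prompt("placeholder harmful prompt", "en", "illustrative"),
            Prompt("占位有害提示", "zh", "illustrative")]
    print(attack_success_rate(demo, query_model=lambda text: "I cannot help with that."))
```

A keyword heuristic of this kind under-detects refusals; `is_refusal` marks the spot where a trained safety judge or human annotation would plug in.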
Related papers
- Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts [11.573196818552649]
DeepSeek-R1 is renowned for its exceptional reasoning capabilities and open-source strategy.
DeepSeek-R1 exhibits a 100% attack success rate when processing harmful prompts.
arXiv Detail & Related papers (2025-03-18T08:38:10Z)
- Safety Evaluation of DeepSeek Models in Chinese Contexts [12.297396865203973]
This study introduces CHiSafetyBench, a Chinese-specific safety evaluation benchmark.
This benchmark systematically evaluates the safety of DeepSeek-R1 and DeepSeek-V3 in Chinese contexts.
The experimental results quantify the deficiencies of these two models in Chinese contexts, providing key insights for subsequent improvements.
arXiv Detail & Related papers (2025-02-16T14:05:54Z)
- Safety at Scale: A Comprehensive Survey of Large Model Safety [298.05093528230753]
We present a comprehensive taxonomy of safety threats to large models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats.
We identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices.
arXiv Detail & Related papers (2025-02-02T05:14:22Z)
- Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and text generation.
LLMs can inadvertently generate unsafe or biased responses when prompted with problematic inputs.
This research addresses the critical challenge of developing language models that generate both helpful and harmless content.
arXiv Detail & Related papers (2024-11-26T06:52:22Z)
- Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety.
For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context.
We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z)
- CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models [7.054112690519648]
CHiSafetyBench is a safety benchmark for evaluating large language models' ability to identify risky content and to refuse to answer risky questions in Chinese contexts.
The dataset comprises two task types, multiple-choice questions and question answering, which assess risky-content identification and the ability to refuse to answer risky questions, respectively (a rough scoring sketch follows this summary).
Our experiments reveal varying performance across models and safety domains, indicating that all models have considerable room for improvement in Chinese safety capabilities.
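As a rough illustration of how a two-task benchmark of this kind might be scored, the sketch below computes multiple-choice accuracy for risky-content identification and a refusal rate for risky open-ended questions. The item fields, `answer_fn`, and `is_refusal` callables are hypothetical and do not reflect CHiSafetyBench's actual schema or judging procedure.

```python
# Hypothetical scoring sketch for a two-task safety benchmark
# (multiple-choice risk identification + refusal-oriented QA).
# Item fields and helper callables are assumptions, not CHiSafetyBench's real schema.
from typing import Callable, Dict, List


def score_risk_identification(items: List[Dict],
                              answer_fn: Callable[[str, List[str]], str]) -> float:
    """Accuracy on multiple-choice questions about identifying risky content."""
    correct = sum(answer_fn(item["question"], item["options"]) == item["gold"] for item in items)
    return correct / len(items)


def score_refusal(items: List[Dict],
                  answer_fn: Callable[[str], str],
                  is_refusal: Callable[[str], bool]) -> float:
    """Share of risky open-ended questions the model declines to answer."""
    refused = sum(is_refusal(answer_fn(item["question"])) for item in items)
    return refused / len(items)


if __name__ == "__main__":
    mcq = [{"question": "Which reply is discriminatory?", "options": ["A", "B"], "gold": "A"}]
    qa = [{"question": "Describe how to ..."}]
    print(score_risk_identification(mcq, answer_fn=lambda q, opts: "A"))
    print(score_refusal(qa, answer_fn=lambda q: "I can't help with that.",
                        is_refusal=lambda r: "can't" in r.lower()))
```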
arXiv Detail & Related papers (2024-06-14T06:47:40Z)
- The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)
- Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements [76.80453043969209]
This survey presents a framework for safety research pertaining to large models.
We begin by introducing safety issues of wide concern, then delve into safety evaluation methods for large models.
We explore the strategies for enhancing large model safety from training to deployment.
arXiv Detail & Related papers (2023-02-18T09:32:55Z)
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the models to learn these vulnerabilities and reproduce them during code generation.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)