CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain
- URL: http://arxiv.org/abs/2402.07234v3
- Date: Thu, 21 Mar 2024 12:39:09 GMT
- Title: CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain
- Authors: Xin Tong, Bo Jin, Zhi Lin, Binjun Wang, Ting Yu, Qiang Cheng,
- Abstract summary: This study aims to construct a specialized evaluation benchmark tailored to the Chinese public security domain--CPSDbench.
CPSDbench integrates datasets related to public security collected from real-world scenarios.
This study introduces a set of innovative evaluation metrics designed to more precisely quantify the efficacy of LLMs in executing tasks related to public security.
- Score: 21.825274494004983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated significant potential and effectiveness across multiple application domains. To assess the performance of mainstream LLMs in public security tasks, this study aims to construct a specialized evaluation benchmark tailored to the Chinese public security domain--CPSDbench. CPSDbench integrates datasets related to public security collected from real-world scenarios, supporting a comprehensive assessment of LLMs across four key dimensions: text classification, information extraction, question answering, and text generation. Furthermore, this study introduces a set of innovative evaluation metrics designed to more precisely quantify the efficacy of LLMs in executing tasks related to public security. Through the in-depth analysis and evaluation conducted in this research, we not only enhance our understanding of the performance strengths and limitations of existing models in addressing public security issues but also provide references for the future development of more accurate and customized LLM models targeted at applications in this field.
Related papers
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z) - LalaEval: A Holistic Human Evaluation Framework for Domain-Specific Large Language Models [6.002286552369069]
LalaEval aims to fill a crucial research gap by providing a systematic methodology for conducting standardized human evaluations within specific domains.
The paper demonstrates the framework's application within the logistics industry.
arXiv Detail & Related papers (2024-08-23T19:12:45Z) - ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z) - Exploring Advanced Methodologies in Security Evaluation for LLMs [16.753146059652877]
Large Language Models (LLMs) represent an advanced evolution of earlier, simpler language models.
They boast enhanced abilities to handle complex language patterns and generate coherent text, images, audios, and videos.
Rapid expansion of LLMs has raised security and ethical concerns within the academic community.
arXiv Detail & Related papers (2024-02-28T01:32:58Z) - Can Large Language Models be Trusted for Evaluation? Scalable
Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z) - Security and Privacy Challenges of Large Language Models: A Survey [2.6986500640871482]
Large Language Models (LLMs) have demonstrated extraordinary capabilities and contributed to multiple fields, such as generating and summarizing text, language translation, and question-answering.
These models are also vulnerable to security and privacy attacks, such as jailbreaking attacks, data poisoning attacks, and Personally Identifiable Information (PII) leakage attacks.
This survey provides a thorough review of the security and privacy challenges of LLMs for both training data and users, along with the application-based risks in various domains, such as transportation, education, and healthcare.
arXiv Detail & Related papers (2024-01-30T04:00:54Z) - Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z) - Walking a Tightrope -- Evaluating Large Language Models in High-Risk
Domains [15.320563604087246]
High-risk domains pose unique challenges that require language models to provide accurate and safe responses.
Despite the great success of large language models (LLMs), their performance in high-risk domains remains unclear.
arXiv Detail & Related papers (2023-11-25T08:58:07Z) - Evaluating Large Language Models: A Comprehensive Survey [41.64914110226901]
Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks.
They could suffer from private data leaks or yield inappropriate, harmful, or misleading content.
To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation.
arXiv Detail & Related papers (2023-10-30T17:00:52Z) - A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z) - Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts including 100k augmented prompts and responses by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.