Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios
- URL: http://arxiv.org/abs/2601.03553v1
- Date: Wed, 07 Jan 2026 03:44:12 GMT
- Title: Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios
- Authors: Sangyub Lee, Heedou Kim, Hyeoncheol Kim
- Abstract summary: We propose PAS (Police Action Scenarios), a systematic framework covering the entire evaluation process. Applying this framework, we constructed a novel QA dataset from over 8,000 official documents. Experimental results show that commercial LLMs struggle with our new police-related tasks.
- Score: 1.111256222334957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of Large Language Models (LLMs) in police operations is growing, yet an evaluation framework tailored to this domain remains absent. While LLMs' responses may not always be legally incorrect, their unverified use can still lead to severe issues such as unlawful arrests and improper evidence collection. To address this, we propose PAS (Police Action Scenarios), a systematic framework covering the entire evaluation process. Applying this framework, we constructed a novel QA dataset from over 8,000 official documents and established key metrics validated through statistical analysis against police expert judgements. Experimental results show that commercial LLMs struggle with our new police-related tasks, particularly in providing fact-based recommendations. This study highlights the necessity of an expandable evaluation framework to ensure reliable AI-driven police operations. We release our data and prompt template.
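As a rough illustration of the validation step the abstract describes, the snippet below checks whether an automatic metric's scores agree with expert ratings via rank correlation. The sample data and the choice of Spearman's rho are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch: validate an automatic metric against expert ratings
# by rank correlation. The data below is illustrative, not from the paper.
from scipy.stats import spearmanr

# One entry per QA item: the automatic metric's score and an expert's
# 1-5 judgement of the same LLM answer.
metric_scores = [0.91, 0.42, 0.77, 0.15, 0.66, 0.88, 0.30]
expert_ratings = [5, 2, 4, 1, 3, 5, 2]

rho, p_value = spearmanr(metric_scores, expert_ratings)
print(f"Spearman rho={rho:.3f}, p={p_value:.4f}")
# A high rho with a small p-value suggests the metric tracks expert judgement.
```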
Related papers
- Evaluating Metrics for Safety with LLM-as-Judges [1.93892819796757]
This paper argues that although many natural language processing tasks do not admit deterministic evaluation, adopting a basket of weighted metrics may lower the risk of errors within an evaluation.
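The "basket of weighted metrics" idea can be sketched in a few lines; the metric names and weights below are invented purely for illustration.

```python
# Illustrative only: combine several imperfect safety metrics into one
# weighted score, so no single noisy judge decides the outcome alone.
weights = {"toxicity": 0.4, "refusal_quality": 0.35, "factuality": 0.25}

def basket_score(metric_values: dict[str, float]) -> float:
    """Weighted average of per-metric scores in [0, 1]."""
    return sum(weights[name] * metric_values[name] for name in weights)

print(basket_score({"toxicity": 0.9, "refusal_quality": 0.7, "factuality": 0.8}))
# 0.805 -- flag the response for review if the score falls below a threshold.
```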
arXiv Detail & Related papers (2025-12-17T17:24:49Z)
- Are Your Agents Upward Deceivers? [73.1073084327614]
Large Language Model (LLM)-based agents are increasingly used as autonomous subordinates that carry out tasks for users. This raises the question of whether they may also engage in deception, similar to how individuals in human organizations lie to superiors to create a good image or avoid punishment. We observe and define agentic upward deception, a phenomenon in which an agent facing environmental constraints conceals its failure and performs unrequested actions without reporting them.
arXiv Detail & Related papers (2025-12-04T14:47:05Z)
- Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs [21.5603664964501]
We propose a training-free and efficient method that treats policy violation detection as an out-of-distribution detection problem. Inspired by whitening techniques, we apply a linear transformation to decorrelate the model's hidden activations and standardize them to zero mean and unit variance. On a challenging policy benchmark, our approach achieves state-of-the-art results, surpassing both existing guardrails and fine-tuned reasoning models.
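A minimal sketch of the whitening idea, assuming activations are available as fixed-size vectors; the layer choice, regularization, and scoring rule here are assumptions, not the paper's exact recipe.

```python
# Sketch of activation-space whitening for out-of-distribution scoring.
# Fit mean/covariance on activations of in-distribution (policy-compliant)
# prompts; score new activations by their norm after whitening, which
# equals the Mahalanobis distance to the in-distribution data.
import numpy as np

def fit_whitener(acts: np.ndarray, eps: float = 1e-5):
    """acts: (n_samples, hidden_dim) in-distribution activations."""
    mu = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + eps * np.eye(acts.shape[1])
    # Inverse Cholesky factor decorrelates and scales to unit variance.
    w = np.linalg.inv(np.linalg.cholesky(cov))
    return mu, w

def ood_score(x: np.ndarray, mu: np.ndarray, w: np.ndarray) -> float:
    """Larger score => further from the in-distribution activations."""
    return float(np.linalg.norm(w @ (x - mu)))

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 64))  # stand-in for real hidden activations
mu, w = fit_whitener(train)
print(ood_score(rng.normal(size=64), mu, w))          # near sqrt(64) ~ 8
print(ood_score(rng.normal(size=64) + 5.0, mu, w))    # clearly larger
```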
arXiv Detail & Related papers (2025-12-03T17:23:39Z)
- Large Language Models' Complicit Responses to Illicit Instructions across Socio-Legal Contexts [54.15982476754607]
Large language models (LLMs) are now deployed at unprecedented scale, assisting millions of users in daily tasks. This study defines complicit facilitation as the provision of guidance or support that enables illicit user instructions. Using real-world legal cases and established legal frameworks, we construct an evaluation benchmark spanning 269 illicit scenarios and 50 illicit intents.
arXiv Detail & Related papers (2025-11-25T16:01:31Z)
- Towards AI-Driven Policing: Interdisciplinary Knowledge Discovery from Police Body-Worn Camera Footage [0.0]
We propose a novel framework for analyzing police body-worn camera (BWC) footage using advanced artificial intelligence (AI) and statistical machine learning (ML) techniques. Our goal is to detect, classify, and analyze patterns of interaction between police officers and civilians to identify key behavioral dynamics.
arXiv Detail & Related papers (2025-04-28T17:25:23Z)
- LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise. We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
- Auto-Drafting Police Reports from Noisy ASR Outputs: A Trust-Centered LLM Approach [11.469965123352287]
This study presents an AI-driven system designed to generate police report drafts from complex, noisy, and multi-role dialogue data. Our approach intelligently extracts key elements of law enforcement interactions and includes them in the draft. This framework holds the potential to transform the reporting process, ensuring greater oversight, consistency, and fairness in future policing practices.
arXiv Detail & Related papers (2025-02-11T16:27:28Z)
- LAPIS: Language Model-Augmented Police Investigation System [16.579861300355343]
We introduce LAPIS (Language Model Augmented Police Investigation System), an automated system that assists police officers in performing rational and legal investigative actions.
We constructed a finetuning dataset and a retrieval knowledgebase specialized in legal reasoning for crime investigation.
Experimental results show LAPIS' potential to provide reliable legal guidance for police officers, outperforming even the proprietary GPT-4 model.
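A hedged sketch of the retrieval step such a system implies: embed a query and pull the nearest knowledge-base passages. The toy corpus and TF-IDF encoder below stand in for LAPIS's specialized knowledgebase and retriever, which are not reproduced here.

```python
# Toy retrieval sketch: find the knowledge-base passages most similar to a
# query. TF-IDF stands in for a specialized legal retriever.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "Arrest without a warrant requires an offense in progress.",
    "Evidence seized without consent may be inadmissible.",
    "Traffic stops permit a limited search for officer safety.",
]
vec = TfidfVectorizer().fit(knowledge_base)
kb_matrix = vec.transform(knowledge_base)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages with the highest cosine similarity to query."""
    sims = cosine_similarity(vec.transform([query]), kb_matrix)[0]
    top = sims.argsort()[::-1][:k]
    return [knowledge_base[i] for i in top]

print(retrieve("Can an officer seize evidence without consent?"))
```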
arXiv Detail & Related papers (2024-07-19T09:24:29Z)
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests. First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics. Second, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
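The core idea, scoring text by projecting an LLM's internal representation onto a direction vector, can be sketched as follows. The model choice and the random "direction" are placeholders for whatever projection RepEval actually learns.

```python
# Sketch: score a text by projecting a hidden-state representation onto a
# direction vector. RepEval learns such a projection; here it is random,
# purely to show the mechanics.
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"  # placeholder model, not the one used in the paper
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def quality_score(text: str, direction: torch.Tensor) -> float:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq, dim)
    rep = hidden.mean(dim=1).squeeze(0)             # mean-pooled representation
    return float(rep @ direction)                   # scalar projection

direction = torch.randn(model.config.hidden_size)   # stand-in for a learned axis
print(quality_score("The suspect was detained lawfully.", direction))
```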
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
- LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs).
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
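LM-Polygraph bundles many UE methods; without relying on its specific API, one of the simplest baselines it covers, mean per-token predictive entropy over a generation, can be sketched generically:

```python
# Generic illustration of one simple UE baseline (mean token entropy),
# not LM-Polygraph's own interface.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def mean_token_entropy(prompt: str, max_new_tokens: int = 20) -> float:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False, output_scores=True,
                             return_dict_in_generate=True)
    entropies = []
    for step_logits in out.scores:               # one tensor per generated token
        probs = torch.softmax(step_logits, dim=-1)
        ent = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
        entropies.append(ent.item())
    return sum(entropies) / len(entropies)       # higher => less confident

print(mean_token_entropy("The capital of France is"))
```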
arXiv Detail & Related papers (2023-11-13T15:08:59Z)