Related papers: WCLD: Curated Large Dataset of Criminal Cases from Wisconsin Circuit Courts

URL: http://arxiv.org/abs/2310.18724v1
Date: Sat, 28 Oct 2023 15:04:29 GMT
Title: WCLD: Curated Large Dataset of Criminal Cases from Wisconsin Circuit Courts
Authors: Elliott Ash, Naman Goel, Nianyun Li, Claudia Marangon, Peiyao Sun
Abstract summary: We contribute WCLD, a curated large dataset of 1.5 million criminal cases from circuit courts in the U.S. state of Wisconsin. We used reliable public data from 1970 to 2020 to curate attributes like prior criminal counts and recidivism outcomes.
Score: 7.415975372963897
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Machine learning based decision-support tools in criminal justice systems are subjects of intense discussions and academic research. There are important open questions about the utility and fairness of such tools. Academic researchers often rely on a few small datasets that are not sufficient to empirically study various real-world aspects of these questions. In this paper, we contribute WCLD, a curated large dataset of 1.5 million criminal cases from circuit courts in the U.S. state of Wisconsin. We used reliable public data from 1970 to 2020 to curate attributes like prior criminal counts and recidivism outcomes. The dataset contains a large number of samples from five racial groups, in addition to information like sex and age (at judgment and first offense). Other attributes in this dataset include neighborhood characteristics obtained from census data, detailed types of offense, charge severity, case decisions, sentence lengths, year of filing, etc. We also provide pseudo-identifiers for judge, county and zipcode. The dataset will not only enable researchers to more rigorously study algorithmic fairness in the context of criminal justice, but also relate algorithmic challenges with various systemic issues. We also discuss in detail the process of constructing the dataset and provide a datasheet. The WCLD dataset is available at https://clezdata.github.io/wcld/.

Related papers

Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description. Existing works mainly focus on case-to-case retrieval using lengthy queries. Data scale is insufficient to satisfy the training requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z)
Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets. Data curation strategies are typically developed agnostic of the available compute for training. We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z)
LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset [20.315416393247247]
We introduce LeCaRDv2, a large-scale Legal Case Retrieval dataset (version 2). It consists of 800 queries and 55,192 candidates extracted from 4.3 million criminal case documents. We enrich the existing relevance criteria by considering three key aspects: characterization, penalty, procedure. It's important to note that all cases in the dataset have been annotated by multiple legal experts specializing in criminal law.
arXiv Detail & Related papers (2023-10-26T17:32:55Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
MUSER: A Multi-View Similar Case Retrieval Dataset [65.36779942237357]
Similar case retrieval (SCR) is a representative legal AI application that plays a pivotal role in promoting judicial fairness. Existing SCR datasets only focus on the fact description section when judging the similarity between cases. We present MUSER, a similar case retrieval dataset based on multi-view similarity measurement and comprehensive legal elements, with sentence-level legal element annotations.
arXiv Detail & Related papers (2023-10-24T08:17:11Z)
Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being. A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations. Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented. Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
Analyzing a Carceral Algorithm used by the Pennsylvania Department of Corrections [0.0]
This paper focuses on the Pennsylvania Additive Classification Tool (PACT), which is used to classify prisoners' custody levels while they are incarcerated. The algorithm in this case determines the likelihood that a person will endure additional disciplinary actions, and whether they can complete required programming and gain experiences that, among other things, are distilled into variables feeding into the parole algorithm.
arXiv Detail & Related papers (2021-12-06T18:47:31Z)
Retiring Adult: New Datasets for Fair Machine Learning [47.27417042497261]
UCI Adult has served as the basis for the development and comparison of many algorithmic fairness interventions. We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity. Our primary contribution is a suite of new datasets that extend the existing data ecosystem for research on fair machine learning.
arXiv Detail & Related papers (2021-08-10T19:19:41Z)
Large image datasets: A pyrrhic win for computer vision? [2.627046865670577]
We investigate problematic practices and consequences of large scale vision datasets. We examine broad issues such as the question of consent and justice as well as specific concerns such as the inclusion of verifiably pornographic images in datasets.
arXiv Detail & Related papers (2020-06-24T06:41:32Z)
Extracting Entities and Topics from News and Connecting Criminal Records [6.685013315842082]
This paper summarizes methodologies used in extracting entities and topics from a database of criminal records and from a database of newspapers. Statistical models have been successfully used to study the topics of roughly 300,000 New York Times articles. Analytical approaches, especially hotspot mapping, have been used in some studies with the aim of predicting future crime locations and circumstances.
arXiv Detail & Related papers (2020-05-03T00:06:01Z)