WCLD: Curated Large Dataset of Criminal Cases from Wisconsin Circuit
Courts
- URL: http://arxiv.org/abs/2310.18724v1
- Date: Sat, 28 Oct 2023 15:04:29 GMT
- Title: WCLD: Curated Large Dataset of Criminal Cases from Wisconsin Circuit
Courts
- Authors: Elliott Ash, Naman Goel, Nianyun Li, Claudia Marangon, Peiyao Sun
- Abstract summary: We contribute WCLD, a curated large dataset of 1.5 million criminal cases from circuit courts in the U.S. state of Wisconsin.
We used reliable public data from 1970 to 2020 to curate attributes like prior criminal counts and recidivism outcomes.
- Score: 7.415975372963897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning based decision-support tools in criminal justice systems are
subjects of intense discussions and academic research. There are important open
questions about the utility and fairness of such tools. Academic researchers
often rely on a few small datasets that are not sufficient to empirically study
various real-world aspects of these questions. In this paper, we contribute
WCLD, a curated large dataset of 1.5 million criminal cases from circuit courts
in the U.S. state of Wisconsin. We used reliable public data from 1970 to 2020
to curate attributes like prior criminal counts and recidivism outcomes. The
dataset contains a large number of samples from five racial groups, in addition
to information like sex and age (at judgment and first offense). Other
attributes in this dataset include neighborhood characteristics obtained from
census data, detailed types of offense, charge severity, case decisions,
sentence lengths, year of filing, etc. We also provide pseudo-identifiers for
judge, county, and ZIP code. The dataset will enable researchers not only to
study algorithmic fairness in criminal justice more rigorously, but also to
relate algorithmic challenges to various systemic issues. We also
discuss in detail the process of constructing the dataset and provide a
datasheet. The WCLD dataset is available at
https://clezdata.github.io/wcld/.
Related papers
- Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z)
- LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset [20.315416393247247]
We introduce LeCaRDv2, a large-scale Legal Case Retrieval dataset (version 2).
It consists of 800 queries and 55,192 candidates extracted from 4.3 million criminal case documents.
We enrich the existing relevance criteria by considering three key aspects: characterization, penalty, procedure.
It's important to note that all cases in the dataset have been annotated by multiple legal experts specializing in criminal law.
arXiv Detail & Related papers (2023-10-26T17:32:55Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- MUSER: A Multi-View Similar Case Retrieval Dataset [65.36779942237357]
Similar case retrieval (SCR) is a representative legal AI application that plays a pivotal role in promoting judicial fairness.
Existing SCR datasets only focus on the fact description section when judging the similarity between cases.
We present MUSER, a similar case retrieval dataset based on multi-view similarity measurement and comprehensive legal elements, with sentence-level legal element annotations.
arXiv Detail & Related papers (2023-10-24T08:17:11Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and the scattered nature of the available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Analyzing a Carceral Algorithm used by the Pennsylvania Department of Corrections [0.0]
This paper is focused on the Pennsylvania Additive Classification Tool (PACT) used to classify prisoners' custody levels while they are incarcerated.
The algorithm in this case determines the likelihood a person would endure additional disciplinary actions, can complete required programming, and gain experiences that, among other things, are distilled into variables feeding into the parole algorithm.
arXiv Detail & Related papers (2021-12-06T18:47:31Z)
- CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems showing that CvS is able to achieve much higher classification results compared to previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z)
- Retiring Adult: New Datasets for Fair Machine Learning [47.27417042497261]
UCI Adult has served as the basis for the development and comparison of many algorithmic fairness interventions.
We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity.
Our primary contribution is a suite of new datasets that extend the existing data ecosystem for research on fair machine learning.
arXiv Detail & Related papers (2021-08-10T19:19:41Z)
- Large image datasets: A pyrrhic win for computer vision? [2.627046865670577]
We investigate problematic practices and consequences of large scale vision datasets.
We examine broad issues such as the question of consent and justice as well as specific concerns such as the inclusion of verifiably pornographic images in datasets.
arXiv Detail & Related papers (2020-06-24T06:41:32Z)
- Extracting Entities and Topics from News and Connecting Criminal Records [6.685013315842082]
This paper summarizes methodologies used in extracting entities and topics from a database of criminal records and from a database of newspapers.
Statistical models had successfully been used in studying the topics of roughly 300,000 New York Times articles.
Analytical approaches, especially hotspot mapping, have been used in some research with the aim of predicting future crime locations and circumstances.
arXiv Detail & Related papers (2020-05-03T00:06:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.