Should I disclose my dataset? Caveats between reproducibility and
individual data rights
- URL: http://arxiv.org/abs/2211.00498v1
- Date: Tue, 1 Nov 2022 14:42:11 GMT
- Title: Should I disclose my dataset? Caveats between reproducibility and
individual data rights
- Authors: Raysa M. Benatti, Camila M. L. Villarroel, Sandra Avila, Esther L.
Colombini, Fabiana C. Severi
- Abstract summary: Digital availability of court documents increases possibilities for researchers.
However, personal data protection laws impose restrictions on data exposure.
We present legal and ethical considerations on the issue, as well as guidelines for researchers.
- Score: 5.816090284071069
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Natural language processing techniques have helped domain experts solve legal
problems. Digital availability of court documents increases possibilities for
researchers, who can access them as a source for building datasets -- whose
disclosure is aligned with good reproducibility practices in computational
research. Large and digitized court systems, such as the Brazilian one, are
prone to be explored in that sense. However, personal data protection laws
impose restrictions on data exposure and state principles about which
researchers should be mindful. Special caution must be taken in cases with
human rights violations, such as gender discrimination, over which we elaborate
as an example of interest. We present legal and ethical considerations on the
issue, as well as guidelines for researchers dealing with this kind of data and
deciding whether to disclose it.
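As a purely illustrative aside that is not drawn from the paper, the kind of pseudonymization pass a researcher might run over Brazilian court decisions before even weighing disclosure can be sketched in a few lines of Python. The regular expressions for CPF and CNJ-style case numbers, the placeholder tokens, and the `pseudonymize` function are assumptions introduced for this sketch, not a procedure prescribed by the authors.

```python
import re

# Illustrative pseudonymization pass for Brazilian court decision text.
# The patterns (CPF numbers, XXX.XXX.XXX-XX, and CNJ-style case numbers,
# 0000000-00.0000.0.00.0000) and the placeholder tokens are assumptions
# made for this sketch, not the paper's method.
CPF_PATTERN = re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b")
CNJ_CASE_PATTERN = re.compile(r"\b\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4}\b")

def pseudonymize(text: str, party_names: list[str]) -> str:
    """Replace direct identifiers with neutral placeholders."""
    text = CPF_PATTERN.sub("[CPF]", text)
    text = CNJ_CASE_PATTERN.sub("[CASE_NUMBER]", text)
    # Party names would normally come from case metadata or an NER step;
    # here they are passed in explicitly to keep the sketch self-contained.
    for i, name in enumerate(party_names, start=1):
        text = re.sub(re.escape(name), f"[PARTY_{i}]", text, flags=re.IGNORECASE)
    return text

if __name__ == "__main__":
    decision = (
        "Processo 1234567-89.2020.8.26.0100. A autora Maria da Silva, "
        "CPF 123.456.789-09, alega discriminação de gênero."
    )
    print(pseudonymize(decision, ["Maria da Silva"]))
```

Even a pass like this leaves indirect identifiers (dates, locations, case facts) in place, which is exactly the residual re-identification risk the paper's legal and ethical guidelines are meant to help researchers weigh before disclosure.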
Related papers
- Experimenting with Legal AI Solutions: The Case of Question-Answering for Access to Justice [32.550204238857724]
We propose a human-centric legal NLP pipeline, covering data sourcing, inference, and evaluation.
We release a dataset, LegalQA, with real and specific legal questions spanning from employment law to criminal law.
We show that retrieval-augmented generation from only 850 citations in the train set can match or outperform internet-wide retrieval (a minimal, illustrative retrieval sketch appears after this list).
arXiv Detail & Related papers (2024-09-12T02:40:28Z)
- Gender Bias Detection in Court Decisions: A Brazilian Case Study [4.948270494088624]
We present an experimental framework developed to automatically detect gender biases in court decisions issued in Brazilian Portuguese.
We identify features that are critical in such a technology, given its proposed use as a support tool for research and assessment of court activity.
arXiv Detail & Related papers (2024-06-01T10:34:15Z)
- Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
arXiv Detail & Related papers (2024-04-26T09:51:24Z)
- Embedding Privacy in Computational Social Science and Artificial Intelligence Research [2.048226951354646]
Preserving privacy has emerged as a critical factor in research.
The increasing use of advanced computational models stands to exacerbate privacy concerns.
This article contributes to the field by discussing the role of privacy and the issues that researchers working in CSS, AI, data science and related domains are likely to face.
arXiv Detail & Related papers (2024-04-17T16:07:53Z)
- SoK: The Gap Between Data Rights Ideals and Reality [46.14715472341707]
Do rights-based privacy laws effectively empower individuals over their data?
This paper scrutinizes these approaches by reviewing empirical studies, news articles, and blog posts.
arXiv Detail & Related papers (2023-12-03T21:52:51Z)
- Having your Privacy Cake and Eating it Too: Platform-supported Auditing of Social Media Algorithms for Public Interest [70.02478301291264]
Social media platforms curate access to information and opportunities, and so play a critical role in shaping public discourse.
Prior studies have used black-box methods to show that the curation algorithms of these platforms can lead to biased or discriminatory outcomes.
We propose a new method for platform-supported auditing that can meet the goals of the proposed legislation.
arXiv Detail & Related papers (2022-07-18T17:32:35Z)
- Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset [46.156169284961045]
We offer an approach to filtering grounded in law, a field that has directly addressed the tradeoffs involved in filtering material.
First, we gather and make available the Pile of Law, a 256GB dataset of open-source English-language legal and administrative data.
Second, we distill the legal norms that governments have developed to constrain the inclusion of toxic or private content into actionable lessons.
Third, we show how the Pile of Law offers researchers the opportunity to learn such filtering rules directly from the data.
arXiv Detail & Related papers (2022-07-01T06:25:15Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Yes-Yes-Yes: Donation-based Peer Reviewing Data Collection for ACL Rolling Review and Beyond [58.71736531356398]
We present an in-depth discussion of peer reviewing data, outline the ethical and legal desiderata for peer reviewing data collection, and propose the first continuous, donation-based data collection workflow.
We report on the ongoing implementation of this workflow at the ACL Rolling Review and deliver the first insights obtained with the newly collected data.
arXiv Detail & Related papers (2022-01-27T11:02:43Z)
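The LegalQA entry above reports that retrieval-augmented generation over a small citation set can rival internet-wide retrieval. As a purely illustrative aside, and not the pipeline used in that paper, the retrieve-then-prompt loop can be sketched with a simple bag-of-words ranker; the `CITATIONS` corpus, the cosine scoring, and the `generate_answer` stub are assumptions introduced here.

```python
from collections import Counter
import math

# Minimal retrieval-augmented sketch over a tiny, in-memory "citation" corpus.
# The documents, the bag-of-words scoring, and the stubbed generation step are
# illustrative assumptions; they do not reproduce the LegalQA pipeline.
CITATIONS = [
    "An employer may not terminate an employee in retaliation for filing a complaint.",
    "Overtime pay is owed for hours worked beyond the statutory weekly limit.",
    "A defendant has the right to counsel at every critical stage of prosecution.",
]

def bow(text: str) -> Counter:
    """Lowercased bag-of-words token counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k citations most similar to the question."""
    q = bow(question)
    ranked = sorted(CITATIONS, key=lambda doc: cosine(q, bow(doc)), reverse=True)
    return ranked[:k]

def generate_answer(question: str, passages: list[str]) -> str:
    # Stand-in for a language model call: here we just build the prompt that
    # would be sent, with the retrieved passages prepended as context.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\nQuestion: {question}"

if __name__ == "__main__":
    question = "Can my employer fire me for filing a complaint?"
    print(generate_answer(question, retrieve(question)))
```

In a real setup, the bag-of-words ranker would typically be replaced by a dense or hybrid retriever and the prompt sent to a language model; the point of the sketch is only the structure of retrieving from a small, curated legal corpus before generation.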
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of any of the information shown and is not responsible for any consequences of its use.