Towards Automatic Comparison of Data Privacy Documents: A Preliminary
Experiment on GDPR-like Laws
- URL: http://arxiv.org/abs/2105.10117v1
- Date: Fri, 21 May 2021 03:59:29 GMT
- Title: Towards Automatic Comparison of Data Privacy Documents: A Preliminary
Experiment on GDPR-like Laws
- Authors: Kornraphop Kawintiranon and Yaguang Liu
- Abstract summary: The General Data Protection Regulation (GDPR) has become a standard law for data protection in many countries.
Twelve countries have adopted GDPR-like regulations, but evaluating their differences and similarities is time-consuming and needs manual effort from legal experts.
In this paper, we investigate a simple natural language processing (NLP) approach to tackle the problem.
- Score: 1.3537117504260623
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The General Data Protection Regulation (GDPR) has become a standard law for data
protection in many countries. Currently, twelve countries have adopted the regulation
and established their own GDPR-like regulations. However, evaluating the differences
and similarities of these GDPR-like regulations is time-consuming and needs a
lot of manual effort from legal experts. Moreover, GDPR-like regulations from
different countries are written in their own languages, which makes the task more
difficult because legal experts who know both languages are essential. In this paper,
we investigate a simple natural language processing (NLP) approach to tackle
the problem. We first extract chunks of information from GDPR-like documents
and form structured data from natural language. Next, we use NLP methods to
compare documents to measure their similarity. Finally, we manually label a
small set of data to evaluate our approach. The empirical result shows that the
BERT model with cosine similarity outperforms other baselines. Our data and
code are publicly available.
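The compare-and-score step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the paper reports BERT embeddings with cosine similarity, whereas this sketch substitutes plain word-count vectors so that it runs with the standard library alone. The function names (`embed`, `cosine_similarity`, `best_match`) and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: lowercase word counts.

    Assumption: a real pipeline would return a dense BERT vector here.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty chunk is treated as similar to nothing
    return dot / (norm_a * norm_b)


def best_match(chunk, candidates):
    """Return (index, score) of the candidate most similar to `chunk`."""
    query = embed(chunk)
    scores = [cosine_similarity(query, embed(c)) for c in candidates]
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]


# Hypothetical chunks, invented for illustration only.
law_a = "The data subject shall have the right to obtain erasure of personal data."
law_b = [
    "The controller shall notify the authority of a personal data breach.",
    "The data subject has the right to request erasure of personal data.",
]

index, score = best_match(law_a, law_b)
print(index, round(score, 3))
```

Swapping `embed` for a function that returns dense sentence embeddings would leave the dot-product-over-norms formula unchanged, although `cosine_similarity` would then need to operate on dense vectors rather than on `Counter` objects.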
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Demystifying Legalese: An Automated Approach for Summarizing and Analyzing Overlaps in Privacy Policies and Terms of Service [0.6240153531166704]
Our work seeks to alleviate this issue by developing language models that provide automated, accessible summaries and scores for such documents.
We compared transformer-based and conventional models during training on our dataset, and RoBERTa performed better overall with a remarkable 0.74 F1-score.
arXiv Detail & Related papers (2024-04-17T19:53:59Z)
- Towards an Enforceable GDPR Specification [49.1574468325115]
Privacy by Design (PbD) is prescribed by modern privacy regulations such as the EU's GDPR.
One emerging technique to realize PbD is runtime enforcement (RE).
We present a set of requirements and an iterative methodology for creating formal specifications of legal provisions.
arXiv Detail & Related papers (2024-02-27T09:38:51Z)
- Identification of Regulatory Requirements Relevant to Business
Processes: A Comparative Study on Generative AI, Embedding-based Ranking,
Crowd and Expert-driven Methods [10.899912290518648]
This work examines how legal and domain experts can be assisted in the assessment of relevant requirements.
We compare an embedding-based NLP ranking method, a generative AI method using GPT-4, and a crowdsourced method with the purely manual method of creating labels by experts.
A gold standard is created for both BPMN2.0 processes and matched to real-world requirements from multiple regulatory documents.
arXiv Detail & Related papers (2024-01-02T12:08:31Z)
- MUSER: A Multi-View Similar Case Retrieval Dataset [65.36779942237357]
Similar case retrieval (SCR) is a representative legal AI application that plays a pivotal role in promoting judicial fairness.
Existing SCR datasets only focus on the fact description section when judging the similarity between cases.
We present MUSER, a similar case retrieval dataset based on multi-view similarity measurement and comprehensive legal element with sentence-level legal element annotations.
arXiv Detail & Related papers (2023-10-24T08:17:11Z)
- SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore [159.21914121143885]
We present SILO, a new language model that manages this risk-performance tradeoff during inference.
SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text.
Access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile.
arXiv Detail & Related papers (2023-08-08T17:58:15Z)
- Investigating Fairness Disparities in Peer Review: A Language Model
Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)
- Analysing similarities between legal court documents using natural
language processing approaches based on Transformers [0.0]
This work targets the problem of detecting the degree of similarity between judicial documents that can be achieved in the inference group.
It applies six NLP techniques based on the transformers architecture to a case study of legal proceedings in the Brazilian judicial system.
arXiv Detail & Related papers (2022-04-14T18:25:56Z)
- Regulatory Compliance through Doc2Doc Information Retrieval: A case
study in EU/UK legislation where text similarity has limitations [6.40476282000118]
REG-IR is an application of document-to-document information retrieval.
We show that fine-tuning a BERT model on an in-domain classification task produces the best representations for IR.
We also show that neural re-rankers under-perform due to contradicting supervision, i.e., similar query-document pairs with opposite labels.
arXiv Detail & Related papers (2021-01-26T11:38:15Z)
- Code to Comment "Translation": Data, Metrics, Baselining & Evaluation [49.35567240750619]
We analyze several recent code-comment datasets for this task.
We compare them with WMT19, a standard dataset frequently used to train state of the art natural language translators.
We find some interesting differences between the code-comment data and the WMT19 natural language data.
arXiv Detail & Related papers (2020-10-03T18:57:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.