APPCorp: A Corpus for Android Privacy Policy Document Structure Analysis
- URL: http://arxiv.org/abs/2005.06945v1
- Date: Thu, 14 May 2020 13:25:11 GMT
- Title: APPCorp: A Corpus for Android Privacy Policy Document Structure Analysis
- Authors: Shuang Liu, Renjie Guo, Baiyang Zhao, Tao Chen, and Meishan Zhang
- Abstract summary: In this work we create a manually labelled corpus containing $167$ privacy policies.
We report the annotation process and details of the annotated corpus.
We benchmark our data corpus with $4$ document classification models, thoroughly analyze the results, and discuss challenges and opportunities for the research community in using the corpus.
- Score: 16.618995752616296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increasing popularity of mobile devices and the wide adoption of
mobile apps, concern about privacy issues is growing. Privacy policies are
identified as a proper medium to convey legal terms, such as those of GDPR, and
to bind legal agreements between service providers and users. However, privacy
policies are usually too long and vague for end users to read and understand. It is
thus important to be able to automatically analyze the document structures of
privacy policies to assist user understanding. In this work we create a
manually labelled corpus containing $167$ privacy policies (more than $447$K
words and $5,276$ annotated paragraphs). We report the annotation process and
details of the annotated corpus. We also benchmark our data corpus with $4$
document classification models, thoroughly analyze the results, and discuss
challenges and opportunities for the research community in using the corpus. We
release our labelled corpus as well as the classification models for public
access.
Related papers
- Differential Privacy Overview and Fundamental Techniques [63.0409690498569]
This chapter is meant to be part of the book "Differential Privacy in Artificial Intelligence: From Theory to Practice".
It starts by illustrating various attempts to protect data privacy, emphasizing where and why they failed.
It then defines the key actors, tasks, and scopes that make up the domain of privacy-preserving data analysis.
arXiv Detail & Related papers (2024-11-07T13:52:11Z)
- EROS: Entity-Driven Controlled Policy Document Summarization [16.661448437719464]
We propose to enhance the interpretability and readability of policy documents by using controlled abstractive summarization.
We develop PD-Sum, a policy-document summarization dataset with marked privacy-related entity labels.
Our proposed model, EROS, identifies critical entities through a span-based entity extraction model and employs them to control the information content of the summaries.
arXiv Detail & Related papers (2024-02-29T21:44:50Z)
- Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory [82.7042006247124]
We show that even the most capable AI models reveal private information in contexts that humans would not, 39% and 57% of the time, respectively.
Our work underscores the immediate need to explore novel inference-time privacy-preserving approaches, based on reasoning and theory of mind.
arXiv Detail & Related papers (2023-10-27T04:15:30Z)
- SeePrivacy: Automated Contextual Privacy Policy Generation for Mobile Applications [21.186902172367173]
SeePrivacy is designed to automatically generate contextual privacy policies for mobile apps.
Our method synergistically combines mobile GUI understanding and privacy policy document analysis.
96% of the retrieved policy segments can be correctly matched with their contexts.
arXiv Detail & Related papers (2023-07-04T12:52:45Z)
- Leveraging Large Language Models for Topic Classification in the Domain of Public Affairs [65.9077733300329]
Large Language Models (LLMs) have the potential to greatly enhance the analysis of public affairs documents.
LLMs can be of great use to process domain-specific documents, such as those in the domain of public affairs.
arXiv Detail & Related papers (2023-06-05T13:35:01Z)
- PLUE: Language Understanding Evaluation Benchmark for Privacy Policies in English [77.79102359580702]
We introduce the Privacy Policy Language Understanding Evaluation benchmark, a multi-task benchmark for evaluating privacy policy language understanding.
We also collect a large corpus of privacy policies to enable privacy policy domain-specific language model pre-training.
We demonstrate that domain-specific continual pre-training offers performance improvements across all tasks.
arXiv Detail & Related papers (2022-12-20T05:58:32Z)
- The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization [2.9849405664643585]
We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods.
Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources.
This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage.
arXiv Detail & Related papers (2022-01-25T14:34:42Z)
- StateCensusLaws.org: A Web Application for Consuming and Annotating Legal Discourse Learning [89.77347919191774]
We create a web application to highlight the output of NLP models trained to parse and label discourse segments in law text.
We focus on state-level law that uses U.S. Census population numbers to allocate resources and organize government.
arXiv Detail & Related papers (2021-04-20T22:00:54Z)
- Intent Classification and Slot Filling for Privacy Policies [34.606121042708864]
PolicyIE is a corpus consisting of 5,250 intent and 11,788 slot annotations spanning 31 privacy policies of websites and mobile applications.
We present two alternative neural approaches as baselines: (1) formulating intent classification and slot filling as a joint sequence tagging task and (2) modeling them as a sequence-to-sequence learning task.
arXiv Detail & Related papers (2021-01-01T00:44:41Z)
- PolicyQA: A Reading Comprehension Dataset for Privacy Policies [77.79102359580702]
We present PolicyQA, a dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies.
We evaluate two existing neural QA models and perform rigorous analysis to reveal the advantages and challenges offered by PolicyQA.
arXiv Detail & Related papers (2020-10-06T09:04:58Z)
- Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies [13.09699710197036]
We create PrivaSeer, a corpus of over one million English language website privacy policies.
We report results from readability tests, document similarity, and keyphrase extraction, and explore the corpus through topic modeling.
arXiv Detail & Related papers (2020-04-23T13:21:00Z)