PolicyQA: A Reading Comprehension Dataset for Privacy Policies
- URL: http://arxiv.org/abs/2010.02557v1
- Date: Tue, 6 Oct 2020 09:04:58 GMT
- Title: PolicyQA: A Reading Comprehension Dataset for Privacy Policies
- Authors: Wasi Uddin Ahmad and Jianfeng Chi and Yuan Tian and Kai-Wei Chang
- Abstract summary: We present PolicyQA, a dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies.
We evaluate two existing neural QA models and perform rigorous analysis to reveal the advantages and challenges offered by PolicyQA.
- Score: 77.79102359580702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Privacy policy documents are long and verbose. A question answering (QA)
system can assist users in finding the information that is relevant and
important to them. Prior studies in this domain frame the QA task as retrieving
the most relevant text segment or a list of sentences from the policy document
given a question. In contrast, we argue that providing users with a short
text span from policy documents reduces the burden of searching for the target
information in a lengthy text segment. In this paper, we present PolicyQA, a
dataset that contains 25,017 reading comprehension style examples curated from
an existing corpus of 115 website privacy policies. PolicyQA provides 714
human-annotated questions written for a wide range of privacy practices. We
evaluate two existing neural QA models and perform rigorous analysis to reveal
the advantages and challenges offered by PolicyQA.
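The contrast drawn in the abstract, between returning a whole text segment and returning a short answer span, can be illustrated with a minimal Python sketch. The policy sentence, the question, the `SpanExample` class, and the `make_example` and `token_f1` helpers are all invented here for illustration; none of them come from PolicyQA or its evaluation code. Token-overlap F1 is a common metric for span-based QA and is used here only as an assumption about how such answers might be scored.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SpanExample:
    """One reading-comprehension example: question, policy segment, answer span."""
    question: str
    segment: str
    answer_start: int  # character offset of the answer inside `segment`
    answer_end: int    # exclusive character offset

    @property
    def answer(self) -> str:
        # The answer is always a substring of the segment.
        return self.segment[self.answer_start:self.answer_end]


def make_example(question: str, segment: str, answer_text: str) -> SpanExample:
    """Build an example by locating the answer text verbatim in the segment."""
    start = segment.find(answer_text)
    if start < 0:
        raise ValueError("answer text must appear verbatim in the segment")
    return SpanExample(question, segment, start, start + len(answer_text))


def token_f1(prediction: str, reference: str) -> float:
    """Whitespace-token F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical policy text, written for this sketch only.
segment = ("We collect your email address and device identifiers when you "
           "register, and we may share them with advertising partners.")
example = make_example("What information do you collect?", segment,
                       "email address and device identifiers")

span_score = token_f1(example.answer, example.answer)  # system returns the span
segment_score = token_f1(segment, example.answer)      # system returns the segment
print(example.answer)
print(span_score, segment_score)
```

Under this metric, returning the full segment keeps perfect recall but loses precision, which mirrors the abstract's point that a reader still has to search a long segment for the relevant part.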
Related papers
- InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification [60.10193972862099]
This work proposes a framework to characterize and recover simplification-induced information loss in the form of question-and-answer pairs.
QA pairs are designed to help readers deepen their knowledge of a text.
arXiv Detail & Related papers (2024-01-29T19:00:01Z)
- PolicyGPT: Automated Analysis of Privacy Policies with Large Language Models [41.969546784168905]
In practical use, users tend to click the Agree button directly rather than reading them carefully.
This practice exposes users to risks of privacy leakage and legal issues.
Recently, the advent of Large Language Models (LLM) such as ChatGPT and GPT-4 has opened new possibilities for text analysis.
arXiv Detail & Related papers (2023-09-19T01:22:42Z)
- Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies [74.01792675564218]
We develop a data augmentation framework based on ensembling retriever models that captures relevant text segments from unlabeled policy documents.
To improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models.
Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%.
arXiv Detail & Related papers (2022-04-19T15:45:23Z)
- Discourse Comprehension: A Question Answering Framework to Represent Sentence Connections [35.005593397252746]
A key challenge in building and evaluating models for discourse comprehension is the lack of annotated data.
This paper presents a novel paradigm that enables scalable data collection targeting the comprehension of news documents.
The resulting corpus, DCQA, consists of 22,430 question-answer pairs across 607 English documents.
arXiv Detail & Related papers (2021-11-01T04:50:26Z)
- Privacy Policy Question Answering Assistant: A Query-Guided Extractive Summarization Approach [18.51811191325837]
We propose an automated privacy policy question answering assistant that extracts a summary in response to the input user query.
This is a challenging task because users articulate their privacy-related questions in a very different language than the legal language of the policy.
Our pipeline is able to find an answer for 89% of the user queries in the PrivacyQA dataset.
arXiv Detail & Related papers (2021-09-29T18:00:09Z)
- A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers [66.11048565324468]
We present a dataset of 5,049 questions over 1,585 Natural Language Processing papers.
Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text.
We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers.
arXiv Detail & Related papers (2021-05-07T00:12:34Z)
- Intent Classification and Slot Filling for Privacy Policies [34.606121042708864]
PolicyIE is a corpus consisting of 5,250 intent and 11,788 slot annotations spanning 31 privacy policies of websites and mobile applications.
We present two alternative neural approaches as baselines: (1) formulating intent classification and slot filling as a joint sequence tagging task and (2) modeling them as a sequence-to-sequence learning task.
arXiv Detail & Related papers (2021-01-01T00:44:41Z)
- Open Question Answering over Tables and Text [55.8412170633547]
In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question.
Most open QA systems have considered only retrieving information from unstructured text.
We present a new large-scale dataset Open Table-and-Text Question Answering (OTT-QA) to evaluate performance on this task.
arXiv Detail & Related papers (2020-10-20T16:48:14Z)
- Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.