Retrieval Enhanced Data Augmentation for Question Answering on Privacy
Policies
- URL: http://arxiv.org/abs/2204.08952v3
- Date: Sat, 22 Apr 2023 05:21:45 GMT
- Title: Retrieval Enhanced Data Augmentation for Question Answering on Privacy
Policies
- Authors: Md Rizwan Parvez, Jianfeng Chi, Wasi Uddin Ahmad, Yuan Tian, Kai-Wei
Chang
- Abstract summary: We develop a data augmentation framework based on ensembling retriever models that captures relevant text segments from unlabeled policy documents.
To improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models.
Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%.
- Score: 74.01792675564218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prior studies on privacy policies frame the question answering (QA) task as
identifying the most relevant text segment or a list of sentences from a policy
document given a user query. Existing labeled datasets are heavily imbalanced
(only a few relevant segments), limiting the QA performance in this domain. In
this paper, we develop a data augmentation framework based on ensembling
retriever models that captures the relevant text segments from unlabeled policy
documents and expands the positive examples in the training set. In addition, to
improve the diversity and quality of the augmented data, we leverage multiple
pre-trained language models (LMs) and cascade them with noise reduction filter
models. Using our augmented data on the PrivacyQA benchmark, we elevate the
existing baseline by a large margin (10% F1) and achieve a new
state-of-the-art F1 score of 50%. Our ablation studies provide further
insights into the effectiveness of our approach.
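As a rough illustration of the kind of pipeline the abstract describes, the sketch below ensembles several retrievers over unlabeled policy segments and keeps only candidates that pass a noise-reduction filter as new positive examples. The Retriever interface, the score-sum fusion, and the keep_threshold are assumptions made for the sketch, not the authors' released implementation.

```python
# Minimal sketch of a retrieval-ensemble augmentation pipeline with a noise filter.
# Retriever back-ends, fusion rule, and threshold are illustrative assumptions.
from typing import Callable, List, Tuple

# A retriever takes (query, candidate segments, top_k) and returns scored segments.
Retriever = Callable[[str, List[str], int], List[Tuple[str, float]]]

def augment_positives(
    query: str,
    unlabeled_segments: List[str],
    retrievers: List[Retriever],                 # e.g. BM25 plus dense bi-encoders built on different LMs
    filter_model: Callable[[str, str], float],   # noise-reduction filter, assumed to return a score in [0, 1]
    top_k: int = 10,
    keep_threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Ensemble retrievers over unlabeled policy text, then keep only
    candidate segments that pass the filter as new (question, segment) positives."""
    candidates = {}
    for retrieve in retrievers:
        for segment, score in retrieve(query, unlabeled_segments, top_k):
            # Simple score-sum ensembling across retrievers.
            candidates[segment] = candidates.get(segment, 0.0) + score

    augmented = []
    for segment in sorted(candidates, key=candidates.get, reverse=True):
        if filter_model(query, segment) >= keep_threshold:
            augmented.append((query, segment))
    return augmented
```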
Related papers
- Structured List-Grounded Question Answering [11.109829342410265]
Document-grounded dialogue systems aim to answer user queries by leveraging external information.
Previous studies have mainly focused on handling free-form documents, often overlooking structured data such as lists.
This paper aims to enhance question answering systems for better interpretation and use of structured lists.
arXiv Detail & Related papers (2024-10-04T22:21:43Z)
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- Improving Attributed Text Generation of Large Language Models via Preference Learning [28.09715554543885]
We model the attribution task as preference learning and introduce an Automatic Preference Optimization framework.
APO achieves state-of-the-art citation F1 with higher answer quality.
arXiv Detail & Related papers (2024-03-27T09:19:13Z)
- MinPrompt: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering [64.6741991162092]
We present MinPrompt, a minimal data augmentation framework for open-domain question answering.
We transform the raw text into a graph structure to build connections between different factual sentences.
We then apply graph algorithms to identify the minimal set of sentences needed to cover the most information in the raw text.
We generate QA pairs based on the identified sentence subset and train the model on the selected sentences to obtain the final model.
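The "minimal set of sentences" step resembles a set-cover problem; the snippet below is a greedy approximation over a hypothetical sentence-to-entity mapping, not MinPrompt's actual graph construction or algorithm.

```python
# Illustrative greedy approximation of a minimal covering sentence set.
# Entities per sentence are assumed to come from some prior NER/linking step.
from typing import Dict, List, Set

def greedy_sentence_cover(sentence_entities: Dict[str, Set[str]]) -> List[str]:
    """Pick sentences greedily until every entity mentioned in the text is covered."""
    uncovered = set().union(*sentence_entities.values())
    selected: List[str] = []
    while uncovered:
        # Choose the sentence covering the most still-uncovered entities.
        best = max(sentence_entities, key=lambda s: len(sentence_entities[s] & uncovered))
        gain = sentence_entities[best] & uncovered
        if not gain:
            break  # remaining entities are not coverable
        selected.append(best)
        uncovered -= gain
    return selected

# Example: two of the three sentences suffice to cover all entities.
cover = greedy_sentence_cover({
    "S1: Alice founded Acme in 2010.": {"Alice", "Acme", "2010"},
    "S2: Acme is based in Oslo.": {"Acme", "Oslo"},
    "S3: Alice lives in Oslo.": {"Alice", "Oslo"},
})
```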
arXiv Detail & Related papers (2023-10-08T04:44:36Z)
- Intermediate Training on Question Answering Datasets Improves Generative Data Augmentation [32.83012699501051]
We improve generative data augmentation by formulating the data generation as a context generation task.
We cast downstream tasks into question answering format and adapt the fine-tuned context generators to the target task domain.
We demonstrate substantial performance improvements in both few-shot and zero-shot settings.
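A minimal sketch of what "data generation as context generation" could look like: a fine-tuned generator (interface assumed here, along with the prompt format and consistency check) writes a passage conditioned on a seed question-answer pair, yielding extra QA-format training examples.

```python
# Rough sketch: generate synthetic contexts for seed QA pairs.
# The generator callable, prompt format, and answer-containment check are assumptions.
from typing import Callable, Dict, List, Tuple

def generate_contexts(
    qa_pairs: List[Tuple[str, str]],
    context_generator: Callable[[str], str],   # e.g. a seq2seq LM fine-tuned to produce passages
) -> List[Dict[str, str]]:
    """Build synthetic (context, question, answer) examples in QA format."""
    examples = []
    for question, answer in qa_pairs:
        prompt = f"question: {question} answer: {answer}"
        context = context_generator(prompt)
        if answer in context:                  # simple consistency check on the generated passage
            examples.append({"context": context, "question": question, "answer": answer})
    return examples
```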
arXiv Detail & Related papers (2022-05-25T09:28:21Z)
- PolicyQA: A Reading Comprehension Dataset for Privacy Policies [77.79102359580702]
We present PolicyQA, a dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies.
We evaluate two existing neural QA models and perform rigorous analysis to reveal the advantages and challenges offered by PolicyQA.
arXiv Detail & Related papers (2020-10-06T09:04:58Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
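For intuition, the toy sketch below applies a fixed wh-template to a retrieved declarative sentence to form a pseudo (question, answer) pair; the real system's retrieval, templates, and answer extraction are more elaborate, and the regex and example here are assumed simplifications.

```python
# Toy template-based question generation from a retrieved sentence.
import re
from typing import List, Optional, Tuple

def template_question(sentence: str) -> Optional[Tuple[str, str]]:
    """Turn 'X <be-verb> Y.' style sentences into a (question, answer) pair
    using a fixed 'What <be-verb> Y?' template."""
    match = re.match(r"^(?P<subj>[A-Z][\w ]+?) (?P<verb>was|is|are|were) (?P<rest>.+)\.$", sentence)
    if not match:
        return None
    question = f"What {match.group('verb')} {match.group('rest')}?"
    answer = match.group("subj")
    return question, answer

def build_pseudo_training_data(retrieved_sentences: List[str]) -> List[Tuple[str, str]]:
    """Apply the template to retrieved sentences (rather than the original context)
    to create pseudo QA pairs for unsupervised QA training."""
    pairs = [template_question(s) for s in retrieved_sentences]
    return [p for p in pairs if p is not None]

# e.g. build_pseudo_training_data(["The policy was updated in 2019."])
#      -> [("What was updated in 2019?", "The policy")]
```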