Reasoning over Public and Private Data in Retrieval-Based Systems
- URL: http://arxiv.org/abs/2203.11027v1
- Date: Mon, 14 Mar 2022 13:08:51 GMT
- Title: Reasoning over Public and Private Data in Retrieval-Based Systems
- Authors: Simran Arora and Patrick Lewis and Angela Fan and Jacob Kahn and
Christopher R\'e
- Abstract summary: State-of-the-art systems explicitly retrieve relevant information to a user question from a background corpus before producing an answer.
While today's retrieval systems assume the corpus is fully accessible, users are often unable or unwilling to expose their private data to entities hosting public data.
We first define the PUBLIC-PRIVATE AUTOREGRESSIVE Information RETRIEVAL (PAIR) privacy framework for the novel retrieval setting over multiple privacy scopes.
- Score: 29.515915401413334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Users and organizations are generating ever-increasing amounts of private
data from a wide range of sources. Incorporating private data is important to
personalize open-domain applications such as question-answering, fact-checking,
and personal assistants. State-of-the-art systems for these tasks explicitly
retrieve relevant information to a user question from a background corpus
before producing an answer. While today's retrieval systems assume the corpus
is fully accessible, users are often unable or unwilling to expose their
private data to entities hosting public data. We first define the
PUBLIC-PRIVATE AUTOREGRESSIVE INFORMATION RETRIEVAL (PAIR) privacy framework
for the novel retrieval setting over multiple privacy scopes. We then argue
that an adequate benchmark is missing to study PAIR since existing textual
benchmarks require retrieving from a single data distribution. However, public
and private data intuitively reflect different distributions, motivating us to
create ConcurrentQA, the first textual QA benchmark to require concurrent
retrieval over multiple data-distributions. Finally, we show that existing
systems face large privacy vs. performance tradeoffs when applied to our
proposed retrieval setting and investigate how to mitigate these tradeoffs.
Related papers
- Towards Split Learning-based Privacy-Preserving Record Linkage [49.1574468325115]
Split Learning has been introduced to facilitate applications where user data privacy is a requirement.
In this paper, we investigate the potentials of Split Learning for Privacy-Preserving Record Matching.
arXiv Detail & Related papers (2024-09-02T09:17:05Z) - Differentially Private Data Release on Graphs: Inefficiencies and Unfairness [48.96399034594329]
This paper characterizes the impact of Differential Privacy on bias and unfairness in the context of releasing information about networks.
We consider a network release problem where the network structure is known to all, but the weights on edges must be released privately.
Our work provides theoretical foundations and empirical evidence into the bias and unfairness arising due to privacy in these networked decision problems.
arXiv Detail & Related papers (2024-08-08T08:37:37Z) - Private Approximate Query over Horizontal Data Federation [0.0]
Existing approaches rely on cryptography, which improves privacy, but at the expense of query response time.
We propose a new approach that considers a data distribution-aware online sampling technique to accelerate the execution of range queries.
Our solution is able of providing up to 8 times faster processing than the basic non-secure solution.
arXiv Detail & Related papers (2024-06-17T11:19:58Z) - Privacy-Enhanced Database Synthesis for Benchmark Publishing [16.807486872855534]
Differential privacy has become a key method for safeguarding privacy when sharing data, but the focus has largely been on minimizing errors in aggregate queries or classification tasks.
This paper delves into the creation of privacy-preserving databases specifically for benchmarking, aiming to produce a differentially private database.
PrivBench uses sum-product networks (SPNs) to partition and sample data, enhancing data representation while securing privacy.
arXiv Detail & Related papers (2024-05-02T14:20:24Z) - Federated Learning Empowered by Generative Content [55.576885852501775]
Federated learning (FL) enables leveraging distributed private data for model training in a privacy-preserving way.
We propose a novel FL framework termed FedGC, designed to mitigate data heterogeneity issues by diversifying private data with generative content.
We conduct a systematic empirical study on FedGC, covering diverse baselines, datasets, scenarios, and modalities.
arXiv Detail & Related papers (2023-12-10T07:38:56Z) - Privacy Preserving Large Language Models: ChatGPT Case Study Based Vision and Framework [6.828884629694705]
This article proposes the conceptual model called PrivChatGPT, a privacy-generative model for LLMs.
PrivChatGPT consists of two main components i.e., preserving user privacy during the data curation/pre-processing together with preserving private context and the private training process for large-scale data.
arXiv Detail & Related papers (2023-10-19T06:55:13Z) - A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibited data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z) - Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining [75.25943383604266]
We question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving.
We caution that publicizing these models pretrained on Web data as "private" could lead to harm and erode the public's trust in differential privacy as a meaningful definition of privacy.
We conclude by discussing potential paths forward for the field of private learning, as public pretraining becomes more popular and powerful.
arXiv Detail & Related papers (2022-12-13T10:41:12Z) - Efficient User-Centric Privacy-Friendly and Flexible Wearable Data Aggregation and Sharing [9.532148238768213]
Wearable devices can offer services to individuals and the public.
Wearable data collected by cloud providers may pose privacy risks.
We propose a novel, efficient, user-centric, privacy-friendly, and flexible data aggregation and sharing scheme, named SAMA.
arXiv Detail & Related papers (2022-03-01T13:51:52Z) - Post-processing of Differentially Private Data: A Fairness Perspective [53.29035917495491]
This paper shows that post-processing causes disparate impacts on individuals or groups.
It analyzes two critical settings: the release of differentially private datasets and the use of such private datasets for downstream decisions.
It proposes a novel post-processing mechanism that is (approximately) optimal under different fairness metrics.
arXiv Detail & Related papers (2022-01-24T02:45:03Z) - Prioritized Multi-Criteria Federated Learning [16.35440946424973]
In Machine Learning scenarios, privacy is a crucial concern when models have to be trained with private data coming from users of a service.
We propose Federated Learning (FL) as a means to build ML models based on private datasets distributed over a large number of clients.
A central coordinating server receives locally computed updates by clients and aggregate them to obtain a better global model.
arXiv Detail & Related papers (2020-07-17T10:49:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.