Related papers: Reasoning over Public and Private Data in Retrieval-Based Systems

URL: http://arxiv.org/abs/2203.11027v1
Date: Mon, 14 Mar 2022 13:08:51 GMT
Title: Reasoning over Public and Private Data in Retrieval-Based Systems
Authors: Simran Arora and Patrick Lewis and Angela Fan and Jacob Kahn and Christopher Ré
Abstract summary: State-of-the-art systems explicitly retrieve relevant information to a user question from a background corpus before producing an answer. While today's retrieval systems assume the corpus is fully accessible, users are often unable or unwilling to expose their private data to entities hosting public data. We first define the PUBLIC-PRIVATE AUTOREGRESSIVE INFORMATION RETRIEVAL (PAIR) privacy framework for the novel retrieval setting over multiple privacy scopes.
Score: 29.515915401413334
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Users and organizations are generating ever-increasing amounts of private data from a wide range of sources. Incorporating private data is important to personalize open-domain applications such as question-answering, fact-checking, and personal assistants. State-of-the-art systems for these tasks explicitly retrieve relevant information to a user question from a background corpus before producing an answer. While today's retrieval systems assume the corpus is fully accessible, users are often unable or unwilling to expose their private data to entities hosting public data. We first define the PUBLIC-PRIVATE AUTOREGRESSIVE INFORMATION RETRIEVAL (PAIR) privacy framework for the novel retrieval setting over multiple privacy scopes. We then argue that an adequate benchmark is missing to study PAIR since existing textual benchmarks require retrieving from a single data distribution. However, public and private data intuitively reflect different distributions, motivating us to create ConcurrentQA, the first textual QA benchmark to require concurrent retrieval over multiple data-distributions. Finally, we show that existing systems face large privacy vs. performance tradeoffs when applied to our proposed retrieval setting and investigate how to mitigate these tradeoffs.

Related papers

Differentially Private Synthetic Data Release for Topics API Outputs [63.79476766779742]
We focus on one Privacy-Preserving Ads API: the Topics API, part of Google Chrome's Privacy Sandbox. We generate a differentially-private dataset that closely matches the re-identification risk properties of the real Topics API data. We hope this will enable external researchers to analyze the API in-depth and replicate prior and future work on a realistic large-scale dataset.
arXiv Detail & Related papers (2025-06-30T13:46:57Z)
MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation [54.410825977390274]
Existing benchmarks to evaluate contextual privacy in LLM agents primarily assess single-turn, low-complexity tasks. We first present a benchmark, MAGPIE, comprising 158 real-life high-stakes scenarios across 15 domains. We then evaluate the current state-of-the-art LLMs on their understanding of contextually private data and their ability to collaborate without violating user privacy.
arXiv Detail & Related papers (2025-06-25T18:04:25Z)
Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation [60.81109086640437]
We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG). FedE4RAG facilitates collaborative training of client-side RAG retrieval models. We apply homomorphic encryption within federated learning to safeguard model parameters.
arXiv Detail & Related papers (2025-04-27T04:26:02Z)
Towards Split Learning-based Privacy-Preserving Record Linkage [49.1574468325115]
Split Learning has been introduced to facilitate applications where user data privacy is a requirement. In this paper, we investigate the potential of Split Learning for Privacy-Preserving Record Matching.
arXiv Detail & Related papers (2024-09-02T09:17:05Z)
Differentially Private Data Release on Graphs: Inefficiencies and Unfairness [48.96399034594329]
This paper characterizes the impact of Differential Privacy on bias and unfairness in the context of releasing information about networks. We consider a network release problem where the network structure is known to all, but the weights on edges must be released privately. Our work provides theoretical foundations and empirical evidence into the bias and unfairness arising due to privacy in these networked decision problems.
arXiv Detail & Related papers (2024-08-08T08:37:37Z)
Private Approximate Query over Horizontal Data Federation [0.0]
Existing approaches rely on cryptography, which improves privacy, but at the expense of query response time. We propose a new approach that considers a data distribution-aware online sampling technique to accelerate the execution of range queries. Our solution can provide up to 8 times faster processing than the basic non-secure solution.
arXiv Detail & Related papers (2024-06-17T11:19:58Z)
Privacy-Enhanced Database Synthesis for Benchmark Publishing [16.807486872855534]
Differential privacy has become a key method for safeguarding privacy when sharing data, but the focus has largely been on minimizing errors in aggregate queries or classification tasks. This paper delves into the creation of privacy-preserving databases specifically for benchmarking, aiming to produce a differentially private database. PrivBench uses sum-product networks (SPNs) to partition and sample data, enhancing data representation while securing privacy.
arXiv Detail & Related papers (2024-05-02T14:20:24Z)
Privacy Preserving Large Language Models: ChatGPT Case Study Based Vision and Framework [6.828884629694705]
This article proposes a conceptual model called PrivChatGPT, a privacy-generative model for LLMs. PrivChatGPT consists of two main components: preserving user privacy during data curation/pre-processing, and preserving private context and the private training process for large-scale data.
arXiv Detail & Related papers (2023-10-19T06:55:13Z)
A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing. Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy-sensitive data. Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining [75.25943383604266]
We question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving. We caution that publicizing these models pretrained on Web data as "private" could lead to harm and erode the public's trust in differential privacy as a meaningful definition of privacy. We conclude by discussing potential paths forward for the field of private learning, as public pretraining becomes more popular and powerful.
arXiv Detail & Related papers (2022-12-13T10:41:12Z)
Efficient User-Centric Privacy-Friendly and Flexible Wearable Data Aggregation and Sharing [9.532148238768213]
Wearable devices can offer services to individuals and the public. Wearable data collected by cloud providers may pose privacy risks. We propose a novel, efficient, user-centric, privacy-friendly, and flexible data aggregation and sharing scheme, named SAMA.
arXiv Detail & Related papers (2022-03-01T13:51:52Z)
Post-processing of Differentially Private Data: A Fairness Perspective [53.29035917495491]
This paper shows that post-processing causes disparate impacts on individuals or groups. It analyzes two critical settings: the release of differentially private datasets and the use of such private datasets for downstream decisions. It proposes a novel post-processing mechanism that is (approximately) optimal under different fairness metrics.
arXiv Detail & Related papers (2022-01-24T02:45:03Z)
Decision Making with Differential Privacy under a Fairness Lens [65.16089054531395]
The U.S. Census Bureau releases data sets and statistics about groups of individuals that are used as input to a number of critical decision processes. To conform to privacy and confidentiality requirements, these agencies are often required to release privacy-preserving versions of the data. This paper studies the release of differentially private data sets and analyzes their impact on some critical resource allocation tasks under a fairness perspective.
arXiv Detail & Related papers (2021-05-16T21:04:19Z)
Prioritized Multi-Criteria Federated Learning [16.35440946424973]
In Machine Learning scenarios, privacy is a crucial concern when models have to be trained with private data coming from users of a service. We propose Federated Learning (FL) as a means to build ML models based on private datasets distributed over a large number of clients. A central coordinating server receives updates computed locally by clients and aggregates them to obtain a better global model.
arXiv Detail & Related papers (2020-07-17T10:49:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.