$d_X$-Privacy for Text and the Curse of Dimensionality
- URL: http://arxiv.org/abs/2411.13784v1
- Date: Thu, 21 Nov 2024 01:59:12 GMT
- Title: $d_X$-Privacy for Text and the Curse of Dimensionality
- Authors: Hassan Jameel Asghar, Robin Carpentier, Benjamin Zi Hao Zhao, Dali Kaafar
- Abstract summary: A widely used method to ensure privacy of unstructured text data is the multidimensional Laplace mechanism for $d_X$-privacy.
When applied on a word-by-word basis, the mechanism either outputs the original word, or completely dissimilar words, and very rarely any semantically similar words.
We show that the dot product of the multidimensional Laplace noise vector with any word embedding plays a crucial role in designating the nearest neighbor.
- Score: 4.372695214012181
- Abstract: A widely used method to ensure privacy of unstructured text data is the multidimensional Laplace mechanism for $d_X$-privacy, which is a relaxation of differential privacy for metric spaces. We identify an intriguing peculiarity of this mechanism. When applied on a word-by-word basis, the mechanism either outputs the original word, or completely dissimilar words, and very rarely any semantically similar words. We investigate this observation in detail, and tie it to the fact that the distance of the nearest neighbor of a word in any word embedding model (which are high-dimensional) is much larger than the relative difference in distances to any of its two consecutive neighbors. We also show that the dot product of the multidimensional Laplace noise vector with any word embedding plays a crucial role in designating the nearest neighbor. We derive the distribution, moments and tail bounds of this dot product. We further propose a fix as a post-processing step, which satisfactorily removes the above-mentioned issue.
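To make the mechanism concrete, here is a minimal Python sketch (not the authors' code) of word-by-word $d_X$-privacy with multidimensional Laplace noise, assuming a toy vocabulary `words` with embedding matrix `E` whose rows are word vectors. Noise with density $\propto \exp(-\epsilon \lVert z \rVert)$ is sampled as a uniform direction scaled by a Gamma-distributed radius.

```python
# Minimal sketch of the multidimensional Laplace mechanism for d_X-privacy,
# applied word by word. `words` (list of strings) and `E` (len(words) x dim
# embedding matrix) are illustrative placeholders.
import numpy as np

def multidim_laplace_noise(dim, epsilon, rng):
    """Sample z with density proportional to exp(-epsilon * ||z||)."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)  # uniform direction on the unit sphere
    radius = rng.gamma(shape=dim, scale=1.0 / epsilon)  # Gamma(dim, 1/epsilon)
    return radius * direction

def privatize_word(word, words, E, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    noisy = E[words.index(word)] + multidim_laplace_noise(E.shape[1], epsilon, rng)
    # Release the vocabulary word nearest to the noisy vector.
    return words[int(np.argmin(np.linalg.norm(E - noisy, axis=1)))]
```

The Gamma radius concentrates around $\mathrm{dim}/\epsilon$, which in high-dimensional embedding models dwarfs the gap between consecutive nearest neighbours; this is the geometric source of the all-or-nothing behaviour the paper analyses.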
Related papers
- A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy [3.0177210416625124]
Several word-level $\textit{Metric}$ Differential Privacy approaches have been proposed.
We devise a method where composed privatized outputs have higher semantic coherence and variable length.
We evaluate our method in utility and privacy tests, which make a clear case for tokenization strategies beyond the word level.
arXiv Detail & Related papers (2024-06-30T09:37:34Z)
- A Neighbourhood-Aware Differential Privacy Mechanism for Static Word Embeddings [29.514170092086598]
We propose a Neighbourhood-Aware Differential Privacy (NADP) mechanism considering the neighbourhood of a word in a pretrained static word embedding space.
We first construct a nearest neighbour graph over the words using their embeddings, and factorise it into a set of connected components.
We then separately apply different levels of Gaussian noise to the words in each neighbourhood, determined by the set of words in that neighbourhood.
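A hedged sketch of this pipeline, assuming embeddings in a NumPy matrix `E`: build a k-nearest-neighbour graph, split it into connected components, and calibrate a per-component Gaussian noise scale. The calibration used here (component diameter) is an illustrative assumption, not the paper's exact formula.

```python
# Illustrative sketch of neighbourhood-aware noise calibration (NADP-style).
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

def component_noise_scales(E, k=5):
    # Nearest-neighbour graph over the embeddings, split into components.
    graph = kneighbors_graph(E, n_neighbors=k, mode="connectivity")
    n_comp, labels = connected_components(graph, directed=False)
    scales = np.empty(len(E))
    for c in range(n_comp):
        members = E[labels == c]
        # Assumed heuristic: scale Gaussian noise to the component's diameter.
        diam = max((np.linalg.norm(a - b) for a in members for b in members),
                   default=0.0)
        scales[labels == c] = max(diam, 1e-6)
    return labels, scales
```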
arXiv Detail & Related papers (2023-09-19T11:58:08Z)
- Driving Context into Text-to-Text Privatization [0.0]
$\textit{Metric}$ Differential Privacy enables text-to-text privatization by adding noise to a word's embedding vector.
We demonstrate a substantial increase in classification accuracy of $6.05\%$.
arXiv Detail & Related papers (2023-06-02T11:33:06Z)
- Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora [54.757845511368814]
The problem of comparing two bodies of text and searching for words that differ in their usage arises often in digital humanities and computational social science.
This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large.
We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word.
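The core scoring idea fits in a few lines. Below is a minimal sketch assuming two gensim-style `KeyedVectors` models, `emb1` and `emb2`, trained on the two corpora with a shared vocabulary:

```python
# Rank words by how much their k nearest neighbours differ across two corpora;
# no vector-space alignment is needed.
def neighbour_change_score(word, emb1, emb2, k=100):
    n1 = {w for w, _ in emb1.most_similar(word, topn=k)}
    n2 = {w for w, _ in emb2.most_similar(word, topn=k)}
    # Fewer shared neighbours -> larger suspected usage change.
    return 1.0 - len(n1 & n2) / k
```

Words are then ranked by this score, and the top of the list is inspected for usage change.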
arXiv Detail & Related papers (2021-12-28T23:46:00Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- SemGloVe: Semantic Co-occurrences for GloVe from BERT [55.420035541274444]
GloVe learns word embeddings by leveraging statistical information from word co-occurrence matrices.
We propose SemGloVe, which distills semantic co-occurrences from BERT into static GloVe word embeddings.
arXiv Detail & Related papers (2020-12-30T15:38:26Z)
- SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change [58.87961226278285]
This paper describes SChME, a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change.
SChME uses a model ensemble combining signals from distributional models (word embeddings) and word frequency models, where each model casts a vote indicating the probability that a word suffered semantic change according to that feature.
arXiv Detail & Related papers (2020-12-02T23:56:34Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel hyperplane-based approach for the automatic extraction of domain-specific words.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach reduces the dimensionality of the corpus by 90% and outperforms mutual information.
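The abstract does not spell out the construction, but one plausible reading, offered here purely as an assumption, is to fit a linear separator between domain and background documents and treat low-weight words as uninformative:

```python
# Illustrative (assumed) hyperplane-based stop-word extraction: words whose
# hyperplane weight is near zero carry little domain signal.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def hyperplane_stop_words(domain_docs, background_docs, quantile=0.10):
    vec = CountVectorizer()
    X = vec.fit_transform(domain_docs + background_docs)
    y = [1] * len(domain_docs) + [0] * len(background_docs)
    weights = np.abs(LinearSVC().fit(X, y).coef_.ravel())
    cutoff = np.quantile(weights, quantile)  # keep the lowest-|weight| decile
    vocab = np.array(vec.get_feature_names_out())
    return vocab[weights <= cutoff].tolist()
```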
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- A Differentially Private Text Perturbation Method Using a Regularized Mahalanobis Metric [8.679020335206753]
A popular approach for privacy-preserving text analysis is noise injection, in which text data is first mapped into a continuous embedding space and perturbed with noise. Because spherical noise ignores the local geometry of the embedding space, words in sparse regions are frequently mapped back to themselves. We propose a text perturbation mechanism based on a carefully designed regularized variant of the Mahalanobis metric to overcome this problem.
We provide a text-perturbation algorithm based on this metric and formally prove its privacy guarantees.
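A hedged sketch of the noise shape, assuming the regularized covariance takes the form $\Sigma_{\text{reg}} = \lambda \Sigma + (1-\lambda) I$ for the empirical embedding covariance $\Sigma$ (treat the constants and normalization as assumptions):

```python
# Sample Mahalanobis-shaped noise: a standard multidimensional Laplace vector
# transformed by the square root of a regularized covariance matrix.
import numpy as np

def mahalanobis_noise(E, epsilon, lam=0.5, rng=None):
    rng = rng or np.random.default_rng()
    dim = E.shape[1]
    sigma = np.cov(E, rowvar=False)             # empirical embedding covariance
    sigma_reg = lam * sigma + (1.0 - lam) * np.eye(dim)
    root = np.linalg.cholesky(sigma_reg)        # Sigma_reg^{1/2}
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)      # uniform direction
    radius = rng.gamma(shape=dim, scale=1.0 / epsilon)
    return root @ (radius * direction)          # stretch along dense directions
```

Stretching the noise along directions of high embedding variance makes words in sparse regions more likely to be perturbed to a different word.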
arXiv Detail & Related papers (2020-10-22T23:06:44Z)
- Multiplex Word Embeddings for Selectional Preference Acquisition [70.33531759861111]
We propose a multiplex word embedding model, which can be easily extended according to various relations among words.
Our model can effectively distinguish words with respect to different relations without introducing unnecessary sparseness.
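A minimal sketch of one way to realize this design (names and shapes here are assumptions): each word keeps a single shared center vector plus small relation-specific offsets, so adding a new relation only adds one offset table.

```python
# Illustrative multiplex embedding: shared center vector + per-relation offsets.
import numpy as np

class MultiplexEmbedding:
    def __init__(self, vocab, dim, relations, seed=0):
        rng = np.random.default_rng(seed)
        self.idx = {w: i for i, w in enumerate(vocab)}
        self.center = rng.normal(scale=0.1, size=(len(vocab), dim))
        self.offset = {r: rng.normal(scale=0.01, size=(len(vocab), dim))
                       for r in relations}

    def vector(self, word, relation=None):
        v = self.center[self.idx[word]]
        # Relation-specific view = shared center + small offset for that relation.
        return v if relation is None else v + self.offset[relation][self.idx[word]]
```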
arXiv Detail & Related papers (2020-01-09T04:47:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.