A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy
- URL: http://arxiv.org/abs/2407.00638v1
- Date: Sun, 30 Jun 2024 09:37:34 GMT
- Title: A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy
- Authors: Stephen Meisenbacher, Maulik Chevli, Florian Matthes
- Abstract summary: Several word-level $\textit{Metric}$ Differential Privacy approaches have been proposed.
We devise a method where composed privatized outputs have higher semantic coherence and variable length.
We evaluate our method in utility and privacy tests, which make a clear case for tokenization strategies beyond the word level.
- Score: 3.0177210416625124
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Applications of Differential Privacy (DP) in NLP must distinguish between the syntactic level on which a proposed mechanism operates, often taking the form of $\textit{word-level}$ or $\textit{document-level}$ privatization. Recently, several word-level $\textit{Metric}$ Differential Privacy approaches have been proposed, which rely on this generalized DP notion for operating in word embedding spaces. These approaches, however, often fail to produce semantically coherent textual outputs, and their application at the sentence- or document-level is only possible by a basic composition of word perturbations. In this work, we strive to address these challenges by operating $\textit{between}$ the word and sentence levels, namely with $\textit{collocations}$. By perturbing n-grams rather than single words, we devise a method where composed privatized outputs have higher semantic coherence and variable length. This is accomplished by constructing an embedding model based on frequently occurring word groups, in which unigram words co-exist with bi- and trigram collocations. We evaluate our method in utility and privacy tests, which make a clear case for tokenization strategies beyond the word level.
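To make the idea concrete, here is a minimal sketch of collocation-aware privatization, assuming a toy mixed unigram/collocation vocabulary with stand-in random embeddings; the names `vocab`, `embed`, `privatize`, and `tokenize` are illustrative and not the authors' implementation. Noise is drawn with density proportional to $\exp(-\epsilon \|z\|)$, the standard multivariate mechanism in word-level metric DP, and the greedy longest-match tokenizer is one plausible way to segment text into collocations before perturbation.

```python
import numpy as np

# Toy vocabulary mixing unigrams with bi-/trigram collocations,
# each mapped to a (hypothetical) pre-trained embedding vector.
vocab = ["new", "york", "new_york", "machine_learning", "data", "privacy"]
rng = np.random.default_rng(0)
embed = rng.normal(size=(len(vocab), 50))  # stand-in embedding matrix

def privatize(token: str, epsilon: float) -> str:
    """Perturb a token (word or collocation) with multivariate
    Laplace-style noise and decode to its nearest neighbor."""
    v = embed[vocab.index(token)]
    # Noise with density proportional to exp(-epsilon * ||z||):
    # uniform direction, Gamma-distributed magnitude.
    direction = rng.normal(size=v.shape)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=v.shape[0], scale=1.0 / epsilon)
    noisy = v + magnitude * direction
    # Nearest-neighbor decoding back into the mixed vocabulary.
    dists = np.linalg.norm(embed - noisy, axis=1)
    return vocab[int(np.argmin(dists))]

def tokenize(words, max_n=3):
    """Greedy longest-match segmentation into collocations."""
    i, out = 0, []
    while i < len(words):
        for n in range(max_n, 0, -1):
            cand = "_".join(words[i:i + n])
            if cand in vocab:
                out.append(cand)
                i += n
                break
        else:
            out.append(words[i])  # out-of-vocabulary word kept as-is
            i += 1
    return out

print([privatize(t, epsilon=10.0) for t in tokenize(["new", "york", "data"])])
```

Because "new york" is perturbed as one unit, the composed output stays coherent and its word-level length can differ from the input's, which is the behavior the abstract describes.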
Related papers
- $d_X$-Privacy for Text and the Curse of Dimensionality [4.372695214012181]
A widely used method to ensure privacy of unstructured text data is the multidimensional Laplace mechanism for $d_X$-privacy.
When applied on a word-by-word basis, the mechanism either outputs the original word or completely dissimilar words, and only very rarely a semantically similar word.
We show that the dot product of the multidimensional Laplace noise vector with any word embedding plays a crucial role in designating the nearest neighbor.
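A small simulation, using random stand-in embeddings, illustrates this behavior; the mechanism and tally below are a sketch of the general $d_X$-privacy setup, not the paper's experimental code.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
d = 300                                 # embedding dimensionality
words = [f"w{i}" for i in range(1000)]
E = rng.normal(size=(len(words), d))    # stand-in word embeddings

def mdl_noise(eps: float) -> np.ndarray:
    """Multidimensional Laplace noise: density ~ exp(-eps * ||z||)."""
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    r = rng.gamma(shape=d, scale=1.0 / eps)
    return r * u

def mechanism(idx: int, eps: float) -> str:
    noisy = E[idx] + mdl_noise(eps)
    return words[int(np.argmin(np.linalg.norm(E - noisy, axis=1)))]

# In high dimensions the output is usually either the original word
# or a far-away word, almost never a close-but-distinct neighbor.
counts = Counter(mechanism(0, eps=20.0) for _ in range(200))
print(counts.most_common(5))
```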
arXiv Detail & Related papers (2024-11-21T01:59:12Z) - Integrating Bidirectional Long Short-Term Memory with Subword Embedding for Authorship Attribution [2.3429306644730854]
A variety of word-based stylistic markers have been successfully used in deep learning methods to address the intrinsic problem of authorship attribution.
The proposed method was experimentally evaluated against numerous state-of-the-art methods on the public corpora CCAT50, IMDb62, Blog50, and Twitter50.
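For illustration, a minimal PyTorch sketch of a BiLSTM over subword IDs; the vocabulary size, dimensions, and mean-pooling are placeholder choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BiLSTMAuthorClassifier(nn.Module):
    """Illustrative BiLSTM over subword IDs for authorship attribution."""

    def __init__(self, vocab_size=8000, emb_dim=128, hidden=256, n_authors=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.classify = nn.Linear(2 * hidden, n_authors)

    def forward(self, subword_ids):            # (batch, seq_len)
        x = self.embed(subword_ids)            # (batch, seq_len, emb_dim)
        out, _ = self.bilstm(x)                # (batch, seq_len, 2*hidden)
        pooled = out.mean(dim=1)               # average over time steps
        return self.classify(pooled)           # author logits

model = BiLSTMAuthorClassifier()
logits = model(torch.randint(1, 8000, (4, 200)))  # 4 docs, 200 subwords each
print(logits.shape)  # torch.Size([4, 50])
```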
arXiv Detail & Related papers (2023-06-26T11:35:47Z) - Guiding Text-to-Text Privatization by Syntax [0.0]
Metric Differential Privacy is a generalization of differential privacy tailored to address the unique challenges of text-to-text privatization.
We analyze the capability of text-to-text privatization to preserve the grammatical category of words after substitution.
We transform the privatization step into a candidate selection problem in which substitutions are directed to words with matching grammatical properties.
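A minimal sketch of this candidate-selection idea, with a toy part-of-speech lexicon and stand-in embeddings in place of a real tagger and embedding model:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["run", "walk", "quickly", "slowly", "house", "building"]
pos = {"run": "VERB", "walk": "VERB", "quickly": "ADV",
       "slowly": "ADV", "house": "NOUN", "building": "NOUN"}  # toy tagger
E = rng.normal(size=(len(vocab), 50))  # stand-in embeddings

def privatize_pos_aware(word: str, eps: float) -> str:
    """Perturb the word vector, then decode only among candidates
    sharing the word's grammatical category."""
    u = rng.normal(size=E.shape[1])
    u /= np.linalg.norm(u)
    noisy = E[vocab.index(word)] + rng.gamma(E.shape[1], 1.0 / eps) * u
    candidates = [i for i, w in enumerate(vocab) if pos[w] == pos[word]]
    best = min(candidates, key=lambda i: np.linalg.norm(E[i] - noisy))
    return vocab[best]

print(privatize_pos_aware("run", eps=5.0))   # always a VERB substitute
```

Restricting the nearest-neighbor search to same-category candidates is what keeps the substituted sentence grammatical.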
arXiv Detail & Related papers (2023-06-02T11:52:21Z) - Driving Context into Text-to-Text Privatization [0.0]
$\textit{Metric}$ Differential Privacy enables text-to-text privatization by adding noise to the embedding vector of a word.
We demonstrate a substantial increase in classification accuracy of $6.05\%$.
arXiv Detail & Related papers (2023-06-02T11:33:06Z) - Keywords and Instances: A Hierarchical Contrastive Learning Framework Unifying Hybrid Granularities for Text Generation [59.01297461453444]
We propose a hierarchical contrastive learning mechanism that unifies semantic meaning at hybrid granularities in the input text.
Experiments demonstrate that our model outperforms competitive baselines on paraphrasing, dialogue generation, and storytelling tasks.
arXiv Detail & Related papers (2022-05-26T13:26:03Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs between paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the proposed neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
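A simplified sketch of the mask-and-predict idea using Hugging Face's `bert-base-uncased`: mask a shared word in each text and compare the predicted distributions at that position with a KL divergence. The full method aligns all words in the longest common sequence; this single-word version is only illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def masked_dist(text: str, target: str) -> torch.Tensor:
    """MLM distribution at the position of `target`, masked out."""
    ids = tok(text, return_tensors="pt")
    target_id = tok.convert_tokens_to_ids(target)
    pos = (ids["input_ids"][0] == target_id).nonzero()[0].item()
    ids["input_ids"][0, pos] = tok.mask_token_id
    with torch.no_grad():
        logits = mlm(**ids).logits[0, pos]
    return torch.softmax(logits, dim=-1)

def divergence(text_a: str, text_b: str, shared_word: str) -> float:
    """KL divergence between the two predicted distributions; larger
    values indicate a bigger semantic difference around the word."""
    p = masked_dist(text_a, shared_word)
    q = masked_dist(text_b, shared_word)
    return torch.sum(p * (p.log() - q.log())).item()

print(divergence("the movie was great", "the movie was awful", "movie"))
```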
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success and semantics-preserving rates while changing the smallest number of words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z) - Extending Multi-Sense Word Embedding to Phrases and Sentences for Unsupervised Semantic Applications [34.71597411512625]
We propose a novel embedding method for a text sequence (a phrase or a sentence) where each sequence is represented by a distinct set of codebook embeddings.
Our experiments show that the per-sentence codebook embeddings significantly improve performance on unsupervised sentence similarity and extractive summarization benchmarks.
arXiv Detail & Related papers (2021-03-29T04:54:28Z) - SST-BERT at SemEval-2020 Task 1: Semantic Shift Tracing by Clustering in BERT-based Embedding Spaces [63.17308641484404]
We propose to identify clusters among different occurrences of each target word, considering these as representatives of different word meanings.
Disagreements between the obtained clusters naturally allow us to quantify the level of semantic shift for each target word in four target languages.
Our approach performs well both when measured separately (per language) and overall, surpassing all provided SemEval baselines.
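A minimal sketch of this clustering idea, with random arrays standing in for contextual (BERT) embeddings of one target word's occurrences in two corpora; the Jensen-Shannon divergence between cluster (sense) distributions is one plausible shift score, not necessarily the paper's exact measure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Stand-ins for contextual embeddings of a target word's occurrences
# in two time periods; in practice these come from a BERT encoder.
occ_t1 = rng.normal(0.0, 1.0, size=(200, 768))
occ_t2 = rng.normal(0.5, 1.0, size=(200, 768))

def sense_distribution(occurrences, kmeans):
    labels = kmeans.predict(occurrences)
    return np.bincount(labels, minlength=kmeans.n_clusters) / len(labels)

# Cluster all occurrences jointly; each cluster ~ one word sense.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(
    np.vstack([occ_t1, occ_t2]))
p = sense_distribution(occ_t1, km)
q = sense_distribution(occ_t2, km)

# Jensen-Shannon divergence between the two sense distributions
# quantifies the degree of semantic shift.
def kl(a, b):
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

m = 0.5 * (p + q)
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
print(f"semantic shift score: {js:.3f}")
```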
arXiv Detail & Related papers (2020-10-02T08:38:40Z) - A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z) - Neural Syntactic Preordering for Controlled Paraphrase Generation [57.5316011554622]
Our work uses syntactic transformations to softly "reorder" the source sentence and guide our neural paraphrasing model.
First, given an input sentence, we derive a set of feasible syntactic rearrangements using an encoder-decoder model.
Next, we use each proposed rearrangement to produce a sequence of position embeddings, which encourages our final encoder-decoder paraphrase model to attend to the source words in a particular order.
arXiv Detail & Related papers (2020-05-05T09:02:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.