Driving Context into Text-to-Text Privatization
- URL: http://arxiv.org/abs/2306.01457v1
- Date: Fri, 2 Jun 2023 11:33:06 GMT
- Title: Driving Context into Text-to-Text Privatization
- Authors: Stefan Arnold, Dilara Yesilbas, Sven Weinzierl
- Abstract summary: \textit{Metric Differential Privacy} enables text-to-text privatization by adding noise to the vector of a word.
We demonstrate a substantial increase in classification accuracy by $6.05\%$.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: \textit{Metric Differential Privacy} enables text-to-text privatization by
adding calibrated noise to the vector of a word derived from an embedding space
and projecting this noisy vector back to a discrete vocabulary using a nearest
neighbor search. Since words are substituted without context, this mechanism is
expected to fall short at finding substitutes for words with ambiguous
meanings, such as \textit{'bank'}. To account for these ambiguous words, we
leverage a sense embedding and incorporate a sense disambiguation step prior to
noise injection. We accompany our modification to the privatization mechanism
with an estimation of privacy and utility. For word sense disambiguation on the
\textit{Words in Context} dataset, we demonstrate a substantial increase in
classification accuracy by $6.05\%$.
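A minimal sketch of the described pipeline in Python with NumPy, assuming illustrative inputs (`sense_vecs` as per-word sense embeddings, `context_vec` as a precomputed context representation) and a standard multivariate-Laplace-style sampler; this is our reading of the mechanism, not the authors' implementation:

```python
import numpy as np

def sample_metric_laplace(d, epsilon, rng):
    # Noise with density proportional to exp(-epsilon * ||z||):
    # a uniform direction on the sphere, scaled by a Gamma magnitude.
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    return direction * rng.gamma(shape=d, scale=1.0 / epsilon)

def privatize(word, context_vec, sense_vecs, vocab, vocab_vecs, epsilon, rng):
    # Disambiguate first: for an ambiguous word such as 'bank', pick the
    # sense vector that best matches the context before injecting noise.
    if word in sense_vecs:
        senses = sense_vecs[word]                     # (n_senses, d)
        vec = senses[np.argmax(senses @ context_vec)]
    else:
        vec = vocab_vecs[vocab.index(word)]
    noisy = vec + sample_metric_laplace(len(vec), epsilon, rng)
    # Project the noisy vector back to the discrete vocabulary.
    nearest = np.argmin(np.linalg.norm(vocab_vecs - noisy, axis=1))
    return vocab[nearest]
```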
Related papers
- $d_X$-Privacy for Text and the Curse of Dimensionality [4.372695214012181]
A widely used method to ensure privacy of unstructured text data is the multidimensional Laplace mechanism for $d_X$-privacy.
When applied on a word-by-word basis, the mechanism outputs either the original word or a completely dissimilar word, and only very rarely a semantically similar word.
We show that the dot product of the multidimensional Laplace noise vector with any word embedding plays a crucial role in designating the nearest neighbor.
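That role can be made explicit with a one-line expansion (our notation): for word vector $v$, noise $z$, and candidate $w$,
$$\|v + z - w\|^2 = \|v - w\|^2 + 2\langle z,\, v - w\rangle + \|z\|^2,$$
and since $\|z\|^2$ is shared by all candidates, the nearest neighbor is decided by $\|v - w\|^2 + 2\langle z, v - w\rangle$, i.e., by how the noise projects onto each candidate's direction.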
arXiv Detail & Related papers (2024-11-21T01:59:12Z)
- A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy [3.0177210416625124]
Several word-level \textit{Metric} Differential Privacy approaches have been proposed.
We devise a method whose composed privatized outputs have higher semantic coherence and variable length.
We evaluate our method in utility and privacy tests, which make a clear case for tokenization strategies beyond the word level.
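A hedged sketch of one tokenization strategy beyond the word level: greedily merge an assumed table of known collocations (e.g., frequent bigrams) into single units so the mechanism privatizes phrases rather than isolated words:

```python
def merge_collocations(tokens, collocations):
    # collocations: assumed set of (word, word) pairs mined beforehand.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            out.append(tokens[i] + "_" + tokens[i + 1])  # fuse the pair
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```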
arXiv Detail & Related papers (2024-06-30T09:37:34Z)
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
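The paper estimates this with large language models; as a toy, assumption-laden proxy, one can ask how much of a prosodic feature (say, F0) is linearly predictable from text representations:

```python
import numpy as np

def predictability_r2(text_feats, prosodic_feat, l2=1e-2):
    # Ridge-regress a prosodic feature on text features; high R^2 is a
    # crude stand-in for high text/prosody redundancy.
    X = np.hstack([text_feats, np.ones((len(text_feats), 1))])
    w = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ prosodic_feat)
    residual = prosodic_feat - X @ w
    return 1.0 - residual.var() / prosodic_feat.var()
```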
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- Guiding Text-to-Text Privatization by Syntax [0.0]
Metric Differential Privacy is a generalization of differential privacy tailored to address the unique challenges of text-to-text privatization.
We analyze the capability of text-to-text privatization to preserve the grammatical category of words after substitution.
We transform the privatization step into a candidate selection problem in which substitutions are directed to words with matching grammatical properties.
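A minimal sketch of that candidate-selection step, reusing `sample_metric_laplace` from the sketch above and an assumed precomputed `pos_of` word-to-tag mapping:

```python
import numpy as np

def privatize_pos_aware(word, vocab, vectors, pos_of, epsilon, rng):
    noisy = vectors[vocab.index(word)] + sample_metric_laplace(
        vectors.shape[1], epsilon, rng)
    # Only words sharing the original word's grammatical category compete.
    candidates = [i for i, w in enumerate(vocab) if pos_of[w] == pos_of[word]]
    best = min(candidates, key=lambda i: np.linalg.norm(vectors[i] - noisy))
    return vocab[best]
```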
arXiv Detail & Related papers (2023-06-02T11:52:21Z)
- Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora [54.757845511368814]
The problem of comparing two bodies of text and searching for words that differ in their usage arises often in digital humanities and computational social science.
This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large.
We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word.
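A minimal sketch of such a neighbor-based comparison, assuming two row-normalized embedding matrices over a shared vocabulary:

```python
import numpy as np

def usage_change_scores(vocab, emb_a, emb_b, k=50):
    # Low overlap between a word's k nearest neighbors in the two
    # corpora suggests its usage differs between them.
    sim_a, sim_b = emb_a @ emb_a.T, emb_b @ emb_b.T
    scores = {}
    for i, word in enumerate(vocab):
        nn_a = set(np.argsort(-sim_a[i])[1:k + 1])  # index 0 is the word itself
        nn_b = set(np.argsort(-sim_b[i])[1:k + 1])
        scores[word] = len(nn_a & nn_b) / k
    return scores
```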
arXiv Detail & Related papers (2021-12-28T23:46:00Z)
- An Iterative Contextualization Algorithm with Second-Order Attention [0.40611352512781856]
We show how to combine the representations of words that make up a sentence into a cohesive whole.
Our algorithm starts with a presumably erroneous value of the context, and adjusts this value with respect to the tokens at hand.
Our models report strong results in several well-known text classification tasks.
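A loose sketch of that iterate-and-adjust loop (our first-order simplification, not the paper's second-order attention):

```python
import numpy as np

def iterative_context(token_vecs, n_iters=5):
    # Start from a crude context (the token mean), then repeatedly
    # re-weight tokens by their agreement with the current context.
    ctx = token_vecs.mean(axis=0)
    for _ in range(n_iters):
        logits = token_vecs @ ctx
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        ctx = weights @ token_vecs
    return ctx
```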
arXiv Detail & Related papers (2021-03-03T05:34:50Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel hyperplane-based approach for the automatic extraction of domain-specific words.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
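A cheap linear stand-in for the hyperplane idea (difference of class means over bag-of-words features rather than the paper's actual classifier), dropping the lowest-weight terms as stop words:

```python
import numpy as np

def hyperplane_stopwords(X_domain, X_general, vocab, drop_ratio=0.9):
    # Terms with small |weight| along the separating direction carry
    # little domain signal and are treated as stop words.
    w = X_domain.mean(axis=0) - X_general.mean(axis=0)
    order = np.argsort(np.abs(w))            # ascending |weight|
    n_drop = int(len(vocab) * drop_ratio)
    return [vocab[i] for i in order[:n_drop]]
```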
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
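In our notation, that operationalization presumably amounts to
$$H(w) \;=\; -\sum_{s \in \mathcal{S}(w)} p(s \mid w)\, \log_2 p(s \mid w),$$
so a word like 'bank' with two equally likely senses carries one bit of ambiguity.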
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
- Neural Syntactic Preordering for Controlled Paraphrase Generation [57.5316011554622]
Our work uses syntactic transformations to softly "reorder" the source sentence and guide our neural paraphrasing model.
First, given an input sentence, we derive a set of feasible syntactic rearrangements using an encoder-decoder model.
Next, we use each proposed rearrangement to produce a sequence of position embeddings, which encourages our final encoder-decoder paraphrase model to attend to the source words in a particular order.
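A rough sketch of such reordered position embeddings under standard sinusoidal assumptions (`order` is a proposed permutation of source positions; the details are ours, not the paper's):

```python
import numpy as np

def reordered_position_embeddings(order, d_model):
    # order[slot] = index of the source token placed at that slot; each
    # token receives the sinusoidal embedding of its rearranged slot.
    pe = np.zeros((len(order), d_model))
    for slot, tok in enumerate(order):
        for i in range(0, d_model, 2):
            angle = slot / (10000 ** (i / d_model))
            pe[tok, i] = np.sin(angle)
            if i + 1 < d_model:
                pe[tok, i + 1] = np.cos(angle)
    return pe
```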
arXiv Detail & Related papers (2020-05-05T09:02:25Z)
- Towards Semantic Noise Cleansing of Categorical Data based on Semantic Infusion [4.825584239754082]
We formalize semantic noise as a sequence of terms that do not contribute to the narrative of the text.
We present a novel Semantic Infusion technique to associate meta-data with the categorical corpus text.
We propose an unsupervised text-preprocessing framework to filter the semantic noise using the context of the terms.
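One assumption-laden way to filter by context: score each term against the document centroid and drop the outliers (a proxy for "not contributing to the narrative", not the paper's exact rule):

```python
import numpy as np

def filter_semantic_noise(tokens, vectors, threshold=0.2):
    # Terms whose embeddings point away from the document's overall
    # direction are treated as semantic noise and removed.
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    centroid = V.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    keep = (V @ centroid) >= threshold
    return [t for t, k in zip(tokens, keep) if k]
```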
arXiv Detail & Related papers (2020-02-06T13:11:46Z)