Bootstrapping Text Anonymization Models with Distant Supervision
- URL: http://arxiv.org/abs/2205.06895v1
- Date: Fri, 13 May 2022 21:10:14 GMT
- Title: Bootstrapping Text Anonymization Models with Distant Supervision
- Authors: Anthi Papadopoulou, Pierre Lison, Lilja Øvrelid, Ildikó Pilán
- Abstract summary: We propose a novel method to bootstrap text anonymization models based on distant supervision.
Instead of requiring manually labeled training data, the approach relies on a knowledge graph expressing the background information assumed to be publicly available.
- Score: 2.121963121603413
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel method to bootstrap text anonymization models based on
distant supervision. Instead of requiring manually labeled training data, the
approach relies on a knowledge graph expressing the background information
assumed to be publicly available about various individuals. This knowledge
graph is employed to automatically annotate text documents including personal
data about a subset of those individuals. More precisely, the method determines
which text spans ought to be masked in order to guarantee $k$-anonymity,
assuming an adversary with access to both the text documents and the background
information expressed in the knowledge graph. The resulting collection of
labeled documents is then used as training data to fine-tune a pre-trained
language model for text anonymization. We illustrate this approach using a
knowledge graph extracted from Wikidata and short biographical texts from
Wikipedia. Evaluation results with a RoBERTa-based model and a manually
annotated collection of 553 summaries showcase the potential of the approach,
but also unveil a number of issues that may arise if the knowledge graph is
noisy or incomplete. The results also illustrate that, contrary to most
sequence labeling problems, the text anonymization task may admit several
alternative solutions.
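To make the labelling step concrete, here is a minimal sketch of the k-anonymity reasoning the abstract describes, assuming a toy population whose public attribute sets stand in for a real knowledge graph; the function name and the greedy keep-or-mask pass are illustrative choices, not the paper's actual algorithm.

```python
# Hypothetical simplification of the distant-supervision labelling step:
# decide which attribute mentions in a document must be masked so that the
# person it describes matches at least k individuals in the background data.
def spans_to_mask(doc_attrs, background, k=2):
    """doc_attrs: attribute values mentioned in the document, in text order.
    background: dict mapping each individual to their public attribute set.
    Returns the mentions that must be masked to preserve k-anonymity."""
    kept, masked = [], []
    for attr in doc_attrs:                  # greedy: try to keep each mention
        candidate = kept + [attr]
        # individuals whose public profile matches every kept attribute
        matching = [p for p, attrs in background.items()
                    if all(a in attrs for a in candidate)]
        if len(matching) >= k:
            kept.append(attr)               # still k-anonymous: leave in clear
        else:
            masked.append(attr)             # would single the person out: mask
    return masked

background = {
    "person_A": {"physicist", "born 1921", "Oslo"},
    "person_B": {"physicist", "born 1921", "Bergen"},
    "person_C": {"historian", "born 1950", "Oslo"},
}
# A document about person_A mentioning these attributes only needs "Oslo"
# masked: "physicist" plus "born 1921" still matches two people.
print(spans_to_mask(["physicist", "born 1921", "Oslo"], background, k=2))
```

In the paper itself the adversary model, the alignment of text spans to knowledge-graph properties, and the masking decisions are considerably richer; the sketch only shows why a span is masked exactly when keeping it would shrink the matching set below k.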
Related papers
- GuideWalk: A Novel Graph-Based Word Embedding for Enhanced Text Classification [0.0]
The processing of text data requires embedding, a method of translating the content of the text into numeric vectors.
A new text embedding approach, namely the Guided Transition Probability Matrix (GTPM) model, is proposed.
The proposed method is tested on real-world data sets against eight well-known and successful embedding algorithms.
arXiv Detail & Related papers (2024-04-25T18:48:11Z)
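As a rough illustration of the transition-probability family that GTPM belongs to, the sketch below builds word co-occurrence counts and row-normalises them into a transition matrix whose rows serve as graph-based word vectors; the paper's actual "guided" construction is more elaborate, and everything here is a generic stand-in.

```python
import numpy as np

def transition_matrix_embeddings(sentences, window=2):
    """Build a word-graph transition matrix from co-occurrence counts."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:                  # count neighbours in the window
                    counts[idx[w], idx[s[j]]] += 1
    # row-normalise into transition probabilities; each row is a word vector
    rows = counts.sum(axis=1, keepdims=True)
    return vocab, counts / np.maximum(rows, 1)

vocab, vectors = transition_matrix_embeddings(
    [["the", "cat", "sat"], ["the", "dog", "sat"]])
```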
- Unsupervised Learning of Graph from Recipes [8.410402833223364]
We propose a model to identify relevant information from recipes and generate a graph to represent the sequence of actions in the recipe.
We iteratively learn the graph structure and the parameters of a GNN encoding the texts (text-to-graph), one sequence at a time.
We evaluate the approach by comparing the identified entities with annotated datasets, comparing the difference between the input and output texts, and comparing our generated graphs with those generated by state of the art methods.
arXiv Detail & Related papers (2024-01-22T16:25:47Z)
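The summary's key idea, learning the graph structure together with the GNN that uses it, can be sketched by holding a soft adjacency matrix as a trainable parameter; the module below is a generic stand-in, not the paper's encoder, loss, or training loop.

```python
import torch
import torch.nn as nn

class LatentGraphGNN(nn.Module):
    """One message-passing step over a learnable soft adjacency matrix."""
    def __init__(self, n_nodes, dim):
        super().__init__()
        # the graph structure itself is a parameter, updated by backprop
        self.adj_logits = nn.Parameter(torch.zeros(n_nodes, n_nodes))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (n_nodes, dim) features
        adj = torch.softmax(self.adj_logits, dim=-1)
        return torch.relu(self.proj(adj @ x))

model = LatentGraphGNN(n_nodes=5, dim=16)
steps = torch.randn(5, 16)                      # e.g. encoded recipe steps
out = model(steps)                              # any downstream loss now trains
                                                # structure and weights jointly
```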
- Answer Candidate Type Selection: Text-to-Text Language Model for Closed Book Question Answering Meets Knowledge Graphs [62.20354845651949]
We present a novel approach that works on top of a pre-trained Text-to-Text QA system to address this issue.
Our simple yet effective method performs filtering and re-ranking of generated candidates based on their types derived from Wikidata "instance_of" property.
arXiv Detail & Related papers (2023-10-10T20:49:43Z)
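The type-based filtering described above is simple enough to sketch. Below, a lookup table stands in for real Wikidata "instance_of" queries, and the re-ranking rule is a plausible simplification rather than the authors' exact method.

```python
# Hypothetical sketch: move candidates whose type matches the expected answer
# type to the front of the generator's ranking.
def rerank_by_type(candidates, expected_type, instance_of):
    """candidates: answers ordered by generator score, best first.
    instance_of: entity -> set of Wikidata 'instance_of' types."""
    matches = [c for c in candidates if expected_type in instance_of.get(c, set())]
    rest = [c for c in candidates if c not in matches]
    return matches + rest

instance_of = {"Paris": {"city"}, "France": {"country"}, "Seine": {"river"}}
print(rerank_by_type(["France", "Paris", "Seine"], "city", instance_of))
# -> ['Paris', 'France', 'Seine']
```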
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
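A small sketch of the input interface the summary describes, with hypothetical types: one global prompt for the whole scene plus a segmentation map whose regions each carry an open-vocabulary description.

```python
from dataclasses import dataclass

@dataclass
class Region:
    mask: list[list[int]]   # binary segmentation mask for this region
    prompt: str             # open-vocabulary description of its content

@dataclass
class SpaTextInput:
    global_prompt: str      # describes the entire scene
    regions: list[Region]   # per-region local prompts

example = SpaTextInput(
    global_prompt="a sunny park",
    regions=[Region(mask=[[1, 0], [0, 0]], prompt="a red kite"),
             Region(mask=[[0, 0], [0, 1]], prompt="a sleeping dog")],
)
```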
- PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
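A generic contrastive objective in the spirit of the summary, where two texts by the same author form a positive pair; this is a standard InfoNCE-style loss, not PART's actual architecture or training setup.

```python
import torch
import torch.nn.functional as F

def authorship_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """emb_a[i] and emb_b[i] embed two different texts by the same author i."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temperature      # similarity of every pair
    targets = torch.arange(len(emb_a))          # matching rows are positives
    return F.cross_entropy(logits, targets)

# toy usage with random 128-d embeddings for 8 authors
loss = authorship_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```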
- Towards Multimodal Vision-Language Models Generating Non-Generic Text [2.102846336724103]
Vision-language models can assess visual context in an image and generate descriptive text.
Recent work has used optical character recognition to supplement visual information with text extracted from an image.
In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image but is not used by current models.
arXiv Detail & Related papers (2022-07-09T01:56:35Z)
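A minimal sketch of supplementing visual input with OCR text, as the summary describes; pytesseract is a real OCR wrapper, while caption_model is a hypothetical placeholder for any captioning interface.

```python
from PIL import Image
import pytesseract

def caption_with_ocr(image_path, caption_model):
    image = Image.open(image_path)
    ocr_text = pytesseract.image_to_string(image)   # text visible in the image
    # pass both the pixels and the extracted text to the captioner
    return caption_model(image, extra_context=ocr_text.strip())
```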
- Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
arXiv Detail & Related papers (2021-09-06T03:38:52Z)
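The pseudo-labelling step lends itself to a toy sketch: keep a detected region only when its detector label matches a concept parsed from the paired caption. The inputs and matching rule below are simplifications of the paper's procedure.

```python
def pseudo_label(detections, caption_concepts):
    """detections: list of (region_box, detector_label) pairs.
    caption_concepts: object words parsed from the paired sentence."""
    labels = []
    for box, label in detections:
        if label in caption_concepts:       # detector and caption agree
            labels.append((box, label))     # keep as a pseudo-labelled node
    return labels

dets = [((0, 0, 50, 50), "dog"), ((60, 0, 90, 40), "frisbee"),
        ((10, 60, 30, 80), "tree")]
print(pseudo_label(dets, {"dog", "frisbee"}))   # "tree" is discarded
```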
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks that learn over raw text with guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)
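A hedged sketch of an entity masking scheme of the kind the summary mentions: mask the full span of an entity mention rather than random subword tokens, so the model must recover the entity from context. Tokenisation and spans are toy inputs, not the paper's pipeline.

```python
def mask_entities(tokens, entity_spans, mask_token="[MASK]"):
    """entity_spans: (start, end) token indices of entity mentions."""
    out = list(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            out[i] = mask_token             # mask the whole mention at once
    return out

tokens = ["Marie", "Curie", "won", "the", "Nobel", "Prize"]
print(mask_entities(tokens, [(0, 2)]))
# -> ['[MASK]', '[MASK]', 'won', 'the', 'Nobel', 'Prize']
```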
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
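To pin down the task's input/output contract, here is an invented instance following the description above; the records and texts are fabricated purely for illustration.

```python
# Input aspect 1: structured records to be described.
source_records = [
    {"player": "Smith", "points": 28, "rebounds": 7},
    {"player": "Jones", "points": 12, "rebounds": 11},
]
# Input aspect 2: a reference text describing a *different* record set,
# supplying only the writing style.
reference_text = ("Lee poured in 30 points and grabbed 9 boards, "
                  "while Park added 15 points and 12 rebounds.")
# Desired output: the source records rendered in the reference's style.
target_text = ("Smith poured in 28 points and grabbed 7 boards, "
               "while Jones added 12 points and 11 rebounds.")
```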