Text Embeddings Reveal (Almost) As Much As Text
- URL: http://arxiv.org/abs/2310.06816v1
- Date: Tue, 10 Oct 2023 17:39:03 GMT
- Title: Text Embeddings Reveal (Almost) As Much As Text
- Authors: John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, Alexander M. Rush
- Abstract summary: We investigate the problem of embedding inversion: reconstructing the full text represented in dense text embeddings.
We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly.
- Score: 86.5822042193058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How much private information do text embeddings reveal about the
original text? We investigate the problem of embedding inversion:
reconstructing the full text represented in dense text embeddings. We frame
the problem as controlled generation: generating text that, when re-embedded,
is close to a fixed point in latent space. We find that although a naïve model
conditioned on the embedding performs poorly, a multi-step method that
iteratively corrects and re-embeds text is able to recover 92% of 32-token
text inputs exactly. We train our model to decode text embeddings from two
state-of-the-art embedding models, and also show that our model can recover
important personal information (full names) from a dataset of clinical notes.
Our code is available on GitHub: https://github.com/jxmorris12/vec2text
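As a minimal illustration of the multi-step method described above, the Python
sketch below implements an iterative correction loop: embed a hypothesis,
compare it to the target embedding, and ask a correction model for a better
hypothesis. This is only a sketch of the general idea, not the authors'
implementation (see the vec2text repository for that); `embed`,
`zero_step_model`, and `corrector` are hypothetical stand-ins for a frozen
black-box embedder, a model producing an initial guess from the embedding
alone, and a trained correction model.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def invert_embedding(target_emb, embed, zero_step_model, corrector, steps=50):
    """Recover text whose embedding lies close to `target_emb`.

    Hypothetical interfaces (assumptions, not the vec2text API):
      embed(text) -> np.ndarray                     # frozen black-box embedder
      zero_step_model(target_emb) -> str            # one-shot guess from embedding
      corrector(text, text_emb, target_emb) -> str  # corrected hypothesis
    """
    # Naive one-shot inversion: condition only on the target embedding.
    hypothesis = zero_step_model(target_emb)
    best_text = hypothesis
    best_sim = cosine(embed(hypothesis), target_emb)

    for _ in range(steps):
        # Re-embed the current guess, then let the corrector close the gap
        # between the hypothesis embedding and the target embedding.
        hyp_emb = embed(hypothesis)
        hypothesis = corrector(hypothesis, hyp_emb, target_emb)

        sim = cosine(embed(hypothesis), target_emb)
        if sim > best_sim:
            best_text, best_sim = hypothesis, sim

    return best_text, best_sim
```

Each iteration mirrors the paper's controlled-generation framing: the current
hypothesis is re-embedded and nudged toward the fixed target point in latent
space, and the loop returns the closest match found.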
Related papers
- TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering [118.30923824681642]
TextDiffuser-2 aims to unleash the power of language models for text rendering.
We utilize the language model within the diffusion model to encode text position and content at the line level.
We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V.
arXiv Detail & Related papers (2023-11-28T04:02:40Z)
- TOPFORMER: Topology-Aware Authorship Attribution of Deepfake Texts with Diverse Writing Styles [14.205559299967423]
Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended, high-quality texts that are non-trivial to distinguish from human-written texts.
Users with malicious intent can easily use these open-sourced LLMs to generate harmful texts and dis/misinformation at scale.
To mitigate this problem, a computational method for determining whether a given text is a deepfake is needed.
We propose TopFormer to improve existing AA solutions by capturing more linguistic patterns in deepfake texts.
arXiv Detail & Related papers (2023-09-22T15:32:49Z)
- Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection; a minimal sketch of this idea appears after the related-papers list below.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z)
- TextDiffuser: Diffusion Models as Text Painters [118.30923824681642]
We introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds.
We contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs.
We show that TextDiffuser is flexible and controllable: it creates high-quality text images from text prompts alone or together with text template images, and performs text inpainting to reconstruct incomplete images containing text.
arXiv Detail & Related papers (2023-05-18T10:16:19Z)
- Med-EASi: Finely Annotated Dataset and Models for Controllable Simplification of Medical Texts [32.57058284812338]
Automatic medical text simplification can assist providers with patient-friendly communication and make medical texts more accessible.
We present Med-EASi (Medical dataset for Elaborative and Abstractive Simplification).
Our results show that our fine-grained annotations improve learning compared to the unannotated baseline.
arXiv Detail & Related papers (2023-02-17T21:50:13Z)
- CORE-Text: Improving Scene Text Detection with Contrastive Relational Reasoning [65.57338873921168]
Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision.
In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, the COntrastive RElation (CORE) module.
We integrate the CORE module into a two-stage Mask R-CNN text detector and devise our text detector, CORE-Text.
arXiv Detail & Related papers (2021-12-14T16:22:25Z)
- Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach [34.63444886780274]
Text segmentation is a prerequisite in real-world text-related tasks.
We introduce the Text Refinement Network (TexRNet), a novel text segmentation approach.
TexRNet consistently improves text segmentation performance by nearly 2% compared to other state-of-the-art segmentation methods.
arXiv Detail & Related papers (2020-11-27T22:50:09Z)
- All you need is a second look: Towards Tighter Arbitrary shape text detection [80.85188469964346]
Long, curved text instances tend to be fragmented because of the limited receptive field size of CNNs.
Simple representations using rectangle or quadrangle bounding boxes fall short when dealing with more challenging arbitrary-shaped texts.
NASK reconstructs text instances with a tighter representation using the predicted geometrical attributes.
arXiv Detail & Related papers (2020-04-26T17:03:41Z)
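As referenced in the Copy Is All You Need entry above, here is a minimal
sketch of generation by progressively copying segments from an existing text
collection. It is an illustration under stated assumptions, not that paper's
implementation: `encode_prefix` and `phrase_index` are hypothetical stand-ins
for a prefix encoder and a precomputed index of (segment, embedding) pairs.

```python
import numpy as np

def copy_generate(prompt, encode_prefix, phrase_index, max_segments=20):
    """Generate text by repeatedly copying the best-matching stored segment.

    Hypothetical interfaces (assumptions, not the paper's components):
      encode_prefix(text) -> np.ndarray  # embeds the current prefix
      phrase_index: list of (segment_text, segment_embedding) pairs
                    precomputed from an existing text collection
    """
    text = prompt
    for _ in range(max_segments):
        query = encode_prefix(text)
        # Score every stored segment against the prefix and copy the best one.
        scores = [float(query @ emb) for _, emb in phrase_index]
        segment = phrase_index[int(np.argmax(scores))][0]
        text += " " + segment
        if segment.endswith("."):  # crude stopping heuristic for this sketch
            break
    return text
```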
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.