InvBERT: Text Reconstruction from Contextualized Embeddings used for
Derived Text Formats of Literary Works
- URL: http://arxiv.org/abs/2109.10104v1
- Date: Tue, 21 Sep 2021 11:35:41 GMT
- Title: InvBERT: Text Reconstruction from Contextualized Embeddings used for
Derived Text Formats of Literary Works
- Authors: Johannes Höhmann, Achim Rettinger, and Kai Kugler
- Abstract summary: Digital Humanities and Computational Literary Studies apply text mining methods to investigate literature.
Due to copyright restrictions, the availability of relevant digitized literary works is limited.
Our attempts to invert BERT suggest that publishing parts of the encoder together with the contextualized embeddings is critical.
- Score: 1.6058099298620423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Digital Humanities and Computational Literary Studies apply text mining
methods to investigate literature. Such automated approaches enable
quantitative studies on large corpora which would not be feasible by manual
inspection alone. However, due to copyright restrictions, the availability of
relevant digitized literary works is limited. Derived Text Formats (DTFs) have
been proposed as a solution. Here, textual materials are transformed in such a
way that copyright-critical features are removed, but that the use of certain
analytical methods remains possible. Contextualized word embeddings produced by
transformer-encoders (like BERT) are promising candidates for DTFs because they
allow for state-of-the-art performance on various analytical tasks and, at
first sight, do not disclose the original text. However, in this paper we
demonstrate that under certain conditions the reconstruction of the original
copyrighted text becomes feasible and its publication in the form of
contextualized word representations is not safe. Our attempts to invert BERT
suggest that publishing parts of the encoder together with the contextualized
embeddings is critical, since it allows generating data to train a decoder
with a reconstruction accuracy sufficient to violate copyright laws.
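For intuition, the following is a minimal sketch of the inversion setting described in the abstract, not the paper's InvBERT implementation. It assumes the attacker can run the published encoder (here, the full bert-base-uncased model via Hugging Face Transformers) on public text of their choosing, trains a simple per-token linear decoder on the resulting (embedding, token) pairs, and then applies that decoder to published contextualized embeddings of a protected work. The corpus, decoder architecture, and hyperparameters are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' InvBERT code): train a decoder
# that maps contextualized BERT embeddings back to token ids.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased").eval()

# Hypothetical public text controlled by the attacker; a real attack would
# use a large corpus, ideally from the same domain as the protected works.
public_text = [
    "it was a dark and stormy night",
    "the quick brown fox jumps over the lazy dog",
]

# Simple per-token decoder: contextualized hidden state -> vocabulary logits.
decoder = nn.Linear(encoder.config.hidden_size, tokenizer.vocab_size)
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for sentence in public_text:
        batch = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**batch).last_hidden_state   # (1, seq_len, 768)
        logits = decoder(hidden)                          # (1, seq_len, vocab)
        loss = loss_fn(logits.view(-1, logits.size(-1)),
                       batch["input_ids"].view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def reconstruct(published_embeddings):
    """Recover a token sequence from published contextualized embeddings
    of shape (seq_len, hidden_size), e.g. a DTF of a copyrighted text."""
    token_ids = decoder(published_embeddings).argmax(dim=-1)
    return tokenizer.decode(token_ids)
```

The paper itself examines the harder case where only parts of the encoder are published; the sketch merely illustrates why access to an encoder that produces matching embeddings gives an attacker free supervision for training a reconstruction decoder.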
Related papers
- TextDestroyer: A Training- and Annotation-Free Diffusion Method for Destroying Anomal Text from Images [84.08181780666698]
TextDestroyer is the first training- and annotation-free method for scene text destruction.
Our method scrambles text areas in the latent start code using a Gaussian distribution before reconstruction.
The advantages of TextDestroyer include: (1) it eliminates labor-intensive data annotation and resource-intensive training; (2) it achieves more thorough text destruction, preventing recognizable traces; and (3) it demonstrates better generalization capabilities, performing well on both real-world scenes and generated images.
arXiv Detail & Related papers (2024-11-01T04:41:00Z)
- Are Paraphrases Generated by Large Language Models Invertible? [4.148732457277201]
We consider the problem of paraphrase inversion: given a paraphrased document, attempt to recover the original text.
We fine-tune paraphrase inversion models, both with and without additional author-specific context.
We show that, when starting from paraphrased machine-generated text, we can recover significant portions of the document using a learned inversion model.
arXiv Detail & Related papers (2024-10-29T00:46:24Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- Copyright Violations and Large Language Models [10.251605253237491]
This work explores the issue of copyright violations and large language models through the lens of verbatim memorization.
We present experiments with a range of language models over a collection of popular books and coding problems.
Overall, this research highlights the need for further examination and the potential impact on future developments in natural language processing to ensure adherence to copyright regulations.
arXiv Detail & Related papers (2023-10-20T19:14:59Z)
- MAGE: Machine-generated Text Detection in the Wild [82.70561073277801]
Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection.
We build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs.
Despite challenges, the top-performing detector can identify 86.54% of out-of-domain texts generated by a new LLM, indicating that practical application is feasible.
arXiv Detail & Related papers (2023-05-22T17:13:29Z)
- Synthetically generated text for supervised text analysis [5.71097144710995]
I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, and a simple technique for improving the quality of synthetic text.
I demonstrate the usefulness of synthetic text with three applications: generating synthetic tweets describing the fighting in Ukraine, synthetic news articles describing specified political events for training an event detection system, and a multilingual corpus of populist manifesto statements for training a sentence-level populism classifier.
arXiv Detail & Related papers (2023-03-28T14:55:13Z)
- Can AI-Generated Text be Reliably Detected? [54.670136179857344]
Unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc.
Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques.
In this paper, we show that these detectors are not reliable in practical scenarios.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications [0.02578242050187029]
This paper presents two datasets comprised of artificially generated research content.
In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers.
The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model.
We evaluate the quality of the datasets comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2022-02-04T08:16:56Z)
- Tracing Text Provenance via Context-Aware Lexical Substitution [81.49359106648735]
We propose a natural language watermarking scheme based on context-aware lexical substitution.
Under both objective and subjective metrics, our watermarking scheme can well preserve the semantic integrity of original sentences.
arXiv Detail & Related papers (2021-12-15T04:27:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.