Fingerprinting Fine-tuned Language Models in the Wild
- URL: http://arxiv.org/abs/2106.01703v1
- Date: Thu, 3 Jun 2021 09:07:54 GMT
- Title: Fingerprinting Fine-tuned Language Models in the Wild
- Authors: Nirav Diwan, Tanmoy Chakravorty, Zubair Shafiq
- Abstract summary: We study the problem of large-scale fingerprinting of fine-tuned LMs in the wild.
Our results show that fine-tuning itself is the most effective in attributing the synthetic text generated by fine-tuned LMs.
- Score: 6.7034293304862755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There are concerns that the ability of language models (LMs) to generate high
quality synthetic text can be misused to launch spam, disinformation, or
propaganda. Therefore, the research community is actively working on developing
approaches to detect whether a given text is organic or synthetic. While this
is a useful first step, it is important to be able to further fingerprint the
author LM to attribute its origin. Prior work on fingerprinting LMs is limited
to attributing synthetic text generated by a handful (usually < 10) of
pre-trained LMs. However, LMs such as GPT2 are commonly fine-tuned in a myriad
of ways (e.g., on a domain-specific text corpus) before being used to generate
synthetic text. It is challenging to fingerprinting fine-tuned LMs because the
universe of fine-tuned LMs is much larger in realistic scenarios. To address
this challenge, we study the problem of large-scale fingerprinting of
fine-tuned LMs in the wild. Using a real-world dataset of synthetic text
generated by 108 different fine-tuned LMs, we conduct comprehensive experiments
to demonstrate the limitations of existing fingerprinting approaches. Our
results show that fine-tuning itself is the most effective in attributing the
synthetic text generated by fine-tuned LMs.
Related papers
- Theoretical Proof that Generated Text in the Corpus Leads to the Collapse of Auto-regressive Language Models [26.117724170912552]
This paper presents theoretical proof that once a corpus (such as the World Wide Web) begins to incorporate generated text, LM collapse is bound to occur.
We express our concerns about the current situation in which an increasing amount of generated text may be used in LM training.
arXiv Detail & Related papers (2024-12-19T14:11:15Z) - Robustness of LLMs to Perturbations in Text [2.0670689746336]
Large language models (LLMs) have shown impressive performance, but can they handle the inevitable noise in real-world data?
This work tackles this critical question by investigating LLMs' resilience against morphological variations in text.
Our findings show that contrary to popular beliefs, generative LLMs are quiet robust to noisy perturbations in text.
arXiv Detail & Related papers (2024-07-12T04:50:17Z) - Differentially Private Synthetic Data via Foundation Model APIs 2: Text [56.13240830670327]
A lot of high-quality text data generated in the real world is private and cannot be shared or used freely due to privacy concerns.
We propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text.
Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines.
arXiv Detail & Related papers (2024-03-04T05:57:50Z) - Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z) - Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors [54.80516786370663]
FreeReal is a real-domain-aligned pre-training paradigm that enables the complementary strengths of LSD and real data.
GlyphMix embeds synthetic images as graffiti-like units onto real images.
FreeReal consistently outperforms previous pre-training methods by a substantial margin across four public datasets.
arXiv Detail & Related papers (2023-12-08T15:10:55Z) - LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the objective LM, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z) - Synthetic Pre-Training Tasks for Neural Machine Translation [16.6378815054841]
Our goal is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources.
We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge.
Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data.
arXiv Detail & Related papers (2022-12-19T21:34:00Z) - Contrastive Decoding: Open-ended Text Generation as Optimization [153.35961722855686]
We propose contrastive decoding (CD), a reliable decoding approach.
It is inspired by the fact that the failures of larger LMs are even more prevalent in smaller LMs.
CD requires zero additional training, and produces higher quality text than decoding from the larger LM alone.
arXiv Detail & Related papers (2022-10-27T00:58:21Z) - Factuality Enhanced Language Models for Open-Ended Text Generation [60.27166549575472]
We design the FactualityPrompts test set and metrics to measure the factuality of LM generations.
We find that larger LMs are more factual than smaller ones, although a previous study suggests that larger LMs can be less truthful in terms of misconceptions.
We propose a factuality-enhanced training method that uses TopicPrefix for better awareness of facts and sentence completion.
arXiv Detail & Related papers (2022-06-09T17:16:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.