Panza: Design and Analysis of a Fully-Local Personalized Text Writing Assistant
- URL: http://arxiv.org/abs/2407.10994v4
- Date: Mon, 10 Feb 2025 15:08:07 GMT
- Title: Panza: Design and Analysis of a Fully-Local Personalized Text Writing Assistant
- Authors: Armand Nicolicioiu, Eugenia Iofinova, Andrej Jovanovic, Eldar Kurtic, Mahdi Nikdan, Andrei Panferov, Ilia Markov, Nir Shavit, Dan Alistarh,
- Abstract summary: We present a new design and evaluation for such an automated assistant, which we call Panza.
Panza's personalization features are based on a combination of fine-tuning using a variant of the Reverse Instructions technique together with Retrieval-Augmented Generation.
We demonstrate that this combination allows us to fine-tune an LLM to reflect a user's writing style using limited data, while executing on extremely limited resources.
- Score: 28.752596543740225
- License:
- Abstract: The availability of powerful open-source large language models (LLMs) opens exciting use-cases, such as using personal data to fine-tune these models to imitate a user's unique writing style. Two key requirements for such assistants are personalization - in the sense that the assistant should recognizably reflect the user's own writing style - and privacy - users may justifiably be wary of uploading extremely personal data, such as their email archive, to a third-party service. In this paper, we present a new design and evaluation for such an automated assistant, for the specific use case of email generation, which we call Panza. Panza's personalization features are based on a combination of fine-tuning using a variant of the Reverse Instructions technique together with Retrieval-Augmented Generation (RAG). We demonstrate that this combination allows us to fine-tune an LLM to reflect a user's writing style using limited data, while executing on extremely limited resources, e.g. on a free Google Colab instance. Our key methodological contribution is the first detailed study of evaluation metrics for this personalized writing task, and of how different choices of system components--the use of RAG and of different fine-tuning approaches-impact the system's performance. Additionally, we demonstrate that very little data - under 100 email samples - are sufficient to create models that convincingly imitate humans. This finding showcases a previously-unknown attack vector in language models - that access to a small number of writing samples can allow a bad actor to cheaply create generative models that imitate a target's writing style. We are releasing the full Panza code as well as three new email datasets licensed for research use at https://github.com/IST-DASLab/PanzaMail.
Related papers
- Capturing Style in Author and Document Representation [4.323709559692927]
We propose a new architecture that learns embeddings for both authors and documents with a stylistic constraint.
We evaluate our method on three datasets: a literary corpus extracted from the Gutenberg Project, the Blog Authorship and IMDb62.
arXiv Detail & Related papers (2024-07-18T10:01:09Z) - Step-Back Profiling: Distilling User History for Personalized Scientific Writing [50.481041470669766]
Large language models (LLM) excel at a variety of natural language processing tasks, yet they struggle to generate personalized content for individuals.
We introduce STEP-BACK PROFILING to personalize LLMs by distilling user history into concise profiles.
Our approach outperforms the baselines by up to 3.6 points on the general personalization benchmark.
arXiv Detail & Related papers (2024-06-20T12:58:26Z) - Weaver: Foundation Models for Creative Writing [61.26716770063019]
We introduce Weaver, our first family of large language models (LLMs) dedicated to content creation.
Weaver is pre-trained on a carefully selected corpus that focuses on improving the writing capabilities of large language models.
We fine-tune Weaver for creative and professional writing purposes and align it to the preference of professional writers.
arXiv Detail & Related papers (2024-01-30T18:58:43Z) - Generating Illustrated Instructions [41.613203340244155]
We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs.
We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion.
arXiv Detail & Related papers (2023-12-07T18:59:20Z) - Who's Harry Potter? Approximate Unlearning in LLMs [4.821438899378393]
Large language models (LLMs) are trained on massive internet corpora that often contain copyrighted content.
This poses legal and ethical challenges for the developers and users of these models, as well as the original authors and publishers.
We propose a novel technique for unlearning a subset of the training data from a LLM, without having to retrain it from scratch.
arXiv Detail & Related papers (2023-10-03T17:48:14Z) - PerPLM: Personalized Fine-tuning of Pretrained Language Models via
Writer-specific Intermediate Learning and Prompts [16.59511985633798]
Pretrained language models (PLMs) are powerful tools for capturing context.
PLMs are typically pretrained and fine-tuned for universal use across different writers.
This study aims to improve the accuracy of text understanding tasks by personalizing the fine-tuning of PLMs for specific writers.
arXiv Detail & Related papers (2023-09-14T14:03:48Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z) - Unsupervised Neural Stylistic Text Generation using Transfer learning
and Adapters [66.17039929803933]
We propose a novel transfer learning framework which updates only $0.3%$ of model parameters to learn style specific attributes for response generation.
We learn style specific attributes from the PERSONALITY-CAPTIONS dataset.
arXiv Detail & Related papers (2022-10-07T00:09:22Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn textbfauthorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - MetaHTR: Towards Writer-Adaptive Handwritten Text Recognition [36.12001394921506]
We propose a new approach to handwritten text recognition.
We use a novel meta-learning framework which exploits additional new-writer data.
Our framework can be easily implemented on the top of most state-of-the-art HTR models.
arXiv Detail & Related papers (2021-04-05T12:35:39Z) - Extracting Training Data from Large Language Models [78.3839333127544]
This paper demonstrates that an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.
We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data.
arXiv Detail & Related papers (2020-12-14T18:39:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.