ComMer: a Framework for Compressing and Merging User Data for Personalization
- URL: http://arxiv.org/abs/2501.03276v1
- Date: Sun, 05 Jan 2025 09:57:03 GMT
- Title: ComMer: a Framework for Compressing and Merging User Data for Personalization
- Authors: Yoel Zeldes, Amir Zait, Ilia Labzovsky, Danny Karmon, Efrat Farkash
- Abstract: Large Language Models (LLMs) excel at a wide range of tasks, but adapting them to new data, particularly for personalized applications, poses significant challenges due to resource and computational constraints. Existing methods either rely on exposing fresh data to the model through the prompt, which is limited by context size and computationally expensive at inference time, or fine-tuning, which incurs substantial training and update costs. In this paper, we introduce ComMer - Compress and Merge - a novel framework that efficiently personalizes LLMs by compressing users' documents into compact representations, which are then merged and fed into a frozen LLM. We evaluate ComMer on two types of personalization tasks - personalized skill learning, using the tweet paraphrasing dataset and the personalized news headline generation dataset from the LaMP benchmark, and knowledge-intensive, using the PerLTQA dataset. Our experiments demonstrate that in constrained inference budget scenarios ComMer achieves superior quality in skill learning tasks, while highlighting limitations in knowledge-intensive settings due to the loss of detailed information. These results offer insights into trade-offs and potential optimizations in multi-document compression for personalization.
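The compress-and-merge flow described in the abstract lends itself to a short illustration. Below is a minimal PyTorch sketch, assuming a soft-prompt-style compressor whose outputs are mean-pooled across documents; the class and function names are illustrative, not the paper's actual implementation.

```python
# Minimal compress-and-merge sketch (illustrative, not the paper's code).
import torch
import torch.nn as nn

class DocCompressor(nn.Module):
    """Compress a document's token embeddings into k soft tokens."""
    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k, d_model))  # learned query vectors
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, doc_embs: torch.Tensor) -> torch.Tensor:
        # doc_embs: (1, seq_len, d_model) -> (1, k, d_model)
        out, _ = self.attn(self.queries.unsqueeze(0), doc_embs, doc_embs)
        return out

def merge(compressed: list[torch.Tensor]) -> torch.Tensor:
    # Mean-pool per-document soft tokens into one compact representation.
    return torch.stack(compressed, dim=0).mean(dim=0)

d_model, k = 64, 8
compressor = DocCompressor(d_model, k)
docs = [torch.randn(1, n, d_model) for n in (12, 30, 7)]  # three toy "documents"
merged = merge([compressor(doc) for doc in docs])          # shape (1, 8, 64)
# `merged` would be prepended to the frozen LLM's input embeddings.
```

Only the compressor is trained; the LLM stays frozen, which is what keeps per-user adaptation cheap.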
Related papers
- Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora [2.3251886193174114]
We present an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning.
Our method generates multiple choice questions (MCQs) from texts and measures an LLM's performance both with and without access to the source material.
Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.
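As a rough sketch of the measurement, the gap between open-book and closed-book accuracy on the generated MCQs can be computed as follows (`ask_llm` is a hypothetical callable, not the paper's API):

```python
# Hypothetical sketch: score a text collection by how much access to it
# improves an LLM's accuracy on MCQs generated from that collection.
def information_potential(mcqs, source_text, ask_llm):
    """mcqs: list of (question, choices, answer); ask_llm(prompt) -> chosen answer."""
    closed = opened = 0
    for question, choices, answer in mcqs:
        options = "\n".join(f"{i}. {c}" for i, c in enumerate(choices))
        closed += ask_llm(f"{question}\n{options}") == answer
        opened += ask_llm(f"{source_text}\n\n{question}\n{options}") == answer
    # A large open-book vs. closed-book gap suggests the collection
    # contains information the model does not already know.
    return (opened - closed) / len(mcqs)
```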
arXiv Detail & Related papers (2025-02-19T13:03:06Z)
- Optimizing Pretraining Data Mixtures with LLM-Estimated Utility [52.08428597962423]
Large Language Models improve with increasing amounts of high-quality training data.
We find token-counts outperform manual and learned mixes, indicating that simple approaches for dataset size and diversity are surprisingly effective.
We propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by ~200x.
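A toy sketch of the weighting idea, heavily simplified: per-dataset utility estimates become sampling weights, capped by each dataset's available tokens. The real UtiliMax solves a portfolio-style optimization, so everything here is illustrative.

```python
# Simplified, hypothetical stand-in for utility-based mixing (not UtiliMax itself).
def mixture_weights(utilities, token_counts, total_budget):
    """Map per-dataset utility estimates to sampling weights, capped by size."""
    raw = {name: max(u, 0.0) for name, u in utilities.items()}
    total = sum(raw.values()) or 1.0
    weights = {}
    for name, u in raw.items():
        desired_tokens = (u / total) * total_budget
        # Never request more tokens than the dataset actually has.
        weights[name] = min(desired_tokens, token_counts[name]) / total_budget
    return weights  # may sum to < 1 after capping; renormalize if needed

print(mixture_weights({"web": 0.7, "code": 0.3}, {"web": 1e9, "code": 2e8}, 5e8))
# {'web': 0.7, 'code': 0.3}
```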
arXiv Detail & Related papers (2025-01-20T21:10:22Z)
- A Novel Compact LLM Framework for Local, High-Privacy EHR Data Applications [3.5839042822277585]
This paper presents a compact Large Language Model (LLM) framework designed for local deployment in settings with strict privacy requirements.
We introduce a novel preprocessing technique that uses information extraction methods, e.g., regular expressions, to filter and emphasize critical information in clinical notes.
Our framework is evaluated using zero-shot and few-shot learning paradigms on both private and publicly available (MIMIC-IV) datasets.
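In the spirit of that preprocessing step, a minimal regex filter might keep only the note sections relevant to a task; the section names and pattern below are hypothetical, not taken from the paper.

```python
# Hypothetical regex-based filter for clinical notes.
import re

KEEP_SECTIONS = {"chief complaint", "assessment", "plan", "medications"}

def filter_note(note: str) -> str:
    # Split before each "Section Name:" header (a common note convention).
    sections = re.split(r"\n(?=[A-Za-z ]+:)", note)
    kept = [s for s in sections
            if s.split(":", 1)[0].strip().lower() in KEEP_SECTIONS]
    return "\n".join(kept)

note = "Chief Complaint: chest pain\nSocial History: smoker\nPlan: start aspirin"
print(filter_note(note))  # the Social History section is dropped
```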
arXiv Detail & Related papers (2024-12-03T22:06:55Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
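A rough sketch of such a dual-model check, with hypothetical prompts and threshold: two different LLMs score each sample, and it survives only if both difficulty and quality clear a bar.

```python
# Hypothetical dual-model evaluation; prompts and threshold are illustrative.
def evaluate(sample: str, llm_a, llm_b):
    prompt = ("Rate this instruction-response pair from 0 to 1 for {axis}. "
              "Reply with a number only.\n\n{sample}")
    difficulty = float(llm_a(prompt.format(axis="difficulty", sample=sample)))
    quality = float(llm_b(prompt.format(axis="quality", sample=sample)))
    return difficulty, quality

def keep(sample: str, llm_a, llm_b, bar: float = 0.6) -> bool:
    difficulty, quality = evaluate(sample, llm_a, llm_b)
    return difficulty >= bar and quality >= bar
```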
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Retrieval-Augmented Personalization for Multimodal Large Language Models [53.304699445700926]
We introduce the Retrieval Augmented Personalization (RAP) framework for MLLMs' personalization.
RAP allows real-time concept editing via updating the external database.
RAP-MLLMs can generalize to infinite visual concepts without additional finetuning.
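A minimal sketch of the retrieve-then-prompt loop: user concepts live in an editable external store, so adding or updating a concept requires no finetuning. The embedding function and store below are stand-ins, not RAP's actual components.

```python
# Hypothetical concept store for retrieval-augmented personalization.
import numpy as np

class ConceptDB:
    def __init__(self, embed):
        self.embed, self.keys, self.values = embed, [], []

    def add(self, name: str, description: str):
        # Real-time concept editing: just update the external database.
        self.keys.append(self.embed(f"{name} {description}"))
        self.values.append((name, description))

    def retrieve(self, query: str, k: int = 2):
        q = self.embed(query)
        sims = [float(q @ key) for key in self.keys]  # cosine if normalized
        return [self.values[i] for i in np.argsort(sims)[::-1][:k]]

def personalized_prompt(db: ConceptDB, query: str) -> str:
    context = "\n".join(f"{n}: {d}" for n, d in db.retrieve(query))
    return f"Known user concepts:\n{context}\n\nQuestion: {query}"
```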
arXiv Detail & Related papers (2024-10-17T09:10:26Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
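The self-synthesis loop can be sketched as follows; `generate` and `finetune` are stand-ins for the student model's own calls, and the filter is a placeholder for the paper's quality checks.

```python
# Hypothetical sketch of self-synthetic finetuning.
def self_guide(task_prompt: str, generate, finetune, n: int = 100):
    pairs = []
    for _ in range(n):
        x = generate(f"Write one new input for this task:\n{task_prompt}")
        y = generate(f"{task_prompt}\n\nInput: {x}\nOutput:")
        if x and y:                # placeholder quality filter
            pairs.append((x, y))
    return finetune(pairs)         # finetune the same student model on its own data
```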
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations.
Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z)
- Rethinking the Instruction Quality: LIFT is What You Need [20.829372251475476]
Existing quality improvement methods alter instruction data through dataset expansion or curation.
We propose LIFT (LLM Instruction Fusion Transfer), a novel and versatile paradigm designed to elevate the instruction quality to new heights.
Experimental results demonstrate that, even with a limited quantity of high-quality instruction data selected by our paradigm, LLMs consistently uphold robust performance across various tasks.
arXiv Detail & Related papers (2023-12-12T03:30:21Z)
- TF-DCon: Leveraging Large Language Models (LLMs) to Empower Training-Free Dataset Condensation for Content-Based Recommendation [28.567219434790875]
Modern techniques in Content-based Recommendation (CBR) leverage item content information to provide personalized services to users, but suffer from resource-intensive training on large datasets.
We propose dataset condensation to synthesize a small yet informative dataset, upon which models can achieve performance comparable to those trained on large datasets.
We are able to approximate up to 97% of the original performance while reducing the dataset size by 95% (i.e., on the MIND dataset).
arXiv Detail & Related papers (2023-10-15T16:15:07Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its intrinsic generation capability.
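The intuition behind IFD admits a compact sketch: compare the model's average per-token loss on a response with and without the instruction as context, and flag samples where the instruction barely helps. `avg_nll` below is a stand-in for any causal LM's average negative log-likelihood.

```python
# Sketch of the IFD ratio; `avg_nll(text, context)` is a hypothetical helper
# returning the model's mean per-token loss on `text` given `context`.
def ifd(instruction: str, answer: str, avg_nll) -> float:
    conditioned = avg_nll(answer, context=instruction)  # s(A | Q)
    unconditioned = avg_nll(answer, context="")         # s(A)
    # A ratio near 1 means the instruction barely lowers the loss on the
    # answer, marking the sample as difficult to follow (a "cherry" candidate).
    return conditioned / unconditioned
```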
arXiv Detail & Related papers (2023-08-23T09:45:29Z)