Diversity Augmentation of Dynamic User Preference Data for Boosting Personalized Text Summarizers
- URL: http://arxiv.org/abs/2510.10082v1
- Date: Sat, 11 Oct 2025 07:40:20 GMT
- Title: Diversity Augmentation of Dynamic User Preference Data for Boosting Personalized Text Summarizers
- Authors: Parthiv Chatterjee, Shivam Sonawane, Amey Hengle, Aditya Tanna, Sourish Dasgupta, Tanmoy Chakraborty,
- Abstract summary: $mathrmPerAugy$ is a novel cross-trajectory shuffling and summary-content perturbation technique.<n>We show that it significantly boosts the accuracy of four state-of-the-art baseline (SOTA) user-encoders commonly used in personalized summarization frameworks.<n>As a post-hoc analysis of the role of induced diversity in the augmented dataset by peraugy, we introduce three dataset diversity metrics.
- Score: 16.572159435616456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document summarization enables efficient extraction of user-relevant content but is inherently shaped by individual subjectivity, making it challenging to identify subjective salient information in multifaceted documents. This complexity underscores the necessity for personalized summarization. However, training models for personalized summarization has so far been challenging, particularly because diverse training data containing both user preference history (i.e., click-skip trajectory) and expected (gold-reference) summaries are scarce. The MS/CAS PENS dataset is a valuable resource but includes only preference history without target summaries, preventing end-to-end supervised learning, and its limited topic-transition diversity further restricts generalization. To address this, we propose $\mathrm{PerAugy}$, a novel cross-trajectory shuffling and summary-content perturbation based data augmentation technique that significantly boosts the accuracy of four state-of-the-art baseline (SOTA) user-encoders commonly used in personalized summarization frameworks (best result: $\text{0.132}$$\uparrow$ w.r.t AUC). We select two such SOTA summarizer frameworks as baselines and observe that when augmented with their corresponding improved user-encoders, they consistently show an increase in personalization (avg. boost: $\text{61.2\%}\uparrow$ w.r.t. PSE-SU4 metric). As a post-hoc analysis of the role of induced diversity in the augmented dataset by \peraugy, we introduce three dataset diversity metrics -- $\mathrm{TP}$, $\mathrm{RTC}$, and \degreed\ to quantify the induced diversity. We find that $\mathrm{TP}$ and $\mathrm{DegreeD}$ strongly correlate with user-encoder performance on the PerAugy-generated dataset across all accuracy metrics, indicating that increased dataset diversity is a key factor driving performance gains.
Related papers
- Rethinking Representativeness and Diversity in Dynamic Data Selection [32.400383488290906]
Dynamic data selection accelerates training by sampling a changing subset of the dataset while preserving accuracy.<n>We rethink two core notions underlying sample evaluation: representativeness and diversity.<n>Our method matches or exceeds full-data accuracy with over 2x training acceleration.
arXiv Detail & Related papers (2026-03-05T09:21:58Z) - Lightweight Inference-Time Personalization for Frozen Knowledge Graph Embeddings [0.0]
GatedBias is a lightweight inference-time personalization framework for knowledge graphs.<n>Profile-specific features combine with graph-derived binary gates to produce interpretable, per-entity biases.<n>We evaluate GatedBias on two benchmark datasets.
arXiv Detail & Related papers (2025-12-26T22:30:37Z) - Bayesian Coreset Optimization for Personalized Federated Learning [15.758663902804075]
In a distributed machine learning setting like Federated Learning, there are multiple clients involved which update their individual weights to a single central server.<n>We propose a personalized coreset weighted federated learning setup where the training updates for each individual clients are forwarded to the central server based on only individual client coreset based representative data points instead of the entire client data.
arXiv Detail & Related papers (2025-11-03T17:58:14Z) - FedAWA: Adaptive Optimization of Aggregation Weights in Federated Learning Using Client Vectors [50.131271229165165]
Federated Learning (FL) has emerged as a promising framework for distributed machine learning.<n>Data heterogeneity resulting from differences across user behaviors, preferences, and device characteristics poses a significant challenge for federated learning.<n>We propose Adaptive Weight Aggregation (FedAWA), a novel method that adaptively adjusts aggregation weights based on client vectors during the learning process.
arXiv Detail & Related papers (2025-03-20T04:49:40Z) - Diversity-oriented Data Augmentation with Large Language Models [9.548912625579947]
We propose a textbfunderlineDi-textbfunderlineoriented data textbfunderlineAugmentation framework (textbfDoAug)<n>Specifically, we utilize a diversity-oriented fine-tuning approach to train an LLM as a diverse paraphraser, which is capable of augmenting textual datasets by generating diversified paraphrases.<n>The results show that our fine-tuned LLM augmenter improves diversity while preserving label consistency, thereby enhancing the robustness and performance of downstream tasks.
arXiv Detail & Related papers (2025-02-17T11:00:40Z) - TD3: Tucker Decomposition Based Dataset Distillation Method for Sequential Recommendation [50.23504065567638]
This paper introduces textbfTD3, a novel textbfDataset textbfDistillation method within a meta-learning framework.<n> TD3 distills a fully expressive emphsynthetic sequence summary from original data.<n>An augmentation technique allows the learner to closely fit the synthetic summary, ensuring an accurate update of it in the emphouter-loop.
arXiv Detail & Related papers (2025-02-05T03:13:25Z) - Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$infty$ is a new multi-way and adaptive data selection approach for lifelong instruction tuning.<n>We construct pseudo-skill clusters by grouping gradient-based sample vectors.<n>We select the best-performing data selector for each skill cluster from a pool of selector experts.<n>This data selector samples a subset of the most important samples from each skill cluster for training.
arXiv Detail & Related papers (2024-10-14T15:48:09Z) - $\textbf{Only-IF}$:Revealing the Decisive Effect of Instruction Diversity on Generalization [1.6958018695660049]
We show that $textbfonly emerges$ when training data is diversified enough across semantic domains.
We extend our analysis to real-world scenarios, including fine-tuning of $textit$textbfspecialist$$ and $textit$textbfgeneralist$$ models.
arXiv Detail & Related papers (2024-10-07T03:15:11Z) - Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We propose a systematically created human-annotated dataset consisting of coherent summaries for five publicly available datasets and natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% Rouge-L) in terms of producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z) - Diversity-Aware Meta Visual Prompting [111.75306320834629]
We present Diversity-Aware Meta Visual Prompting(DAM-VP), an efficient prompting method for transferring pre-trained models to downstream tasks with frozen backbone.
We cluster the downstream dataset into small subsets in a diversity-strapped way, with each subset has its own prompt separately.
All the prompts are optimized with a meta-prompt, which is learned across several datasets.
arXiv Detail & Related papers (2023-03-14T17:59:59Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.