Benchmarking and Improving LLM Robustness for Personalized Generation
- URL: http://arxiv.org/abs/2509.19358v1
- Date: Thu, 18 Sep 2025 13:56:14 GMT
- Title: Benchmarking and Improving LLM Robustness for Personalized Generation
- Authors: Chimaobi Okite, Naihao Deng, Kiran Bodipati, Huaidian Hou, Joyce Chai, Rada Mihalcea
- Abstract summary: We define a model as robust if its responses are both factually accurate and aligned with the user's preferences. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned deployments.
- Score: 42.26075952121524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have witnessed a growing interest in personalizing the responses of large language models (LLMs). While existing evaluations primarily focus on whether a response aligns with a user's preferences, we argue that factuality is an equally important yet often overlooked dimension. In the context of personalization, we define a model as robust if its responses are both factually accurate and aligned with the user's preferences. To assess this, we introduce PERG, a scalable framework for evaluating robustness in LLMs, along with a new dataset, PERGData. We evaluate fourteen models from five different model families using different prompting methods. Our findings show that current LLMs struggle with robust personalization: even the strongest models (GPT-4.1, LLaMA3-70B) fail to maintain correctness in 5% of previously successful cases without personalization, while smaller models (e.g., 7B-scale) can fail more than 20% of the time. Further analysis reveals that robustness is significantly affected by the nature of the query and the type of user preference. To mitigate these failures, we propose Pref-Aligner, a two-stage approach that improves robustness by an average of 25% across models. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned LLM deployments.
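The robustness notion in the abstract — of the queries a model answers correctly without personalization, how many stay correct once a user preference is injected — can be sketched as a simple metric. The function and data names below are illustrative assumptions, not the authors' PERG implementation.

```python
# Hypothetical sketch of the robustness metric described above: the fraction
# of previously-correct cases that remain factually correct after a user
# preference is added to the prompt.

def robustness(results_plain, results_personalized):
    """Both arguments map query-id -> bool (factually correct or not)."""
    base_correct = {qid for qid, ok in results_plain.items() if ok}
    if not base_correct:
        return 0.0
    survived = sum(1 for qid in base_correct
                   if results_personalized.get(qid, False))
    return survived / len(base_correct)

plain = {"q1": True, "q2": True, "q3": False, "q4": True}
personalized = {"q1": True, "q2": False, "q3": False, "q4": True}
# 2 of the 3 previously-correct queries survive personalization
print(robustness(plain, personalized))
```

Under this reading, the abstract's "fail in 5% of previously successful cases" corresponds to a robustness of 0.95.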
Related papers
- When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger [12.541521203916867]
Preference alignment is an essential step in adapting large language models to human values. We propose Confidence-Weighted Preference Optimization (CW-PO), a framework that re-weights training samples by a weak LLM's confidence. CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO.
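One way to read "re-weights training samples by a weak LLM's confidence" is as a confidence-scaled loss over preference pairs. The weighting scheme below is a minimal sketch under that assumption, not the paper's exact CW-PO formulation.

```python
# Illustrative confidence-weighted preference loss: each pair's standard DPO
# loss is scaled by a weak model's confidence in the preference label.
# The weighting and normalization are assumptions for illustration.
import math

def dpo_pair_loss(margin, beta=0.1):
    # standard DPO loss for one pair, given the policy/reference log-ratio margin
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def cw_po_loss(pairs):
    """pairs: list of (margin, confidence in [0, 1]) tuples."""
    total_weight = sum(conf for _, conf in pairs)
    return sum(conf * dpo_pair_loss(m) for m, conf in pairs) / total_weight

# a confident pair dominates the batch loss; the uncertain pair is down-weighted
print(round(cw_po_loss([(2.0, 0.9), (-1.0, 0.2)]), 4))  # roughly 0.62
```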
arXiv Detail & Related papers (2026-03-05T09:06:25Z)
- Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference [10.009730627424629]
Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates.
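The "dynamically selects the most suitable model" idea can be sketched as a router that tries a cheap model first and escalates only when its confidence falls below a threshold. Model names, costs, and the confidence source below are hypothetical.

```python
# Minimal sketch of confidence-driven model selection: escalate from cheap
# to expensive models until one is confident enough.

def route(query, models, threshold=0.8):
    """models: list of (name, answer_fn) pairs, cheapest first;
    each answer_fn returns (answer, confidence in [0, 1])."""
    for name, answer_fn in models:
        answer, confidence = answer_fn(query)
        if confidence >= threshold:
            break  # confident enough: stop escalating
    return name, answer  # otherwise the last (largest) model's answer is used

small = lambda q: ("unsure answer", 0.45)  # cheap model, low confidence
large = lambda q: ("final answer", 0.95)   # expensive model, high confidence
print(route("some query", [("7B", small), ("70B", large)]))  # ('70B', 'final answer')
```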
arXiv Detail & Related papers (2026-02-25T16:38:03Z)
- One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning [54.580646706013965]
Reward models (RMs) play a critical role in aligning large language models with human preferences. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling.
arXiv Detail & Related papers (2025-10-30T06:08:27Z)
- Investigating LLM Variability in Personalized Conversational Information Retrieval [14.220276130333849]
Mo et al. explored several strategies for incorporating Personal Textual Knowledge Bases (PTKB) into Large Language Models (LLMs). We apply the original methods to a new TREC iKAT 2024 dataset and evaluate a diverse range of models, including Llama (1B-70B), Qwen-7B, and GPT-4o-mini. Our results show that human-selected PTKBs consistently enhance retrieval performance, while LLM-based selection methods do not reliably outperform manual choices.
arXiv Detail & Related papers (2025-10-04T12:13:19Z)
- User-centric Subjective Leaderboard by Customizable Reward Modeling [34.40455169451943]
We present the first User-Centric Subjective Leaderboard (USL). It provides a preference-driven, dynamic ranking of large language models (LLMs) across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries.
arXiv Detail & Related papers (2025-08-13T03:39:04Z)
- Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification. We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z)
- SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models [4.875712300661656]
We present SCORE (Systematic COnsistency and Robustness Evaluation), a comprehensive framework for non-adversarial evaluation of Large Language Models. The SCORE framework evaluates models by repeatedly testing them on the same benchmarks in various setups to give a realistic estimate of their accuracy and consistency.
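"Repeatedly testing on the same benchmarks in various setups" suggests a consistency rate alongside mean accuracy. The sketch below assumes one set of predictions per setup; the metric and names are illustrative, not SCORE's exact definitions.

```python
# Hedged sketch of a SCORE-style consistency measure: the fraction of items
# answered identically across every setup (prompt variant, seed, ...).

def consistency_rate(runs):
    """runs: one dict per setup, each mapping item-id -> predicted answer."""
    items = runs[0].keys()
    consistent = sum(1 for i in items
                     if len({run[i] for run in runs}) == 1)
    return consistent / len(items)

runs = [{"a": "X", "b": "Y"},
        {"a": "X", "b": "Z"},
        {"a": "X", "b": "Y"}]
# item "a" agrees across all setups, item "b" does not
print(consistency_rate(runs))  # 0.5
```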
arXiv Detail & Related papers (2025-02-28T19:27:29Z)
- Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing the critique abilities of Large Language Models (LLMs). We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities. Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
- Lifelong Personalized Low-Rank Adaptation of Large Language Models for Recommendation [50.837277466987345]
We focus on the field of large language models (LLMs) for recommendation.
We propose RecLoRA, which incorporates a Personalized LoRA module that maintains independent LoRAs for different users.
We also design a Few2Many Learning Strategy, using a conventional recommendation model as a lens to magnify small training spaces to full spaces.
arXiv Detail & Related papers (2024-08-07T04:20:28Z)
- Large Language Model Confidence Estimation via Black-Box Access [30.490207799344333]
We explore the problem of estimating confidence for responses of large language models (LLMs) given only black-box or query access to them. We propose a simple and general framework in which we engineer novel features and train an interpretable model (viz., logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating the confidence of Flan-ul2, -13b, Mistral-7b, and GPT-4 on four benchmark Q&A tasks, as well as Pegasus-large and BART-large on two benchmark summarization tasks.
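The "engineer features, then fit a logistic model" recipe can be sketched end to end. The features and weights below are made up for illustration; the paper's actual features (and its trained logistic regression) differ.

```python
# Sketch of black-box confidence estimation: derive features from repeated
# model outputs for one query, then score them with a logistic function.
# Feature choices and weights here are assumptions, not the paper's.
import math

def engineered_features(samples):
    """samples: multiple answers drawn from the black-box model for one query."""
    agreement = max(samples.count(s) for s in samples) / len(samples)
    uniqueness = len(set(samples)) / len(samples)
    return [agreement, 1.0 - uniqueness]

def confidence(features, weights=(3.0, 2.0), bias=-2.0):
    # logistic model over the engineered features
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

consistent = ["Paris"] * 5
scattered = ["Paris", "Lyon", "Nice", "Paris", "Toulouse"]
# agreeing samples should yield higher estimated confidence
print(confidence(engineered_features(consistent)) >
      confidence(engineered_features(scattered)))  # True
```

In practice the logistic weights would be fit on held-out (features, correctness) pairs rather than set by hand as here.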
arXiv Detail & Related papers (2024-06-01T02:08:44Z)
- On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes.
Our analysis reveals that the impact of diversified human preferences depends on both model size and data size.
Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them.
arXiv Detail & Related papers (2023-12-12T16:17:15Z)
- Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.