The Personalization Paradox: Semantic Loss vs. Reasoning Gains in Agentic AI Q&A
- URL: http://arxiv.org/abs/2512.04343v1
- Date: Thu, 04 Dec 2025 00:12:41 GMT
- Title: The Personalization Paradox: Semantic Loss vs. Reasoning Gains in Agentic AI Q&A
- Authors: Satyajit Movidi, Stephen Russell
- Abstract summary: We examined how personalization affects system performance across multiple evaluation dimensions. Results showed a consistent trade-off: personalization reliably improved reasoning quality and grounding. The study demonstrates that personalization produces metric-dependent shifts rather than uniform improvements.
- Score: 0.5623023138026949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AIVisor, an agentic retrieval-augmented LLM for student advising, was used to examine how personalization affects system performance across multiple evaluation dimensions. Using twelve authentic advising questions intentionally designed to stress lexical precision, we compared ten personalized and non-personalized system configurations and analyzed outcomes with a Linear Mixed-Effects Model across lexical (BLEU, ROUGE-L), semantic (METEOR, BERTScore), and grounding (RAGAS) metrics. Results showed a consistent trade-off: personalization reliably improved reasoning quality and grounding, yet introduced a significant negative interaction on semantic similarity, driven not by poorer answers but by the limits of current metrics, which penalize meaningful personalized deviations from generic reference texts. This reveals a structural flaw in prevailing LLM evaluation methods, which are ill-suited for assessing user-specific responses. The fully integrated personalized configuration produced the highest overall gains, suggesting that personalization can enhance system effectiveness when evaluated with appropriate multidimensional metrics. Overall, the study demonstrates that personalization produces metric-dependent shifts rather than uniform improvements and provides a methodological foundation for more transparent and robust personalization in agentic AI.
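The semantic-loss effect the abstract describes (a personalized paraphrase penalized for deviating lexically from a generic reference) can be illustrated with a toy lexical metric. This is a minimal sketch, not the paper's pipeline: the study used standard BLEU, ROUGE-L, METEOR, BERTScore, and RAGAS implementations, and the advising question and answers below are invented for illustration.

```python
from collections import Counter
import math

def unigram_precision(candidate: str, reference: str) -> float:
    """Toy BLEU-1-style unigram precision with a brevity penalty.
    A stand-in for lexical metrics such as BLEU/ROUGE-L; real
    evaluations should use nltk, sacrebleu, or rouge-score."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

# Generic reference answer vs. a personalized answer with the same meaning.
reference = "You must complete the calculus prerequisite before enrolling"
generic = "You must complete the calculus prerequisite before enrolling"
personalized = ("Since you already passed statistics, finish the calculus "
                "prerequisite first, then you can enroll")

print(unigram_precision(generic, reference))       # identical text scores 1.0
print(unigram_precision(personalized, reference))  # equivalent paraphrase scores far lower
```

The personalized answer conveys the same advice yet scores roughly 0.29 on this lexical proxy, which mirrors the paper's point that reference-based metrics penalize meaningful personalized deviations rather than genuinely poorer answers.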
Related papers
- Optimizing In-Context Demonstrations for LLM-based Automated Grading [31.353360036776976]
GUIDE (Grading Using Iteratively Designed Exemplars) is a framework that reframes exemplar selection and refinement as a boundary-focused optimization problem. We show that GUIDE significantly outperforms standard retrieval baselines in experiments in physics, chemistry, and pedagogical content knowledge.
arXiv Detail & Related papers (2026-02-28T04:52:38Z)
- Synthetic Interaction Data for Scalable Personalization in Large Language Models [67.31884245564086]
We introduce a high-fidelity synthetic data generation framework called PersonaGym. Unlike prior work that treats personalization as static persona-preference pairs, PersonaGym models a dynamic preference process. We release PersonaAtlas, a large-scale, high-quality, and diverse synthetic dataset of high-fidelity multi-turn personalized interaction trajectories.
arXiv Detail & Related papers (2026-02-12T20:41:22Z)
- Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It [81.50711040539566]
Current large language model (LLM) development treats task-solving and preference alignment as separate challenges. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks. Our framework creates scenarios where identical questions require different reasoning chains depending on user context.
arXiv Detail & Related papers (2025-09-30T18:55:28Z)
- The Unanticipated Asymmetry Between Perceptual Optimization and Assessment [15.11427750828098]
We show that fidelity metrics that excel in image quality assessment (IQA) are not necessarily effective for perceptual optimization. We also show that discriminator design plays a decisive role in shaping optimization, with patch-level and convolutional architectures providing more faithful detail reconstruction than vanilla or Transformer-based alternatives.
arXiv Detail & Related papers (2025-09-25T08:08:26Z)
- Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering [57.12316804290369]
Personalization is essential for adapting question answering systems to user-specific information needs. We propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without requiring task-specific fine-tuning. PoT consistently outperforms competitive baselines, achieving up to a 13.1% relative improvement.
arXiv Detail & Related papers (2025-09-23T14:44:46Z)
- Objective Metrics for Evaluating Large Language Models Using External Data Sources [4.574672973076743]
This paper proposes a framework for leveraging objective metrics derived from class textual materials across different semesters. The framework emphasizes automation and transparency in scoring, reducing reliance on human interpretation. This method addresses the limitations of subjective evaluation methods, providing a scalable solution for performance assessment in educational, scientific, and other high-stakes domains.
arXiv Detail & Related papers (2025-08-01T02:24:19Z)
- Metric Design != Metric Behavior: Improving Metric Selection for the Unbiased Evaluation of Dimensionality Reduction [10.099350224451387]
Dimensionality reduction (DR) projections are crucial for reliable visual analytics. Their evaluation can become biased if highly correlated metrics (those measuring similar structural characteristics) are inadvertently selected. We propose a novel workflow that reduces bias in the selection of evaluation metrics by clustering metrics based on their empirical correlations.
arXiv Detail & Related papers (2025-07-03T01:07:02Z)
- AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset [89.37514696019484]
Preference learning is critical for aligning large language models with human values. Our work shifts preference dataset design from ad hoc scaling to component-aware optimization.
arXiv Detail & Related papers (2025-04-04T17:33:07Z)
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
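The attenuation idea in this summary can be sketched in a few lines. This is a toy illustration under assumed details, not the paper's actual penalization scheme: here the per-sample DPO loss is simply down-weighted by an exponential factor of the preference uncertainty, which shrinks its gradient; the `gamma` coefficient and the exponential form are assumptions for the sketch.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """Vanilla DPO loss for one sample; `margin` is the difference of
    policy/reference log-ratios between chosen and rejected responses."""
    return -math.log(sigmoid(beta * margin))

def penalized_dpo_loss(margin: float, uncertainty: float,
                       beta: float = 0.1, gamma: float = 1.0) -> float:
    """Toy uncertainty penalization (an assumption, not the paper's exact
    scheme): scale the per-sample loss by exp(-gamma * uncertainty),
    attenuating the loss gradient for uncertain preference pairs."""
    return math.exp(-gamma * uncertainty) * dpo_loss(margin, beta)

def grad_wrt_margin(loss_fn, margin, *args, eps=1e-6, **kwargs):
    """Finite-difference gradient of the loss w.r.t. the margin."""
    return (loss_fn(margin + eps, *args, **kwargs)
            - loss_fn(margin - eps, *args, **kwargs)) / (2 * eps)

g_certain = grad_wrt_margin(penalized_dpo_loss, 1.0, 0.0)    # low uncertainty
g_uncertain = grad_wrt_margin(penalized_dpo_loss, 1.0, 2.0)  # high uncertainty
print(abs(g_certain), abs(g_uncertain))  # uncertain pair gets the smaller gradient
```

With zero uncertainty the penalized loss reduces to vanilla DPO; as uncertainty grows, the sample's gradient contribution shrinks, which is the qualitative behavior the summary describes.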
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
- Building Trust in Black-box Optimization: A Comprehensive Framework for Explainability [1.3812010983144802]
Surrogate Optimization (SO) is a common resolution, yet its proprietary nature leads to a lack of explainability and transparency.
We propose Inclusive Explainability Metrics for Surrogate Optimization (IEMSO).
These metrics enhance the transparency, trustworthiness, and explainability of the SO approaches.
arXiv Detail & Related papers (2024-10-18T16:20:17Z)
- Wait, that's not an option: LLMs Robustness with Incorrect Multiple-Choice Options [2.1184929769291294]
This work introduces a novel framework for evaluating LLMs' capacity to balance instruction-following with critical reasoning. We show that post-training aligned models often default to selecting invalid options, while base models exhibit improved refusal capabilities that scale with model size. We additionally conduct a parallel human study showing similar instruction-following biases, with implications for how these biases may propagate through human feedback datasets used in alignment.
arXiv Detail & Related papers (2024-08-27T19:27:43Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.