Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study
- URL: http://arxiv.org/abs/2602.17262v1
- Date: Thu, 19 Feb 2026 11:07:24 GMT
- Title: Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study
- Authors: Kensuke Okada, Yui Furukawa, Kyosuke Bunji
- Abstract summary: Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of large language models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding (SDR), biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.
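To make the quantification step concrete, the sketch below computes a direction-corrected standardized effect size from IRT-estimated latent scores under the two instruction conditions. The abstract does not spell out the exact estimator, so the Cohen's-d-style pooled standardization and the direction codes are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the SDR quantification step, not the paper's exact estimator.
# Assumes IRT latent trait scores (theta) have already been estimated for runs
# under HONEST and FAKE-GOOD instructions. The direction code is a hypothetical
# convention: +1 if the trait's high pole is socially desirable, -1 otherwise.
import numpy as np

def sdr_effect_size(theta_honest: np.ndarray,
                    theta_fake_good: np.ndarray,
                    direction: int) -> float:
    """Direction-corrected standardized effect size (Cohen's d style).

    Positive values indicate a shift toward the socially desirable pole
    under FAKE-GOOD relative to HONEST instructions.
    """
    mean_diff = theta_fake_good.mean() - theta_honest.mean()
    # Pool the sample variances of the two instruction conditions.
    pooled_sd = np.sqrt((theta_honest.var(ddof=1) + theta_fake_good.var(ddof=1)) / 2)
    return direction * mean_diff / pooled_sd

# Example: Neuroticism's high pole is socially undesirable (direction = -1),
# so a drop in theta under FAKE-GOOD yields a positive SDR estimate.
rng = np.random.default_rng(0)
honest = rng.normal(0.0, 1.0, size=200)
fake_good = rng.normal(-0.8, 1.0, size=200)
print(sdr_effect_size(honest, fake_good, direction=-1))
```

Because the effect size is standardized on the latent-trait scale, it can be compared across constructs and response formats, and against human instructed-faking benchmarks, as the abstract describes.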
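The mitigation step pairs items of matched desirability across Big Five domains. As a hedged stand-in for the paper's constrained optimization, the sketch below greedily selects cross-domain pairs with the smallest within-pair desirability gap, using each item at most once; the Item fields and the greedy strategy are assumptions for illustration.

```python
# A greedy stand-in for the paper's constrained optimization: pair items from
# different Big Five domains so that within-pair desirability ratings are as
# close as possible, using each item at most once. Field names and the greedy
# strategy are illustrative assumptions, not the authors' procedure.
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Item:
    text: str
    domain: str          # e.g. "E" for Extraversion, "N" for Neuroticism
    desirability: float  # mean rated social desirability of endorsing the item

def match_pairs(pool: list[Item], n_pairs: int = 30) -> list[tuple[Item, Item]]:
    # Enumerate cross-domain candidate pairs, best-matched first (smallest
    # desirability gap), then greedily keep pairs whose items are still unused.
    candidates = sorted(
        ((a, b) for a, b in combinations(pool, 2) if a.domain != b.domain),
        key=lambda pair: abs(pair[0].desirability - pair[1].desirability),
    )
    used: set[Item] = set()
    pairs: list[tuple[Item, Item]] = []
    for a, b in candidates:
        if len(pairs) == n_pairs:
            break
        if a not in used and b not in used:
            pairs.append((a, b))
            used.update((a, b))
    return pairs
```

A full instantiation would more likely use an integer program that also enforces balance constraints (for example, equal coverage per domain), which the greedy pass above does not guarantee.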
Related papers
- Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevant Assessment for IR Benchmarks [31.017987800426894]
DREAM is a multi-round debate-based relevance assessment framework with LLM agents. It achieves 95.2% labeling accuracy with only 3.5% human involvement. BRIDGE is a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison.
arXiv Detail & Related papers (2026-02-06T09:27:03Z) - SSR: Socratic Self-Refine for Large Language Model Reasoning [78.62319252287938]
Socratic Self-Refine (SSR) is a novel framework for fine-grained evaluation and precise refinement of Large Language Model (LLM) reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines.
arXiv Detail & Related papers (2025-11-13T18:47:07Z) - Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z) - RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, enabling compact models to generate comprehensive and sound assessments. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z) - Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol for evaluation can significantly affect evaluation reliability and induce systematic biases. We find that generator models can flip preferences by embedding distractor features. We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
arXiv Detail & Related papers (2025-04-20T19:05:59Z) - DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering [12.879551933541345]
We propose the Dynamic Arbitration Framework for Evaluation (DAFE) to evaluate large language models. DAFE employs two primary LLMs as judges and engages a third arbitrator only in cases of disagreement. We show DAFE's ability to provide consistent, scalable, and resource-efficient assessments.
arXiv Detail & Related papers (2025-03-11T15:29:55Z) - Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z) - Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset. We show that our approach consistently boosts DPO by a considerable margin. Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
arXiv Detail & Related papers (2024-10-10T16:01:51Z) - Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples whose valid responses differ in semantics.
There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z)