On the Credibility of Evaluating LLMs using Survey Questions
- URL: http://arxiv.org/abs/2602.04033v1
- Date: Tue, 03 Feb 2026 21:45:43 GMT
- Title: On the Credibility of Evaluating LLMs using Survey Questions
- Authors: Jindřich Libovický
- Abstract summary: Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientation. Using the World Values Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientation. Using the World Values Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance, which measures whether LLMs maintain consistent relationships between answers across different questions, as humans do. Our analysis with this metric indicates that even high average agreement with human data, when LLM responses are considered independently, does not guarantee structural alignment across responses. Additionally, we reveal a weak correlation between two common evaluation metrics, mean-squared distance and KL divergence, both of which assume that survey answers are independent of each other. For future research, we recommend CoT prompting, sampling-based decoding with dozens of samples, and robust analysis using multiple metrics, including self-correlation distance.
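Self-correlation distance is described here only at a high level. Below is a minimal sketch of one plausible reading, in which the question-by-question correlation matrix of sampled LLM answers is compared against that of human respondents; the use of Pearson correlation and mean-absolute-difference aggregation are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

def self_correlation_distance(human_answers: np.ndarray,
                              llm_answers: np.ndarray) -> float:
    """Compare the answer-correlation structure of an LLM with humans'.

    Both inputs are (respondents x questions) matrices of Likert-style
    answers; for the LLM, "respondents" are independent sampled runs.
    Returns the mean absolute difference between the two question-by-
    question correlation matrices (one plausible reading of the metric).
    """
    human_corr = np.corrcoef(human_answers, rowvar=False)
    llm_corr = np.corrcoef(llm_answers, rowvar=False)
    # Compare only off-diagonal entries; the diagonal is 1 by construction.
    mask = ~np.eye(human_corr.shape[0], dtype=bool)
    return float(np.abs(human_corr[mask] - llm_corr[mask]).mean())

# Toy example: 200 human respondents and 50 sampled LLM runs on 10 questions.
rng = np.random.default_rng(0)
humans = rng.integers(1, 6, size=(200, 10)).astype(float)
llm = rng.integers(1, 6, size=(50, 10)).astype(float)
print(self_correlation_distance(humans, llm))
```

A distance near zero means the model reproduces the relationships humans exhibit between questions, which is exactly what high per-question agreement alone cannot show.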
Related papers
- A-VERT: Agnostic Verification with Embedding Ranking Targets
A structure-free evaluation method is presented. The method makes use of semantic embedding distances to match target candidates with arbitrary LM-generated text. The results show a regression score of 0.97 and an accuracy of 96% against human annotators.
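For intuition, a minimal sketch of embedding-based answer matching in the spirit of A-VERT follows; the model choice (all-MiniLM-L6-v2 via sentence-transformers) and the argmax matching rule are illustrative assumptions, not the paper's setup.

```python
# Requires: pip install sentence-transformers numpy
from sentence_transformers import SentenceTransformer
import numpy as np

# Model choice is an assumption for illustration, not the paper's.
model = SentenceTransformer("all-MiniLM-L6-v2")

def match_answer(generated: str, candidates: list[str]) -> int:
    """Return the index of the candidate target closest to the generated
    text in embedding space (cosine similarity on normalized vectors)."""
    embs = model.encode([generated] + candidates, normalize_embeddings=True)
    sims = embs[1:] @ embs[0]
    return int(np.argmax(sims))

choices = ["Paris", "London", "Berlin"]
print(choices[match_answer("The capital of France is Paris.", choices)])
```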
arXiv Detail & Related papers (2025-10-01T21:26:03Z)
- Reference-Free Rating of LLM Responses via Latent Information
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
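One common latent-signal recipe, which may or may not match the paper's Latent Judges in detail, converts the judge's logits over the rating tokens "1" to "5" into a probability-weighted mean rating:

```python
import numpy as np

def expected_likert_score(rating_logits: np.ndarray) -> float:
    """Turn a judge model's logits over the rating tokens "1".."5" into
    a scalar: the probability-weighted mean rating. Illustrative recipe,
    not necessarily the paper's exact method.
    """
    probs = np.exp(rating_logits - rating_logits.max())
    probs /= probs.sum()  # softmax over the five rating tokens only
    return float(np.dot(probs, np.arange(1, 6)))

# Toy logits favouring a rating of 4.
print(expected_likert_score(np.array([-2.0, -1.0, 0.5, 2.0, 0.0])))
```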
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
- Estimating LLM Consistency: A User Baseline vs Surrogate Metrics
Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations. We propose a logit-based ensemble method for estimating LLM consistency.
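As a heavily simplified illustration of a logit-based surrogate (an assumption, not the paper's estimator): low variance of length-normalized sequence log-probabilities across sampled answers can be read as a sign of consistency.

```python
import numpy as np

def consistency_from_logprobs(sequence_logprobs: list[float]) -> float:
    """Map the spread of length-normalized log-probabilities of several
    sampled answers to a score in (0, 1]; higher means more consistent.
    Illustrative surrogate only.
    """
    lp = np.asarray(sequence_logprobs, dtype=float)
    return float(1.0 / (1.0 + lp.std()))

# Length-normalized log-probs of five sampled answers to the same prompt.
print(consistency_from_logprobs([-0.9, -1.0, -0.95, -1.1, -0.92]))
```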
arXiv Detail & Related papers (2025-05-26T16:53:47Z)
- An Empirical Study of Evaluating Long-form Question Answering
We collect 5,236 factoid and non-factoid long-form answers generated by different large language models. We conduct a human evaluation on 2,079 of them, focusing on correctness and informativeness. We find that the style, length of the answers, and the category of questions can bias the automatic evaluation metrics.
arXiv Detail & Related papers (2025-04-25T15:14:25Z)
- Uncertainty Estimation of Large Language Models in Medical Question Answering
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications.
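A representative sampling-based baseline of the kind such benchmarks compare is predictive entropy over repeated answers (illustrative; the benchmarked methods differ in detail):

```python
from collections import Counter
import math

def predictive_entropy(sampled_answers: list[str]) -> float:
    """Entropy of the empirical answer distribution over repeated samples
    from the same prompt; higher means more uncertain. A standard
    sampling-based uncertainty estimate, shown here for illustration.
    """
    counts = Counter(a.strip().lower() for a in sampled_answers)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

print(predictive_entropy(["Aspirin", "aspirin", "Ibuprofen", "aspirin"]))
```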
arXiv Detail & Related papers (2024-07-11T16:51:33Z)
- Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making
This study evaluates large language models (LLMs) across diverse domains, including cybersecurity, medicine, and finance.
The results indicate that model size and types of prompts used for inference significantly influenced response length and quality.
arXiv Detail & Related papers (2024-06-25T20:52:31Z)
- An unsupervised learning approach to evaluate questionnaire data -- what one can learn from violations of measurement invariance
This paper promotes an unsupervised learning-based approach to such questionnaire data.
It works in three phases: data preparation, clustering of questionnaires, and measuring similarity based on the obtained clustering and the properties of each group.
It provides a natural comparison between groups and a natural description of the response patterns of the groups.
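A minimal sketch of the three-phase pipeline, assuming scikit-learn; the cluster count, the synthetic data, and total-variation distance as the similarity measure are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Phase 1: prepared Likert-style responses (300 respondents, 12 items).
responses = rng.integers(1, 6, size=(300, 12)).astype(float)
groups = np.repeat([0, 1], 150)  # two respondent groups, e.g. two countries

# Phase 2: cluster questionnaires into response patterns.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(responses)

# Phase 3: describe each group by its distribution over the clusters,
# then compare groups by the distance between those distributions.
def cluster_profile(labels: np.ndarray, mask: np.ndarray, k: int = 4):
    counts = np.bincount(labels[mask], minlength=k).astype(float)
    return counts / counts.sum()

p0 = cluster_profile(kmeans.labels_, groups == 0)
p1 = cluster_profile(kmeans.labels_, groups == 1)
print("total-variation distance between groups:", 0.5 * np.abs(p0 - p1).sum())
```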
arXiv Detail & Related papers (2023-12-11T11:31:41Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous substring of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
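The paper fine-tunes a model to generate the LSS; as a purely illustrative stand-in, a token-level proxy can be computed with the classical longest-common-subsequence dynamic program, since the LSS is the longest in-order (possibly noncontinuous) sequence of claim tokens that also appears in the context.

```python
def longest_supported_subsequence(claim: str, context: str) -> list[str]:
    """Token-level LCS proxy for the LSS: the longest noncontinuous
    sequence of claim tokens appearing, in order, in the context.
    Illustrative stand-in for the paper's learned generator.
    """
    a, b = claim.split(), context.split()
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1],
                                                               dp[i + 1][j])
    # Backtrack to recover the supported tokens.
    out, i, j = [], len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

claim = "the cat sat quietly on the red mat"
context = "yesterday the cat sat on the mat near the door"
print(longest_supported_subsequence(claim, context))
```

A crude faithfulness score is then the fraction of claim tokens recovered, e.g. len(lss) / len(claim.split()).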
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
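Probes in this framework share one shape: transform unlabeled text and measure how much a model's score moves, with no human labels required. A generic sketch follows; the word-shuffle transformation and the dummy scorer are placeholders (a real scorer would be, e.g., a model log-likelihood).

```python
import random
from typing import Callable

def sensitivity(texts: list[str],
                transform: Callable[[str], str],
                score: Callable[[str], float]) -> float:
    """Mean absolute score change under the transformation; label-free."""
    return sum(abs(score(t) - score(transform(t))) for t in texts) / len(texts)

def shuffle_words(text: str) -> str:
    """Destroys long-range order while keeping the bag of words."""
    words = text.split()
    random.Random(0).shuffle(words)
    return " ".join(words)

# Dummy order-sensitive scorer standing in for a model log-likelihood.
def dummy_score(text: str) -> float:
    words = text.split()
    return float(sum(len(a) < len(b) for a, b in zip(words, words[1:])))

print(sensitivity(["the quick brown fox jumps over the lazy dog"],
                  shuffle_words, dummy_score))
```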
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Using Natural Language Explanations to Rescale Human Judgments
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs). We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
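A sketch of the rescaling step as described: the Likert rating and explanation go into a prompt together with a scoring rubric, and the LLM returns a rubric-anchored number. The rubric text and the call_llm callable are placeholders, not the paper's prompt.

```python
# RUBRIC and call_llm are placeholders for illustration.
RUBRIC = """Score 0-100. 90+: fully correct and complete. 70-89: minor
omissions. 40-69: partially correct. Below 40: mostly incorrect."""

def rescale(rating: int, explanation: str, call_llm) -> float:
    """Prompt an LLM to map a Likert rating plus its explanation onto a
    rubric-anchored 0-100 score. `call_llm` is any str -> str completion
    function (an assumption; plug in whatever API you use)."""
    prompt = (
        f"Rubric:\n{RUBRIC}\n\n"
        f"An annotator gave a rating of {rating}/5 and explained:\n"
        f"\"{explanation}\"\n\n"
        "Based on the rubric and the explanation, output a single numeric "
        "score from 0 to 100 and nothing else."
    )
    return float(call_llm(prompt).strip())

# Usage with a stubbed model:
print(rescale(4, "Correct overall, but one date is wrong.", lambda p: "82"))
```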
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
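The wide intervals come from resampling; a minimal percentile bootstrap for the system-level correlation between an automatic metric and human judgments looks like this (the data and settings below are illustrative, not the paper's):

```python
import numpy as np

def bootstrap_corr_ci(metric, human, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the Pearson correlation
    between an automatic metric and human judgments across systems."""
    rng = np.random.default_rng(seed)
    metric, human = np.asarray(metric), np.asarray(human)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(metric), len(metric))  # resample w/ replacement
        stats.append(np.corrcoef(metric[idx], human[idx])[0, 1])
    return (np.quantile(stats, alpha / 2), np.quantile(stats, 1 - alpha / 2))

# Toy system-level scores: an automatic metric vs. human ratings.
rouge = [0.31, 0.35, 0.29, 0.40, 0.33, 0.37, 0.36, 0.30]
human = [3.1, 3.4, 2.8, 3.9, 3.0, 3.6, 3.7, 3.2]
print(bootstrap_corr_ci(rouge, human))
```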
arXiv Detail & Related papers (2021-03-31T18:28:14Z)