Related papers: Revisiting LLM Value Probing Strategies: Are They Robust and Expressive?

Revisiting LLM Value Probing Strategies: Are They Robust and Expressive?

URL: http://arxiv.org/abs/2507.13490v1
Date: Thu, 17 Jul 2025 18:56:41 GMT
Title: Revisiting LLM Value Probing Strategies: Are They Robust and Expressive?
Authors: Siqi Shen, Mehar Singh, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Rada Mihalcea,
Abstract summary: We evaluate the robustness and expressiveness of value representations across three widely used probing strategies.<n>We show that the demographic context has little effect on the free-text generation, and the models' values only weakly correlate with their preference for value-based actions.
Score: 81.49470136653665
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: There has been extensive research on assessing the value orientation of Large Language Models (LLMs) as it can shape user experiences across demographic groups. However, several challenges remain. First, while the Multiple Choice Question (MCQ) setting has been shown to be vulnerable to perturbations, there is no systematic comparison of probing methods for value probing. Second, it is unclear to what extent the probed values capture in-context information and reflect models' preferences for real-world actions. In this paper, we evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We use variations in prompts and options, showing that all methods exhibit large variances under input perturbations. We also introduce two tasks studying whether the values are responsive to demographic context, and how well they align with the models' behaviors in value-related scenarios. We show that the demographic context has little effect on the free-text generation, and the models' values only weakly correlate with their preference for value-based actions. Our work highlights the need for a more careful examination of LLM value probing and awareness of its limitations.

Related papers

Bridging the Gap: In-Context Learning for Modeling Human Disagreement [8.011316959982654]
Large Language Models (LLMs) have shown strong performance on NLP classification tasks.<n>This study examines whether LLMs can capture multiple perspectives and reflect annotator disagreement in subjective tasks such as hate speech and offensive language detection.
arXiv Detail & Related papers (2025-06-06T14:24:29Z)
Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models [8.846200844870767]
We discover an understudied type of undesirable behavior of Large Language Models (LLMs)<n>We term Verbosity Compensation (VC) as similar to the hesitation behavior of humans under uncertainty.<n>We propose a simple yet effective cascade algorithm that replaces verbose responses with the other model-generated responses.
arXiv Detail & Related papers (2024-11-12T15:15:20Z)
Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes. We find that the majority of disagreements are in opposition with standard reward modeling approaches. We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM) VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses [34.77031649891843]
We introduce CLAVE, a novel framework which integrates two complementary Large Language Models (LLMs) This dual-model approach enables calibration with any value systems using 100 human-labeled samples per value type. We present ValEval, a comprehensive dataset comprising 13k+ (text,value,label) 12+s across diverse domains, covering three major value systems.
arXiv Detail & Related papers (2024-07-15T13:51:37Z)
VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs. Existing benchmarks are often limited in scope, focusing mainly on object hallucinations. We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models [61.45529177682614]
We challenge the prevailing constrained evaluation paradigm for values and opinions in large language models. We show that models give substantively different answers when not forced. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
arXiv Detail & Related papers (2024-02-26T18:00:49Z)
AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models. Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models. We propose detection-based protection models that can detect oversensitivity and overstability causing samples with high accuracies.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.