Related papers: Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research

URL: http://arxiv.org/abs/2603.04897v1
Date: Thu, 05 Mar 2026 07:38:37 GMT
Title: Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research
Authors: Arina Kostina, Marios Dikaiakos, Alejandro Porcel, Tassos Stassopoulos,
Abstract summary: We evaluate large language models (LLMs) on the task of identifying the top three human values expressed in long-form interviews. We compare their outputs to expert annotations, analyzing both performance and uncertainty patterns relative to the experts.
Score: 39.146761527401424
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Qualitative analysis of open-ended interviews plays a central role in ethnographic and economic research by uncovering individuals' values, motivations, and culturally embedded financial behaviors. While large language models (LLMs) offer promising support for automating and enriching such interpretive work, their ability to produce nuanced, reliable interpretations under inherent task ambiguity remains unclear. In our work we evaluate LLMs on the task of identifying the top three human values expressed in long-form interviews based on the Schwartz Theory of Basic Values framework. We compare their outputs to expert annotations, analyzing both performance and uncertainty patterns relative to the experts. Results show that LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but struggle to recover exact value rankings, as reflected in lower RBO scores. While the average Schwartz value distributions of most models closely match those of human analysts, their uncertainty structures across the Schwartz values diverge from expert uncertainty patterns. Among the evaluated models, Qwen performs closest to expert-level agreement and exhibits the strongest alignment with expert Schwartz value distributions. LLM ensemble methods yield consistent gains across metrics, with Majority Vote and Borda Count performing best. Notably, systematic overemphasis on certain Schwartz values, like Security, suggests both the potential of LLMs to provide complementary perspectives and the need to further investigate model-induced value biases. Overall, our findings highlight both the promise and the limitations of LLMs as collaborators in inherently ambiguous qualitative value analysis.
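The distinction the abstract draws between set-based metrics and ranking can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the example value lists, and the particular Borda scoring and tie-breaking rules are hypothetical and are not taken from the paper, which may define its metrics and ensembles differently. RBO is omitted.

```python
from collections import defaultdict


def jaccard(pred, gold):
    """Jaccard overlap between two top-k value lists, ignoring order."""
    p, g = set(pred), set(gold)
    return len(p & g) / len(p | g) if (p | g) else 1.0


def set_f1(pred, gold):
    """F1 between two top-k value lists, ignoring order."""
    p, g = set(pred), set(gold)
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def borda_top_k(rankings, k=3):
    """Aggregate ranked lists with a simple Borda count.

    Assumed scoring: in a list of length n, the item at position i
    (0-based) earns n - i points; unlisted items earn nothing.
    Ties are broken alphabetically (an arbitrary choice).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, value in enumerate(ranking):
            scores[value] += n - position
    ordered = sorted(scores, key=lambda v: (-scores[v], v))
    return ordered[:k]


# Hypothetical annotations using Schwartz value names.
expert = ["Security", "Benevolence", "Tradition"]
model_a = ["Benevolence", "Security", "Achievement"]
model_b = ["Security", "Achievement", "Benevolence"]
reordered = ["Tradition", "Security", "Benevolence"]

print(jaccard(model_a, expert))    # two shared values out of four distinct
print(set_f1(model_a, expert))
print(jaccard(reordered, expert))  # same set, different order
print(borda_top_k([expert, model_a, model_b]))
```

Because both set metrics discard order, a prediction that contains the right three values in the wrong order still scores perfectly, which is why a rank-sensitive measure such as RBO can be lower even when F1 and Jaccard are high.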

Related papers

Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z)
Evaluating Financial Sentiment Analysis with Annotators Instruction Assisted Prompting: Enhancing Contextual Interpretation and Stock Prediction Accuracy [1.481550828146527]
Annotators' Instruction Assisted Prompt (AIAP) aims to standardize the understanding of sentiment across both human and machine interpretations. We utilize a new dataset, WSBS, derived from the WallStreetBets subreddit to demonstrate how AIAP significantly enhances LLM performance. This context-aware approach yields incremental gains in performance and also introduces an innovative sentiment-indexing method.
arXiv Detail & Related papers (2025-05-09T19:44:04Z)
CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives [3.7931130268412194]
CLASH is a dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. CLASH enables the study of critical yet underexplored aspects of value-based decision-making processes. Even strong proprietary models, such as GPT-5 and Claude-4-Sonnet, struggle with ambivalent decisions.
arXiv Detail & Related papers (2025-04-15T02:54:16Z)
An Empirical Analysis of Uncertainty in Large Language Model Evaluations [28.297464655099034]
We conduct experiments involving 9 widely used LLM evaluators across 2 different evaluation settings. We pinpoint that LLM evaluators exhibit varying uncertainty based on model families and sizes. We find that employing special prompting strategies, whether during inference or post-training, can alleviate evaluation uncertainty to some extent.
arXiv Detail & Related papers (2025-02-15T07:45:20Z)
Potential and Perils of Large Language Models as Judges of Unstructured Textual Data [0.631976908971572]
This research investigates the effectiveness of LLM-as-judge models in evaluating the thematic alignment of summaries generated by other LLMs. Our findings reveal that while LLM-as-judge models offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances.
arXiv Detail & Related papers (2025-01-14T14:49:14Z)
Value Compass Benchmarks: A Platform for Fundamental and Validated Evaluation of LLMs Values [76.70893269183684]
Large Language Models (LLMs) achieve remarkable breakthroughs. Aligning their values with humans has become imperative for their responsible development. Evaluations of LLMs' values that fulfill three desirable goals are still lacking.
arXiv Detail & Related papers (2025-01-13T05:53:56Z)
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite the excellence of LLM-as-a-Judge methods in many domains, potential issues remain under-explored, undermining their reliability and the scope of their utility. We identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which quantifies and analyzes each type of bias in LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z)
Beyond Human Norms: Unveiling Unique Values of Large Language Models through Interdisciplinary Approaches [69.73783026870998]
This work proposes a novel framework, ValueLex, to reconstruct Large Language Models' unique value system from scratch. Based on the Lexical Hypothesis, ValueLex introduces a generative approach to elicit diverse values from 30+ LLMs. We identify three core value dimensions, Competence, Character, and Integrity, each with specific subdimensions, revealing that LLMs possess a structured, albeit non-human, value system.
arXiv Detail & Related papers (2024-04-19T09:44:51Z)
The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate [17.77014177096838]
This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA dataset.
arXiv Detail & Related papers (2024-02-09T06:16:08Z)
Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning). We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language Models [11.154814189699735]
Large Language Models (LLMs) pre-trained on extensive corpora have demonstrated superior performance across various NLP tasks. We introduce a retrieval-augmented LLM framework for financial sentiment analysis. Our approach achieves a 15% to 48% performance gain in accuracy and F1 score.
arXiv Detail & Related papers (2023-10-06T05:40:23Z)
Empowering Many, Biasing a Few: Generalist Credit Scoring through Large Language Models [53.620827459684094]
Large Language Models (LLMs) have great potential for credit scoring tasks, with strong generalization ability across multiple tasks. We propose the first open-source comprehensive framework for exploring LLMs for credit scoring. We then propose the first Credit and Risk Assessment Large Language Model (CALM) by instruction tuning, tailored to the nuanced demands of various financial risk assessment tasks.
arXiv Detail & Related papers (2023-10-01T03:50:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.