Related papers: Are Large Language Models Consistent over Value-laden Questions?

URL: http://arxiv.org/abs/2407.02996v2
Date: Tue, 01 Oct 2024 21:23:18 GMT
Title: Are Large Language Models Consistent over Value-laden Questions?
Authors: Jared Moore, Tanvi Deshpande, Diyi Yang
Abstract summary: Large language models (LLMs) appear to bias their survey answers toward certain values. We define value consistency as the similarity of answers across paraphrases, use-cases, translations, and within a topic. Unlike prior work, we find that models are relatively consistent across paraphrases, use-cases, translations, and within a topic.
Score: 45.37331974356809
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models (LLMs) appear to bias their survey answers toward certain values. Nonetheless, some argue that LLMs are too inconsistent to simulate particular values. Are they? To answer, we first define value consistency as the similarity of answers across (1) paraphrases of one question, (2) related questions under one topic, (3) multiple-choice and open-ended use-cases of one question, and (4) multilingual translations of a question to English, Chinese, German, and Japanese. We apply these measures to small and large, open LLMs including llama-3, as well as gpt-4o, using 8,000 questions spanning more than 300 topics. Unlike prior work, we find that models are relatively consistent across paraphrases, use-cases, translations, and within a topic. Still, some inconsistencies remain. Models are more consistent on uncontroversial topics (e.g., in the U.S., "Thanksgiving") than on controversial ones ("euthanasia"). Base models are both more consistent compared to fine-tuned models and are uniform in their consistency across topics, while fine-tuned models are more inconsistent about some topics ("euthanasia") than others ("women's rights") like our human subjects (n=165).
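The abstract defines value consistency as the similarity of answers across paraphrases, topics, use-cases, and translations, but it does not give the formula. As a rough illustration only, the sketch below assumes each answer is a discrete label and scores consistency as the fraction of answer pairs that agree. The function name `pairwise_agreement` and the sample answers are hypothetical, and this is not necessarily the measure the authors use.

```python
from itertools import combinations


def pairwise_agreement(answers):
    """Return the fraction of answer pairs that match (1.0 = fully consistent).

    Assumption: answers are discrete labels, such as a multiple-choice option.
    """
    pairs = list(combinations(answers, 2))
    if not pairs:
        # Zero or one answer: nothing to disagree with.
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical answers a model gives to four paraphrases of one question.
paraphrase_answers = ["support", "support", "oppose", "support"]
score = pairwise_agreement(paraphrase_answers)
print(f"paraphrase consistency: {score:.2f}")
```

The same function could in principle be applied to answers grouped by topic, use-case, or language, since the abstract applies one notion of consistency to all four groupings.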

Related papers

Measuring Political Stance and Consistency in Large Language Models [1.1296803881058548]
We assess the stances of nine Large Language Models on 24 politically sensitive issues using five prompting techniques. We find that models often adopt opposing stances on several issues; some positions are malleable under prompting, while others remain stable.
arXiv Detail & Related papers (2026-01-15T06:12:40Z)
A Cross-Lingual Analysis of Bias in Large Language Models Using Romanian History [0.15293427903448023]
The research process was carried out in three stages, to confirm the idea that the type of response expected can influence, to a certain extent, the response itself. Results show that binary response stability is relatively high but far from perfect and varies by language.
arXiv Detail & Related papers (2025-09-28T13:03:09Z)
One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [55.35278531907263]
We present the first study on Large Language Models' fairness and robustness to a dialect in canonical reasoning tasks. We hire AAVE speakers to rewrite seven popular benchmarks, such as HumanEval and GSM8K. We find that, compared to Standardized English, almost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones? [65.43882564649721]
Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues. We develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty. We analyze the potential for improvement in consistency by relative consistency score.
arXiv Detail & Related papers (2024-06-18T17:25:47Z)
Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom [4.142301960178498]
SwordsmanImp is the first Chinese multi-turn-dialogue-based dataset aimed at conversational implicature. It includes 200 carefully handcrafted questions, all annotated on which Gricean maxims have been violated. Our results show that GPT-4 attains human-level accuracy (94%) on multiple-choice questions. Other models, including GPT-3.5 and several open-source models, demonstrate a lower accuracy ranging from 20% to 60% on multiple-choice questions.
arXiv Detail & Related papers (2024-04-30T12:43:53Z)
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ [16.637598165238934]
Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers. Recent research shows that, despite limits in their intended use, people prompt LLMs in many different languages. We introduce MultiQ, a new silver standard benchmark for basic open-ended question answering with 27.4k test questions.
arXiv Detail & Related papers (2024-03-06T16:01:44Z)
What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations [62.91799637259657]
Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond? We study this research question by probing contextualized embeddings and exploring whether this bias is encoded in its latent representations. We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors.
arXiv Detail & Related papers (2023-11-30T18:53:13Z)
SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning [44.53966523376327]
SeaEval is a benchmark for multilingual foundation models. We characterize how these models understand and reason with natural language. We also investigate how well they comprehend cultural practices, nuances, and values.
arXiv Detail & Related papers (2023-09-09T11:42:22Z)
Negated Complementary Commonsense using Large Language Models [3.42658286826597]
This work focuses on finding answers to negated complementary questions in commonsense scenarios. We propose a model-agnostic methodology to improve the performance in negated complementary scenarios.
arXiv Detail & Related papers (2023-07-13T15:03:48Z)
Speaking Multiple Languages Affects the Moral Bias of Language Models [70.94372902010232]
Pre-trained multilingual language models (PMLMs) are commonly used when dealing with data from multiple languages and cross-lingual transfer. Do the models capture moral norms from English and impose them on other languages? Our experiments demonstrate that, indeed, PMLMs encode differing moral biases, but these do not necessarily correspond to cultural differences or commonalities in human opinions.
arXiv Detail & Related papers (2022-11-14T20:08:54Z)
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We show that SQA improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
UnQovering Stereotyping Biases via Underspecified Questions [68.81749777034409]
We present UNQOVER, a framework to probe and quantify biases through underspecified questions. We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors. We use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion.
arXiv Detail & Related papers (2020-10-06T01:49:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.