LLMs as mediators: Can they diagnose conflicts accurately?
- URL: http://arxiv.org/abs/2412.14675v1
- Date: Thu, 19 Dec 2024 09:29:08 GMT
- Title: LLMs as mediators: Can they diagnose conflicts accurately?
- Authors: Özgecan Koçak, Phanish Puranam, Afşar Yegin
- Abstract summary: We find that OpenAI's Large Language Models GPT-3.5 and GPT-4 can reliably distinguish between causal and moral codes.
When asked to diagnose the source of disagreement in a conversation, both LLMs, compared to humans, exhibit a tendency to overestimate the extent of causal disagreement.
- Abstract: Prior research indicates that to be able to mediate conflict, observers of disagreements between parties must be able to reliably distinguish the sources of their disagreement as stemming from differences in beliefs about what is true (causality) vs. differences in what they value (morality). In this paper, we test whether OpenAI's Large Language Models GPT-3.5 and GPT-4 can perform this task and whether one or the other type of disagreement proves particularly challenging for LLMs to diagnose. We replicate study 1 in Koçak et al. (2003), which employs a vignette design, with OpenAI's GPT-3.5 and GPT-4. We find that both LLMs have a semantic understanding of the distinction between causal and moral codes similar to humans' and can reliably distinguish between them. When asked to diagnose the source of disagreement in a conversation, both LLMs, compared to humans, exhibit a tendency to overestimate the extent of causal disagreement and underestimate the extent of moral disagreement in the moral misalignment condition. This tendency is especially pronounced for GPT-4 when using a proximate scale that relies on concrete language specific to an issue. GPT-3.5 does not perform as well as GPT-4 or humans when using either the proximate or the distal scale. The study provides a first test of the potential for using LLMs to mediate conflict by diagnosing the root of disagreements in causal and evaluative codes.
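To make the setup concrete, the sketch below shows one way a diagnosis prompt of this kind could be posed to GPT-4 through the OpenAI API. The vignette text, the 1-7 rating scale, and the model settings are illustrative assumptions, not the materials or protocol used in the study.

```python
# Minimal sketch (not the authors' protocol): asking an OpenAI chat model to rate
# whether a disagreement in a short vignette is causal or moral.
# Assumes the `openai` Python package (>=1.0) and OPENAI_API_KEY in the environment.
# The vignette and the 1-7 scale below are illustrative placeholders.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VIGNETTE = (
    "Two colleagues disagree about a proposed policy. "
    "One argues it will not reduce costs; the other argues it is unfair to junior staff."
)

PROMPT = (
    "Read the conversation below and rate, on a scale from 1 (not at all) to 7 (entirely), "
    "(a) how much the disagreement stems from different beliefs about what is true (causal), and "
    "(b) how much it stems from different values about what is right (moral). "
    "Answer as two numbers, e.g. 'causal: 4, moral: 6'.\n\n"
    f"Conversation:\n{VIGNETTE}"
)

response = client.chat.completions.create(
    model="gpt-4",   # the study compares GPT-3.5 and GPT-4
    temperature=0,   # deterministic output so ratings are repeatable
    messages=[{"role": "user", "content": PROMPT}],
)

print(response.choices[0].message.content)
```

Holding temperature at 0 simply keeps the ratings repeatable across runs, which makes it easier to compare responses across vignettes and models.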
Related papers
- Do Large Language Models Reason Causally Like Us? Even Better? [7.749713014052951]
Large language models (LLMs) have shown impressive capabilities in generating human-like text.
We compare causal reasoning in humans and four LLMs using tasks based on collider graphs.
We find that LLMs reason causally along a spectrum from human-like to normative inference, with alignment shifting based on model, context, and task.
arXiv Detail & Related papers (2025-02-14T15:09:15Z)
- Belief in the Machine: Investigating Epistemological Blind Spots of Language Models [51.63547465454027]
Language models (LMs) are essential for reliable decision-making in fields like healthcare, law, and journalism.
This study systematically evaluates the capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE.
Our results reveal key limitations. First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios.
Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data.
arXiv Detail & Related papers (2024-10-28T16:38:20Z)
- Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored.
This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, Stag Hunt and Prisoner's Dilemma.
Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by systematic biases.
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
- What Evidence Do Language Models Find Convincing? [94.90663008214918]
We build a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts.
We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions.
Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important.
arXiv Detail & Related papers (2024-02-19T02:15:34Z)
- Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? [53.98071556805525]
Neural language models (LMs) can be used to evaluate the truth of factual statements.
They can be queried for statement probabilities, or probed for internal representations of truthfulness.
Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs.
This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents.
arXiv Detail & Related papers (2023-11-27T18:59:14Z)
- The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning-with-refinement objective called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z)
- Exploring Qualitative Research Using LLMs [8.545798128849091]
This study aimed to compare and contrast the comprehension capabilities of humans and AI-driven large language models.
We conducted an experiment with a small sample of Alexa app reviews, initially classified by a human analyst.
LLMs were then asked to classify these reviews and provide the reasoning behind each classification.
arXiv Detail & Related papers (2023-06-23T05:21:36Z)
- Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate [19.887103433032774]
Large language models (LLMs) have shown impressive performance in complex reasoning tasks.
This work explores testing LLMs' reasoning by engaging with them in a debate-like conversation.
We find that despite their impressive performance, LLMs like ChatGPT cannot maintain their beliefs in truth for a significant portion of examples.
arXiv Detail & Related papers (2023-05-22T15:47:31Z)
- Can LLMs Capture Human Preferences? [5.683832910692926]
We explore the viability of Large Language Models (LLMs) in emulating human survey respondents and eliciting preferences.
We compare responses from LLMs across various languages to human responses, exploring preferences between smaller, sooner rewards and larger, later rewards.
Our findings reveal that both GPT models demonstrate less patience than humans, with GPT-3.5 exhibiting a lexicographic preference for earlier rewards, unlike human decision-makers.
arXiv Detail & Related papers (2023-05-04T03:51:31Z)
- Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour.
Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z)