Related papers: Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance

URL: http://arxiv.org/abs/2407.07950v1
Date: Wed, 10 Jul 2024 18:00:05 GMT
Title: Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance
Authors: Kaitlyn Zhou, Jena D. Hwang, Xiang Ren, Nouha Dziri, Dan Jurafsky, Maarten Sap
Abstract summary: Reliance is influenced by numerous factors within the interactional context of a generation. We introduce Rel-A.I., an in situ, system-level evaluation approach to measure reliance.
Score: 73.19687314438133
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The reconfiguration of human-LM interactions from simple sentence completions to complex, multi-domain, humanlike engagements necessitates new methodologies to understand how humans choose to rely on LMs. In our work, we contend that reliance is influenced by numerous factors within the interactional context of a generation, a departure from prior work that used verbalized confidence (e.g., "I'm certain the answer is...") as the key determinant of reliance. Here, we introduce Rel-A.I., an in situ, system-level evaluation approach to measure human reliance on LM-generated epistemic markers (e.g., "I think it's...", "Undoubtedly it's..."). Using this methodology, we measure reliance rates in three emergent human-LM interaction settings: long-term interactions, anthropomorphic generations, and variable subject matter. Our findings reveal that reliance is not solely based on verbalized confidence but is significantly affected by other features of the interaction context. Prior interactions, anthropomorphic cues, and subject domain all contribute to reliance variability. An expression such as "I'm pretty sure it's..." can vary up to 20% in reliance frequency depending on its interactional context. Our work underscores the importance of context in understanding human reliance and offers future designers and researchers a methodology to conduct such measurements.

Related papers

How large language models judge and influence human cooperation [82.07571393247476]
We assess how state-of-the-art language models judge cooperative actions. We observe a remarkable agreement in evaluating cooperation against good opponents. We show that the differences revealed between models can significantly impact the prevalence of cooperation.
arXiv Detail & Related papers (2025-06-30T09:14:42Z)
An Empirical Study of the Role of Incompleteness and Ambiguity in Interactions with Large Language Models [0.9856777842758593]
We present a neural symbolic framework that models the interactions between humans and Large Language Models (LLMs). We define incompleteness and ambiguity in the questions as properties deducible from the messages exchanged in the interaction. Our results show multi-turn interactions are usually required for datasets which have a high proportion of incomplete or ambiguous questions.
arXiv Detail & Related papers (2025-03-23T04:34:30Z)
HumT DumT: Measuring and controlling human-like language in LLMs [29.82328120944693]
Human-like language might improve user experience, but might also lead to deception, overreliance, and stereotyping. We introduce HumT, metrics for human-like tone and other dimensions of social perceptions in text data based on relative probabilities from an LLM. We introduce DumT, a method using HumT to systematically control and reduce the degree of human-like tone while preserving model performance.
arXiv Detail & Related papers (2025-02-18T20:04:09Z)
Interactive Dialogue Agents via Reinforcement Learning on Hindsight Regenerations [58.65755268815283]
Many real dialogues are interactive, meaning an agent's utterances will influence their conversational partner, elicit information, or change their opinion. We use this fact to rewrite and augment existing suboptimal data, and train via offline reinforcement learning (RL) an agent that outperforms both prompting and learning from unaltered human demonstrations. Our results in a user study with real humans show that our approach greatly outperforms existing state-of-the-art dialogue agents.
arXiv Detail & Related papers (2024-11-07T21:37:51Z)
LMLPA: Language Model Linguistic Personality Assessment [11.599282127259736]
Large Language Models (LLMs) are increasingly used in everyday life and research. Measuring the personality of a given LLM is currently a challenge. This paper introduces the Language Model Linguistic Personality Assessment (LMLPA), a system designed to evaluate the linguistic personalities of LLMs.
arXiv Detail & Related papers (2024-10-23T07:48:51Z)
How do Large Language Models Navigate Conflicts between Honesty and Helpfulness? [14.706111954807021]
We use psychological models and experiments designed to characterize human behavior to analyze large language models. We find that reinforcement learning from human feedback improves both honesty and helpfulness. GPT-4 Turbo demonstrates human-like response patterns including sensitivity to the conversational framing and listener's decision context.
arXiv Detail & Related papers (2024-02-11T19:13:26Z)
LLM Agents in Interaction: Measuring Personality Consistency and Linguistic Alignment in Interacting Populations of Large Language Models [4.706971067968811]
We create a two-group population of large language model (LLM) agents using a simple variability-inducing sampling algorithm. We administer personality tests and submit the agents to a collaborative writing task, finding that different profiles exhibit different degrees of personality consistency and linguistic alignment to their conversational partners.
arXiv Detail & Related papers (2024-02-05T11:05:20Z)
Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty [53.336235704123915]
We investigate how LMs incorporate confidence in responses via natural language and how downstream users behave in response to LM-articulated uncertainties. We find that LMs are reluctant to express uncertainties when answering questions even when they produce incorrect responses. We test the risks of LM overconfidence by conducting human experiments and show that users rely heavily on LM generations. Lastly, we investigate the preference-annotated datasets used in post-training alignment and find that humans are biased against texts with uncertainty.
arXiv Detail & Related papers (2024-01-12T18:03:30Z)
AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents [65.16893197330589]
Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios. However, their capability in handling complex, multi-character social interactions has yet to be fully explored. We introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.
arXiv Detail & Related papers (2024-01-12T11:18:00Z)
Affect Recognition in Conversations Using Large Language Models [9.689990547610664]
Affect recognition plays a pivotal role in human communication. This study investigates the capacity of large language models (LLMs) to recognise human affect in conversations.
arXiv Detail & Related papers (2023-09-22T14:11:23Z)
Evaluating Language Models for Mathematics through Interactions [116.67206980096513]
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs). We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics. We derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z)
Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
The emergent capabilities of Large Language Models (LLMs) have made it crucial to align their values with those of humans. We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.