Evaluating Human-Language Model Interaction
- URL: http://arxiv.org/abs/2212.09746v5
- Date: Fri, 5 Jan 2024 22:09:26 GMT
- Title: Evaluating Human-Language Model Interaction
- Authors: Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus,
Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda
Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi
Bommasani, Michael Bernstein, Percy Liang
- Abstract summary: We develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems.
We design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation.
We find that better non-interactive performance does not always translate to better human-LM interaction.
- Score: 79.33022878034627
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many real-world applications of language models (LMs), such as writing
assistance and code autocomplete, involve human-LM interaction. However, most
benchmarks are non-interactive in that a model produces output without human
involvement. To evaluate human-LM interaction, we develop a new framework,
Human-AI Language-based Interaction Evaluation (HALIE), that defines the
components of interactive systems and dimensions to consider when designing
evaluation metrics. Compared to standard, non-interactive evaluation, HALIE
captures (i) the interactive process, not only the final output; (ii) the
first-person subjective experience, not just a third-party assessment; and
(iii) notions of preference beyond quality (e.g., enjoyment and ownership). We
then design five tasks to cover different forms of interaction: social
dialogue, question answering, crossword puzzles, summarization, and metaphor
generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3
and AI21 Labs' Jurassic-1), we find that better non-interactive performance
does not always translate to better human-LM interaction. In particular, we
highlight three cases where the results from non-interactive and interactive
metrics diverge and underscore the importance of human-LM interaction for LM
evaluation.
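To make the three dimensions above concrete, the following is a minimal sketch of the kind of session record an interactive evaluation in this spirit might log. It is written in Python; all class and field names are assumptions made for illustration and are not HALIE's actual schema.

```python
# Illustrative sketch only (assumed names, not HALIE's actual schema): one way
# to record what the framework says an interactive evaluation should capture --
# the interaction process, the first-person experience, and preferences beyond
# output quality such as enjoyment and ownership.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TurnEvent:
    """One step of the interactive process (captures process, not just final output)."""
    role: str          # "user" or "system"
    text: str          # what was typed or generated at this step
    timestamp: float   # seconds since task start, usable for effort/latency analysis

@dataclass
class SubjectiveSurvey:
    """First-person experience and preference ratings (e.g., 1-5 Likert scale)."""
    quality: int       # perceived quality of the system's help
    enjoyment: int     # a preference dimension beyond quality
    ownership: int     # how much the user feels the output is their own

@dataclass
class InteractionTrace:
    """A full session: task, model, the event sequence, the final output, and the survey."""
    task: str                                   # e.g., "metaphor generation"
    model: str                                  # e.g., one of the GPT-3 variants
    events: List[TurnEvent] = field(default_factory=list)
    final_output: str = ""
    survey: Optional[SubjectiveSurvey] = None
```

A purely non-interactive benchmark would keep only `final_output`; the point of the other fields is that the process and the first-person survey become first-class evaluation data.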
Related papers
- Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation? [30.540795619470483]
We present the first systematic investigation of multi-modal large language models (MLLMs) in generating interactive webpages.
Specifically, we first formulate the Interaction-to-Code task and build the Interaction2Code benchmark.
We then conduct comprehensive experiments on three state-of-the-art (SOTA) MLLMs using both automatic metrics and human evaluations.
arXiv Detail & Related papers (2024-11-05T17:40:03Z)
- Self-Directed Turing Test for Large Language Models [56.64615470513102]
The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations.
Traditional Turing tests adopt a rigid dialogue format in which each participant sends only one message per turn.
This paper proposes the Self-Directed Turing Test, which extends the original test with a burst dialogue format.
arXiv Detail & Related papers (2024-08-19T09:57:28Z)
- BotEval: Facilitating Interactive Human Evaluation [21.99269491969255]
We develop BotEval, an easily customizable, open-source evaluation toolkit that focuses on enabling human-bot interactions as part of the evaluation process.
arXiv Detail & Related papers (2024-07-25T04:57:31Z)
- AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents [65.16893197330589]
Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios.
However, their capability in handling complex, multi-character social interactions has yet to be fully explored.
We introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.
arXiv Detail & Related papers (2024-01-12T11:18:00Z)
- Dialogue Evaluation with Offline Reinforcement Learning [2.580163308334609]
Task-oriented dialogue systems aim to fulfill user goals through natural language interactions.
They are ideally evaluated with human users, but doing so at every iteration of the development phase is infeasible.
We propose the use of offline reinforcement learning for dialogue evaluation based on a static corpus.
arXiv Detail & Related papers (2022-09-02T08:32:52Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics [47.20761880464552]
Generative dialogue modeling is widely seen as a language modeling task.
The task requires an agent to have a complex natural language understanding of its input text in order to carry out a meaningful interaction with a user.
The automatic metrics in use evaluate the quality of the generated text as a proxy for the holistic interaction of the agent.
arXiv Detail & Related papers (2020-08-24T13:28:35Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
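As a rough illustration of the last entry above, the sketch below scores a candidate response against the dialogue context alone, with no gold reference, using latent representations from a pre-trained encoder. The encoder choice and the cosine-similarity scoring rule are assumptions made for this example; the paper learns its metric rather than using raw similarity.

```python
# Rough sketch of an *unreferenced* dialogue-response scorer: it rates a
# candidate response against the dialogue context only, with no gold reference
# response. This is a simplified stand-in, not the trained metric from the
# paper above; the model choice and cosine-similarity scoring are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # any pre-trained encoder works for this sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(text: str) -> torch.Tensor:
    """Mean-pooled latent representation of an utterance."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)               # (dim,)

def unreferenced_score(context: str, response: str) -> float:
    """Higher means the response looks more appropriate for the context."""
    return torch.cosine_similarity(embed(context), embed(response), dim=0).item()

print(unreferenced_score("How was your weekend?", "Great, I went hiking with friends."))
```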
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.