How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHO
- URL: http://arxiv.org/abs/2404.13957v1
- Date: Mon, 22 Apr 2024 08:00:51 GMT
- Title: How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHO
- Authors: Man Tik Ng, Hui Tung Tse, Jen-tse Huang, Jingjing Li, Wenxuan Wang, Michael R. Lyu,
- Abstract summary: We introduce ECHO, an evaluative framework inspired by the Turing test.
This framework engages the acquaintances of the target individuals to distinguish between human and machine-generated responses.
We evaluate three role-playing LLMs using ECHO, with GPT-3.5 and GPT-4 serving as foundational models.
- Score: 55.25989137825992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The role-play ability of Large Language Models (LLMs) has emerged as a popular research direction. However, existing studies focus on imitating well-known public figures or fictional characters, overlooking the potential for simulating ordinary individuals. Such an oversight limits the potential for advancements in digital human clones and non-player characters in video games. To bridge this gap, we introduce ECHO, an evaluative framework inspired by the Turing test. This framework engages the acquaintances of the target individuals to distinguish between human and machine-generated responses. Notably, our framework focuses on emulating average individuals rather than historical or fictional figures, presenting a unique advantage to apply the Turing Test. We evaluated three role-playing LLMs using ECHO, with GPT-3.5 and GPT-4 serving as foundational models, alongside the online application GPTs from OpenAI. Our results demonstrate that GPT-4 more effectively deceives human evaluators, and GPTs achieves a leading success rate of 48.3%. Furthermore, we investigated whether LLMs could discern between human-generated and machine-generated texts. While GPT-4 can identify differences, it could not determine which texts were human-produced. Our code and results of reproducing the role-playing LLMs are made publicly available via https://github.com/CUHK-ARISE/ECHO.
Related papers
- LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection [87.43727192273772]
It is often hard to tell whether a piece of text was human-written or machine-generated.
We present LLM-DetectAIve, designed for fine-grained detection.
It supports four categories: (i) human-written, (ii) machine-generated, (iii) machine-written, then machine-humanized, and (iv) human-written, then machine-polished.
arXiv Detail & Related papers (2024-08-08T07:43:17Z) - AI AI Bias: Large Language Models Favor Their Own Generated Content [0.1979158763744267]
We test whether large language models (LLMs) are biased towards text generated by LLMs over text authored by humans.
Our results show a consistent tendency for LLM-based AIs to prefer LLM-generated content.
This suggests the possibility of AI systems implicitly discriminating against humans, giving AI agents an unfair advantage.
arXiv Detail & Related papers (2024-07-09T13:15:14Z) - Capturing Minds, Not Just Words: Enhancing Role-Playing Language Models with Personality-Indicative Data [58.92110996840019]
We propose to enhance role-playing language models (RPLMs) via personality-indicative data.
Specifically, we leverage questions from psychological scales and distill advanced RPAs to generate dialogues that grasp the minds of characters.
Experimental results validate that RPLMs trained with our dataset exhibit advanced role-playing capabilities for both general and personality-related evaluations.
arXiv Detail & Related papers (2024-06-27T06:24:00Z) - People cannot distinguish GPT-4 from a human in a Turing test [0.913127392774573]
GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%)
Results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected.
arXiv Detail & Related papers (2024-05-09T04:14:09Z) - Are Large Language Models Aligned with People's Social Intuitions for Human-Robot Interactions? [7.308479353736709]
Large language models (LLMs) are increasingly used in robotics, especially for high-level action planning.
In this work, we test whether LLMs reproduce people's intuitions and communication in human-robot interaction scenarios.
We show that vision models fail to capture the essence of video stimuli and that LLMs tend to rate different communicative acts and behavior higher than people.
arXiv Detail & Related papers (2024-03-08T22:23:23Z) - PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for
Personality Detection [50.66968526809069]
We propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner.
Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection.
arXiv Detail & Related papers (2023-10-31T08:23:33Z) - BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues [72.65163468440434]
This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting.
We prompt large language models (LLMs) to generate a full multi-turn dialogue based on the ChatSEED, utterance by utterance.
We find GPT-4 can generate human-style multi-turn dialogues with impressive quality, significantly outperforms its counterparts.
arXiv Detail & Related papers (2023-10-20T16:53:51Z) - Character-LLM: A Trainable Agent for Role-Playing [67.35139167985008]
Large language models (LLMs) can be used to serve as agents to simulate human behaviors.
We introduce Character-LLM that teach LLMs to act as specific people such as Beethoven, Queen Cleopatra, Julius Caesar, etc.
arXiv Detail & Related papers (2023-10-16T07:58:56Z) - Large Language Models Can Infer Psychological Dispositions of Social Media Users [1.0923877073891446]
We test whether GPT-3.5 and GPT-4 can derive the Big Five personality traits from users' Facebook status updates in a zero-shot learning scenario.
Our results show an average correlation of r =.29 (range = [.22,.33]) between LLM-inferred and self-reported trait scores.
predictions were found to be more accurate for women and younger individuals on several traits, suggesting a potential bias stemming from the underlying training data or differences in online self-expression.
arXiv Detail & Related papers (2023-09-13T01:27:48Z) - Better Zero-Shot Reasoning with Role-Play Prompting [10.90357246745529]
Role-play prompting consistently surpasses the standard zero-shot approach across most datasets.
This highlights its potential to augment the reasoning capabilities of large language models.
arXiv Detail & Related papers (2023-08-15T11:08:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.