GuideLLM: Exploring LLM-Guided Conversation with Applications in Autobiography Interviewing
- URL: http://arxiv.org/abs/2502.06494v1
- Date: Mon, 10 Feb 2025 14:11:32 GMT
- Title: GuideLLM: Exploring LLM-Guided Conversation with Applications in Autobiography Interviewing
- Authors: Jinhao Duan, Xinyu Zhao, Zhuoxuan Zhang, Eunhye Ko, Lily Boddy, Chenan Wang, Tianhao Li, Alexander Rasgon, Junyuan Hong, Min Kyung Lee, Chenxi Yuan, Qi Long, Ying Ding, Tianlong Chen, Kaidi Xu
- Abstract summary: Large Language Models (LLMs) succeed in human-guided conversations such as instruction following and question answering.
In this study, we first characterize LLM-guided conversation into three fundamental components: (i) Goal Navigation; (ii) Context Management; (iii) Empathetic Engagement.
We compare GuideLLM with 6 state-of-the-art LLMs, such as GPT-4o and Llama-3-70b-Instruct, in terms of interviewing quality and autobiography generation quality.
- Score: 73.8469700907927
- Abstract: Although Large Language Models (LLMs) succeed in human-guided conversations such as instruction following and question answering, the potential of LLM-guided conversations, where LLMs direct the discourse and steer the conversation's objectives, remains under-explored. In this study, we first characterize LLM-guided conversation into three fundamental components: (i) Goal Navigation; (ii) Context Management; (iii) Empathetic Engagement, and propose GuideLLM as an installation. We then implement an interviewing environment for the evaluation of LLM-guided conversation. Specifically, various topics are involved in this environment for comprehensive interviewing evaluation, resulting in around 1.4k turns of utterances, 184k tokens, and over 200 events mentioned during the interviewing for each chatbot evaluation. We compare GuideLLM with 6 state-of-the-art LLMs, such as GPT-4o and Llama-3-70b-Instruct, in terms of interviewing quality and autobiography generation quality. For automatic evaluation, we derive user proxies from multiple autobiographies and employ LLM-as-a-judge to score LLM behaviors. We further conduct a human-involved experiment by employing 45 human participants to chat with GuideLLM and baselines. We then collect human feedback, preferences, and ratings regarding the qualities of conversation and autobiography. Experimental results indicate that GuideLLM significantly outperforms baseline LLMs in automatic evaluation and achieves consistent leading performances in human ratings.
Related papers
- LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation [24.103034843158717]
We introduce LLM-as-an-Interviewer, a novel paradigm for evaluating large language models (LLMs).
This approach leverages multi-turn interactions where the interviewer actively provides feedback on responses and poses follow-up questions to the evaluated LLM.
We apply the framework to evaluate six models on the MATH and DepthQA tasks.
arXiv Detail & Related papers (2024-12-10T15:00:32Z)
- Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions [77.66677127535222]
Auto-Arena is an innovative framework that automates the entire evaluation process using LLM-powered agents.
In our experiments, Auto-Arena shows a 92.14% correlation with human preferences, surpassing all previous expert-annotated benchmarks.
arXiv Detail & Related papers (2024-05-30T17:19:19Z)
- Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation [15.718288693929019]
Large Language Models (LLM) achieve state-of-the-art performance on many NLP tasks.
We study whether LLMs can be used as substitutes for human annotators.
We find that LLMs outperform current automatic measures for system-level evaluation but still struggle to provide satisfactory explanations.
arXiv Detail & Related papers (2024-05-22T15:56:52Z)
- Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk [11.706292228586332]
Large language models (LLMs) are powerful dialogue agents, but specializing them towards fulfilling a specific function can be challenging.
We propose a more effective method for data collection through LLMs engaging in a conversation in various roles.
This approach generates training data via "self-talk" of LLMs that can be refined and utilized for supervised fine-tuning.
arXiv Detail & Related papers (2024-01-10T09:49:10Z)
- Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions [19.365615476223635]
Conversational question-answering systems aim to create interactive search systems that retrieve information by interacting with users.
Existing work uses human annotators to play the roles of the questioner (student) and the answerer (teacher).
We propose a simulation framework that employs zero-shot learner LLMs for simulating teacher-student interactions.
arXiv Detail & Related papers (2023-12-05T17:38:02Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues [72.65163468440434]
This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting.
We prompt large language models (LLMs) to generate a full multi-turn dialogue based on the ChatSEED, utterance by utterance.
We find that GPT-4 can generate human-style multi-turn dialogues of impressive quality, significantly outperforming its counterparts.
arXiv Detail & Related papers (2023-10-20T16:53:51Z)
- Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs [59.74002011562726]
We propose a novel linguistic cue-based chain-of-thoughts (Cue-CoT) to provide a more personalized and engaging response.
We build a benchmark with in-depth dialogue questions, consisting of 6 datasets in both Chinese and English.
Empirical results demonstrate that our proposed Cue-CoT method outperforms standard prompting methods in terms of both helpfulness and acceptability on all datasets.
arXiv Detail & Related papers (2023-05-19T16:27:43Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)