Why and When LLM-Based Assistants Can Go Wrong: Investigating the
Effectiveness of Prompt-Based Interactions for Software Help-Seeking
- URL: http://arxiv.org/abs/2402.08030v1
- Date: Mon, 12 Feb 2024 19:49:58 GMT
- Title: Why and When LLM-Based Assistants Can Go Wrong: Investigating the
Effectiveness of Prompt-Based Interactions for Software Help-Seeking
- Authors: Anjali Khurana, Hari Subramonyam, Parmit K Chilana
- Abstract summary: Large Language Model (LLM) assistants have emerged as potential alternatives to search methods for helping users navigate software.
LLM assistants use vast training data from domain-specific texts, software manuals, and code repositories to mimic human-like interactions.
- Score: 5.755004576310333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM) assistants, such as ChatGPT, have emerged as
potential alternatives to search methods for helping users navigate complex,
feature-rich software. LLMs use vast training data from domain-specific texts,
software manuals, and code repositories to mimic human-like interactions,
offering tailored assistance, including step-by-step instructions. In this
work, we investigated LLM-generated software guidance through a within-subject
experiment with 16 participants and follow-up interviews. We compared a
baseline LLM assistant with an LLM optimized for particular software contexts,
SoftAIBot, which also offered guidelines for constructing appropriate prompts.
We assessed task completion, perceived accuracy, relevance, and trust.
Surprisingly, although SoftAIBot outperformed the baseline LLM, our results
revealed no significant difference in LLM usage and user perceptions with or
without prompt guidelines and the integration of domain context. Most users
struggled to understand how the prompt's text related to the LLM's responses
and often followed the LLM's suggestions verbatim, even if they were incorrect.
This resulted in difficulties when using the LLM's advice for software tasks,
leading to low task completion rates. Our detailed analysis also revealed that
users remained unaware of inaccuracies in the LLM's responses, indicating a gap
between their lack of software expertise and their ability to evaluate the
LLM's assistance. With the growing push for designing domain-specific LLM
assistants, we emphasize the importance of incorporating explainable,
context-aware cues into LLMs to help users understand prompt-based
interactions, identify biases, and maximize the utility of LLM assistants.
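The paper does not include SoftAIBot's implementation, so the following is only a minimal sketch of the idea described above: pairing domain-specific software context with guidance on how to phrase prompts before the question reaches the model. Every name in it (the software, the excerpts, the function) is a hypothetical placeholder.

```python
# Hypothetical sketch of a SoftAIBot-style prompt. The study describes pairing
# domain context with prompt guidelines; the exact implementation is not given
# here, so all names and wording below are illustrative assumptions.

def build_prompt(software: str, excerpts: list[str], user_question: str) -> str:
    """Combine software context, prompt-writing guidance, and the question."""
    context = "\n".join(f"- {e}" for e in excerpts)  # snippets from manuals/docs
    guidance = (
        "Name the exact software version, describe what you already tried, "
        "and state the outcome you expect."
    )
    return (
        f"You assist users of {software}. Ground every step in the context below.\n"
        f"Context:\n{context}\n"
        f"Prompting tips shown to the user: {guidance}\n"
        f"User question: {user_question}\n"
        "Answer with numbered steps and say which context item each step uses."
    )

print(build_prompt(
    "ExampleCAD 2024",  # hypothetical software
    ["The Extrude tool is under Model > Create.", "Sketches must be closed before extruding."],
    "How do I extrude my sketch?",
))
```

The sketch only shows the shape of the interaction: retrieved software context, guidance for the user's own phrasing, and the question travel to the model together, which is what the study contrasts against an unaugmented baseline assistant.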
Related papers
- LLMs are Imperfect, Then What? An Empirical Study on LLM Failures in Software Engineering [38.20696656193963]
We conducted an observational study with 22 participants using ChatGPT as a coding assistant in a non-trivial software engineering task.
We identified the cases where ChatGPT failed, their root causes, and the corresponding mitigation solutions used by users.
arXiv Detail & Related papers (2024-11-15T03:29:41Z)
- Learning to Ask: When LLMs Meet Unclear Instruction [49.256630152684764]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone.
We evaluate the tool-use performance of LLMs under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench.
We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
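The AwN prompts and policies are not reproduced in this summary; the sketch below is only a minimal illustration of the general ask-when-needed pattern the entry above describes, with `call_llm` as a hypothetical stub for any chat model.

```python
# Minimal sketch of an ask-when-needed loop (not the authors' AwN code):
# the model either answers or emits a clarifying question when the
# instruction is ambiguous. `call_llm` is a hypothetical stub.

def call_llm(prompt: str) -> str:
    # Stub: pretend the model asks once, then answers once a detail is supplied.
    if "Clarification:" in prompt:
        return "Use File > Export and choose PDF."
    return "CLARIFY: Which file format do you want the export in?"

def ask_when_needed(instruction: str, ask_user, max_rounds: int = 3) -> str:
    prompt = (
        "If the instruction below is unambiguous, answer it. Otherwise reply "
        "starting with 'CLARIFY:' followed by one question for the user.\n"
        f"Instruction: {instruction}"
    )
    for _ in range(max_rounds):
        reply = call_llm(prompt)
        if not reply.startswith("CLARIFY:"):
            return reply
        detail = ask_user(reply.removeprefix("CLARIFY:").strip())
        prompt += f"\nClarification: {detail}"
    return call_llm(prompt + "\nAnswer with your best guess.")

print(ask_when_needed("Export my project", ask_user=lambda question: "As a PDF."))
```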
- I Need Help! Evaluating LLM's Ability to Ask for Users' Support: A Case Study on Text-to-SQL Generation [60.00337758147594]
This study explores the proactive ability of LLMs to seek user support.
We propose metrics to evaluate the trade-off between performance improvements and user burden.
Our experiments show that without external feedback, many LLMs struggle to recognize their need for user support.
arXiv Detail & Related papers (2024-07-20T06:12:29Z)
- CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
arXiv Detail & Related papers (2024-07-15T07:43:55Z)
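CIBench itself is built from LLM-human authored, consecutive IPython sessions; the sketch below only illustrates the general idea of executing model-written code cells one after another in a shared namespace and scoring how many run cleanly. `generate_cell` is a hypothetical stub for the model, not part of the benchmark.

```python
# Rough sketch of scoring consecutive, stateful code cells (not CIBench's harness).
# A shared namespace plays the role of one notebook kernel across "cells".

def generate_cell(step: str) -> str:
    # Stub for the LLM: return code that is supposed to accomplish `step`.
    return {"load": "data = [1, 2, 3]", "mean": "result = sum(data) / len(data)"}[step]

def run_session(steps: list[str]) -> float:
    namespace: dict = {}           # state persists across cells, like a kernel
    passed = 0
    for step in steps:
        try:
            exec(generate_cell(step), namespace)  # run the model-written cell
            passed += 1
        except Exception:
            pass                   # a failing cell leaves earlier state intact
    return passed / len(steps)

print(run_session(["load", "mean"]))  # 1.0 when both cells execute without error
```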
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [56.75702900542643]
We introduce AlphaLLM for the self-improvement of Large Language Models.
It integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop.
Our experimental results show that AlphaLLM significantly enhances the performance of LLMs without additional annotations.
arXiv Detail & Related papers (2024-04-18T15:21:34Z)
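AlphaLLM's actual search, reward models, and training loop are far richer than this summary suggests; the sketch below only shows, under heavy simplification, how an MCTS loop can sit on top of stubbed LLM proposer and critic calls. `propose` and `score` are hypothetical stand-ins.

```python
# Heavily simplified sketch of coupling MCTS with an LLM proposer and critic,
# in the spirit of the self-improving loop described above; this is not
# AlphaLLM's code, and `propose` / `score` are hypothetical stubs.
import math
import random

def propose(text: str) -> list[str]:
    return [text + suffix for suffix in (" step-A", " step-B")]  # stubbed continuations

def score(text: str) -> float:
    return min(text.count("step-A") / 4, 1.0)  # stubbed critic reward in [0, 1]

class Node:
    def __init__(self, text: str, parent=None):
        self.text, self.parent = text, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node: Node, c: float = 1.4) -> float:
    exploit = node.value / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def search(root_text: str, iterations: int = 50) -> str:
    root = Node(root_text)
    for _ in range(iterations):
        node = root
        # Selection: descend through fully visited children by UCT.
        while node.children and all(ch.visits for ch in node.children):
            node = max(node.children, key=uct)
        # Expansion: ask the (stubbed) LLM for continuations once per node.
        if not node.children:
            node.children = [Node(text, node) for text in propose(node.text)]
        node = random.choice([ch for ch in node.children if ch.visits == 0] or node.children)
        # Evaluation and backpropagation of the (stubbed) critic's reward.
        reward = score(node.text)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).text

print(search("plan:"))  # the most-visited first step under the toy reward
```

In the paper's setting the evaluation would come from learned critics rather than a toy heuristic, and the searched trajectories would drive the self-improvement loop described in the summary.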
- LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools and Self-Explanations [26.340786701393768]
Interpretability tools that offer explanations in the form of a dialogue have demonstrated their efficacy in enhancing users' understanding.
Current solutions for dialogue-based explanations, however, often require external tools and modules and are not easily transferable to tasks they were not designed for.
We present an easily accessible tool that allows users to chat with any state-of-the-art large language model (LLM) about its behavior.
arXiv Detail & Related papers (2024-01-23T09:11:07Z)
- Low-code LLM: Graphical User Interface over Large Language Models [115.08718239772107]
This paper introduces a novel human-LLM interaction framework, Low-code LLM.
It incorporates six types of simple low-code visual programming interactions to achieve more controllable and stable responses.
We highlight three advantages of the low-code LLM: user-friendly interaction, controllable generation, and wide applicability.
arXiv Detail & Related papers (2023-04-17T09:27:40Z)
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes LLM-Augmenter, a system that augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
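The plug-and-play modules themselves are not detailed in this summary, so the sketch below only illustrates a generic retrieve-generate-critique-revise loop around a black-box model; `retrieve`, `call_llm`, and `feedback_score` are all hypothetical stubs, not the paper's components.

```python
# Rough sketch of augmenting a black-box LLM with retrieved evidence and
# automated feedback (a generic pattern, not LLM-Augmenter's actual modules).

def retrieve(query: str) -> list[str]:
    return ["ExampleLib 2.0 removed the legacy_load() function."]  # stubbed knowledge store

def call_llm(prompt: str) -> str:
    return "legacy_load() was removed in ExampleLib 2.0; call load() instead."  # stubbed model

def feedback_score(reply: str, evidence: list[str]) -> float:
    # Stubbed automated feedback: crude word overlap between reply and evidence.
    evidence_words = set(" ".join(evidence).lower().split())
    return len(evidence_words & set(reply.lower().split())) / max(len(evidence_words), 1)

def answer(query: str, threshold: float = 0.1, max_rounds: int = 3) -> str:
    evidence = retrieve(query)
    prompt = f"Evidence: {evidence}\nQuestion: {query}\nAnswer using only the evidence."
    reply = call_llm(prompt)
    for _ in range(max_rounds):
        if feedback_score(reply, evidence) >= threshold:
            break  # the reply is judged sufficiently grounded
        prompt += f"\nThe previous answer ('{reply}') was weakly grounded; revise it."
        reply = call_llm(prompt)
    return reply

print(answer("Why does legacy_load() fail after upgrading ExampleLib?"))
```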