Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models
- URL: http://arxiv.org/abs/2411.00492v1
- Date: Fri, 01 Nov 2024 10:06:52 GMT
- Title: Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models
- Authors: Do Xuan Long, Duong Ngoc Yen, Anh Tuan Luu, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen
- Abstract summary: We present Multi-expert Prompting, a novel enhancement of ExpertPrompting designed to improve large language model (LLM) generation.
Specifically, it guides an LLM to fulfill an input instruction by simulating multiple experts, aggregating their responses, and selecting the best among individual and aggregated responses.
Our evaluations demonstrate that Multi-expert Prompting significantly outperforms ExpertPrompting and comparable baselines in enhancing the truthfulness, factuality, informativeness, and usefulness of responses while reducing toxicity and hurtfulness.
- Abstract: We present Multi-expert Prompting, a novel enhancement of ExpertPrompting (Xu et al., 2023), designed to improve large language model (LLM) generation. Specifically, it guides an LLM to fulfill an input instruction by simulating multiple experts, aggregating their responses, and selecting the best among the individual and aggregated responses. This process is performed in a single chain of thought through our seven carefully designed subtasks derived from the Nominal Group Technique (Van de Ven and Delbecq, 1974), a well-established decision-making framework. Our evaluations demonstrate that Multi-expert Prompting significantly outperforms ExpertPrompting and comparable baselines in enhancing the truthfulness, factuality, informativeness, and usefulness of responses while reducing toxicity and hurtfulness. It further achieves state-of-the-art truthfulness, outperforming the best baseline by 8.69% with ChatGPT. Multi-expert Prompting is efficient, explainable, and highly adaptable to diverse scenarios, eliminating the need for manual prompt construction.
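The mechanism fits in a single prompt. Below is a minimal sketch, assuming a generic chat-completion API: `complete` is a hypothetical stand-in, and the seven numbered steps paraphrase the NGT-derived subtasks rather than reproduce the authors' exact wording.

```python
# Minimal sketch of Multi-expert Prompting in one chain-of-thought prompt:
# simulate experts, answer as each, merge agreements/conflicts/unique views,
# and select the best response. Step wording is a paraphrase (assumption);
# `complete` is a hypothetical stand-in for any chat-completion API.

def complete(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real chat-completion client."""
    raise NotImplementedError

def multi_expert_prompt(instruction: str, n_experts: int = 3) -> str:
    return (
        f"Instruction: {instruction}\n\n"
        f"1. Propose {n_experts} distinct experts suited to this instruction.\n"
        "2. Answer the instruction once per expert, in that expert's voice.\n"
        "3. List the points on which all experts agree.\n"
        "4. List the points on which they conflict, and resolve each conflict.\n"
        "5. Note any unique insight raised by only one expert.\n"
        "6. Merge steps 3-5 into a single aggregated answer.\n"
        "7. Compare the aggregated answer with each individual answer and "
        "output only the best response."
    )

# Usage: complete(multi_expert_prompt("Is lightning hotter than the sun's surface?"))
```

Because every subtask happens inside one completion, the method needs no iterative prompting and no per-task manual prompt engineering.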
Related papers
- PromptHive: Bringing Subject Matter Experts Back to the Forefront with Collaborative Prompt Engineering for Educational Content Creation
In this work, we introduce PromptHive, a collaborative interface for prompt authoring, designed to better connect domain knowledge with prompt engineering.
We conducted an evaluation study with ten subject matter experts in math and validated our design through two collaborative prompt-writing sessions and a learning gain study with 358 learners.
Our results elucidate the prompt iteration process and validate the tool's usability, enabling non-AI experts to craft prompts that generate content comparable to human-authored materials.
arXiv Detail & Related papers (2024-10-21T22:18:24Z)
- X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation
Large Language Models (LLMs) and Large Multimodal Models (LMMs) have been shown to be effective at enriching item descriptions.
This paper introduces a novel framework, Cross-Reflection Prompting (X-Reflect), which prompts LMMs to explicitly identify and reconcile supportive and conflicting information between text and images; a minimal sketch follows this entry.
arXiv Detail & Related papers (2024-08-27T16:10:21Z)
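A minimal sketch of the cross-reflection idea, assuming a generic multimodal completion call: `lmm` and the prompt wording are hypothetical, not the paper's actual interface.

```python
# Sketch of cross-reflection prompting: ask a multimodal model where an
# item's text and image agree or conflict, then write a reconciled
# description. `lmm` is a hypothetical multimodal-completion call.

def lmm(prompt: str, image_path: str) -> str:
    """Hypothetical large-multimodal-model call; swap in a real client."""
    raise NotImplementedError

def cross_reflect(item_text: str, image_path: str) -> str:
    prompt = (
        f"Item description: {item_text}\n"
        "Examine the attached item image, then:\n"
        "1. List information supported by BOTH the text and the image.\n"
        "2. List information where the text and the image CONFLICT.\n"
        "3. Resolve each conflict, stating which source you trust and why.\n"
        "4. Write a final enriched description using only reconciled facts."
    )
    return lmm(prompt, image_path)
```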
- QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tunes a small pretrained language model to generate optimal prompts tailored to the input queries.
We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks.
Experiments across various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios; a simplified sketch of the offline idea follows this entry.
arXiv Detail & Related papers (2024-08-20T03:06:48Z)
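The paper fine-tunes a small language model with multi-loop offline RL; the sketch below collapses that to a crude stand-in (an assumption, not their method): reuse logged (query, prompt, score) triples from past benchmark runs and pick the prompt that scored best on lexically similar queries.

```python
# Crude stand-in for QPO's offline idea: choose a query-dependent prompt
# from logged (query, prompt, score) triples -- the by-product of
# benchmarking diverse prompts. The paper instead trains a small LM with
# multi-loop offline RL; this lookup is only an illustration.

from collections import defaultdict

def best_prompt_for(query: str, offline_log: list[tuple[str, str, float]]) -> str:
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))

    # Credit each candidate prompt with the scores it earned on similar queries.
    totals: dict[str, list[float]] = defaultdict(list)
    for logged_query, prompt, score in offline_log:
        if overlap(query, logged_query) > 0:
            totals[prompt].append(score)
    if not totals:
        return "Answer the question step by step."  # generic fallback prompt
    return max(totals, key=lambda p: sum(totals[p]) / len(totals[p]))
```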
- On the Worst Prompt Performance of Large Language Models
The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark consisting of semantically equivalent, case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance; a sketch of the worst-case measurement follows this entry.
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
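The core measurement is simple to state: score a model on several semantically equivalent paraphrases of one query and report the worst case alongside the average. A sketch, with an illustrative paraphrase set and a dummy scorer standing in for a real judge:

```python
# Worst-prompt measurement in miniature: evaluate the same case under
# semantically equivalent paraphrases and report worst/best/average.
# `score_fn` stands in for a real scorer (e.g., a judge model).

from statistics import mean

def robustness_report(paraphrases: list[str], score_fn) -> dict[str, float]:
    scores = [score_fn(p) for p in paraphrases]
    return {"worst": min(scores), "best": max(scores), "average": mean(scores)}

# Illustrative case with a dummy scorer:
case = [
    "Summarize the article above.",
    "Give a brief summary of the article.",
    "Write a short summary of the preceding article.",
]
print(robustness_report(case, score_fn=lambda p: len(p) % 7 / 6))
```

The gap between "worst" and "average" is the kind of variability the paper reports.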
- POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models
We present POEM, a visual analytics system that facilitates efficient prompt engineering for large language models (LLMs).
The system enables users to explore interaction patterns across modalities at varying levels of detail, building a comprehensive understanding of the multimodal knowledge elicited by various prompts.
arXiv Detail & Related papers (2024-06-06T08:21:30Z)
- PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization
PromptAgent is an optimization method that crafts prompts matching the quality of those handcrafted by experts.
Inspired by human-like trial-and-error exploration, PromptAgent induces precise expert-level insights and in-depth instructions.
We apply PromptAgent to 12 tasks spanning three practical domains; a loose sketch of the trial-and-error loop follows this entry.
arXiv Detail & Related papers (2023-10-25T07:47:01Z)
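PromptAgent frames prompt optimization as strategic planning over candidate revisions. The sketch below collapses that planning to greedy hill-climbing (an assumption, for brevity); `evaluate` and `revise` are hypothetical callables, the latter standing in for an LLM that rewrites a prompt given its observed errors.

```python
# Loose sketch of a trial-and-error prompt-optimization loop in the spirit
# of PromptAgent, reduced to greedy hill-climbing. `evaluate` scores a
# prompt on dev examples; `revise` rewrites it from error feedback.

def optimize_prompt(seed: str, evaluate, revise, steps: int = 8) -> str:
    best_prompt, best_score = seed, evaluate(seed)
    for _ in range(steps):
        candidate = revise(best_prompt)   # rewrite using observed failures
        score = evaluate(candidate)
        if score > best_score:            # keep only improving revisions
            best_prompt, best_score = candidate, score
    return best_prompt
```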
- PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine
We propose PREFER, a simple, universal, and automatic prompt-ensembling method that addresses the limitations of prior approaches through a feedback-reflect-refine cycle.
PREFER achieves state-of-the-art performance on multiple types of tasks by a significant margin; a sketch of the cycle follows this entry.
arXiv Detail & Related papers (2023-08-23T09:46:37Z)
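Reading the title literally, the cycle can be sketched as follows; `solves` and `llm_refine` are hypothetical stand-ins (does a prompt handle an example; rewrite a prompt given failure cases), not the paper's interfaces.

```python
# Feedback-reflect-refine in miniature: grow a prompt ensemble where each
# new prompt is refined from the examples the current ensemble still fails.
# `solves` and `llm_refine` are hypothetical callables.

def build_ensemble(seed_prompt, examples, solves, llm_refine, rounds=3):
    ensemble, prompt = [], seed_prompt
    for _ in range(rounds):
        ensemble.append(prompt)
        failures = [ex for ex in examples
                    if not any(solves(p, ex) for p in ensemble)]  # feedback
        if not failures:
            break                                 # ensemble covers everything
        prompt = llm_refine(prompt, failures)     # reflect on errors, refine
    return ensemble
```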
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
We build a multi-agent referee team, ChatEval, that autonomously discusses and evaluates the quality of responses generated by different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessment; a sketch of the debate loop follows this entry.
arXiv Detail & Related papers (2023-08-14T15:13:04Z)
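A minimal sketch of the debate loop: a few referee personas exchange views over several rounds, then vote. `chat` is a hypothetical single-turn LLM call, and the personas and prompt wording are illustrative, not the paper's setup.

```python
# ChatEval-style evaluation in miniature: referee personas debate two
# candidate answers across rounds, then cast majority votes. `chat` is a
# hypothetical LLM call; personas and wording are illustrative.

def chat(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real chat-completion client."""
    raise NotImplementedError

def chateval(question: str, answer_a: str, answer_b: str,
             personas=("strict grader", "domain expert", "casual reader"),
             rounds: int = 2) -> str:
    transcript = ""
    for _ in range(rounds):
        for persona in personas:
            turn = chat(
                f"You are a {persona} judging two answers.\n"
                f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
                f"Debate so far:{transcript}\n"
                "Respond to the other referees, then say which answer is better."
            )
            transcript += f"\n[{persona}] {turn}"
    votes = [chat(f"Given this debate:{transcript}\nReply with only 'A' or 'B'.")
             for _ in personas]
    return max(set(votes), key=votes.count)  # majority vote
```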
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.