AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence
- URL: http://arxiv.org/abs/2404.11826v2
- Date: Sat, 01 Feb 2025 02:22:04 GMT
- Title: AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence
- Authors: Minbeom Kim, Hwanhee Lee, Joonsuk Park, Hwaran Lee, Kyomin Jung
- Abstract summary: We introduce AdvisorQA, the first benchmark developed to assess LLMs' capability in offering advice for deeply personalized concerns. We've completed a benchmark encompassing daily life questions, diverse corresponding responses, and majority vote ranking to train our helpfulness metric. Baseline experiments validate the efficacy of AdvisorQA through our helpfulness metric, GPT-4, and human evaluation.
- Score: 28.732847229006264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As the integration of large language models into daily life is on the rise, there is a clear gap in benchmarks for advising on subjective and personal dilemmas. To address this, we introduce AdvisorQA, the first benchmark developed to assess LLMs' capability in offering advice for deeply personalized concerns, utilizing the LifeProTips subreddit forum. This forum features a dynamic interaction where users post advice-seeking questions, receiving an average of 8.9 pieces of advice per query, with 164.2 upvotes from hundreds of users, embodying a collective intelligence framework. Therefore, we've completed a benchmark encompassing daily life questions, diverse corresponding responses, and majority vote ranking to train our helpfulness metric. Baseline experiments validate the efficacy of AdvisorQA through our helpfulness metric, GPT-4, and human evaluation, analyzing phenomena beyond the trade-off between helpfulness and harmlessness. AdvisorQA marks a significant leap in enhancing QA systems for providing personalized, empathetic advice, showcasing LLMs' improved understanding of human subjectivity.
Related papers
- Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering [57.12316804290369]
Personalization is essential for adapting question answering systems to user-specific information needs. We propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without requiring task-specific fine-tuning. PoT consistently outperforms competitive baselines, achieving up to a 13.1% relative improvement.
arXiv Detail & Related papers (2025-09-23T14:44:46Z)
- CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering [1.0262304700896199]
We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs). The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and human therapists on patient questions from the public forum CounselChat. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance.
arXiv Detail & Related papers (2025-06-10T08:53:06Z)
- Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback [51.26493826461026]
We propose Psi-Arena, an interactive framework for comprehensive assessment and optimization of large language models (LLMs). Psi-Arena features realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients. Experiments across eight state-of-the-art LLMs show significant performance variations in different real-world scenarios and evaluation perspectives.
arXiv Detail & Related papers (2025-05-06T08:22:51Z)
- Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions [25.158868133182025]
We present a method for evaluating the output of generative large language models (LLMs).
Our scoring method correlates with the preferences of human experts.
We validate it by investigating the well-known fact that the quality of generated answers improves with the size of the model.
arXiv Detail & Related papers (2024-08-19T09:27:45Z)
- Leveraging Topic Specificity and Social Relationships for Expert Finding in Community Question Answering Platforms [5.723916517485655]
We present TUEF, a Topic-oriented User-Interaction model for Expert Finding.
TUEF integrates content and social data by constructing a multi-layer graph.
Experiments show TUEF outperforms all competitors with a minimum performance boost of 42.42% in P@1, 32.73% in NDCG@3, 21.76% in R@5, and 29.81% in MRR.
arXiv Detail & Related papers (2024-07-04T15:50:18Z)
- MACAROON: Training Vision-Language Models To Be Your Engaged Partners [95.32771929749514]
Large vision-language models (LVLMs) generate detailed responses even when questions are ambiguous or unlabeled.
In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners.
We introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions.
arXiv Detail & Related papers (2024-06-20T09:27:33Z)
- QAGCF: Graph Collaborative Filtering for Q&A Recommendation [58.21387109664593]
Question and answer (Q&A) platforms usually recommend question-answer pairs to meet users' knowledge acquisition needs.
This makes user behaviors more complex, and presents two challenges for Q&A recommendation.
We introduce Question & Answer Graph Collaborative Filtering (QAGCF), a graph neural network model that creates separate graphs for collaborative and semantic views.
arXiv Detail & Related papers (2024-06-07T10:52:37Z)
- K-ESConv: Knowledge Injection for Emotional Support Dialogue Systems via Prompt Learning [83.19215082550163]
We propose K-ESConv, a novel prompt-learning-based knowledge injection method for emotional support dialogue systems.
We evaluate our model on the emotional support dataset ESConv, where the model retrieves and incorporates knowledge from an external professional emotional Q&A forum.
arXiv Detail & Related papers (2023-12-16T08:10:10Z)
- A Critical Evaluation of Evaluations for Long-form Question Answering [48.51361567469683]
Long-form question answering (LFQA) enables answering a wide range of questions, but its flexibility poses enormous challenges for evaluation.
We perform the first targeted study of the evaluation of long-form answers, covering both human and automatic evaluation practices.
arXiv Detail & Related papers (2023-05-29T16:54:24Z)
- Continually Improving Extractive QA via Human Feedback [59.49549491725224]
We study continually improving an extractive question answering (QA) system via human user feedback.
We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time.
arXiv Detail & Related papers (2023-05-21T14:35:32Z)
- FEBR: Expert-Based Recommendation Framework for beneficial and personalized content [77.86290991564829]
We propose FEBR (Expert-Based Recommendation Framework), an apprenticeship learning framework to assess the quality of the recommended content.
The framework exploits the demonstrated trajectories of an expert (assumed to be reliable) in a recommendation evaluation environment, to recover an unknown utility function.
We evaluate the performance of our solution through a user interest simulation environment (using RecSim).
arXiv Detail & Related papers (2021-07-17T18:21:31Z)
- An Empirical Study of Clarifying Question-Based Systems [15.767515065224016]
We conduct an online experiment by deploying an experimental system, which interacts with users by asking clarifying questions against a product repository.
We collect both implicit interaction behavior data and explicit feedback from users showing that: (a) users are willing to answer a good number of clarifying questions (11-21 on average), but not many more than that.
arXiv Detail & Related papers (2020-08-01T15:10:11Z)
- Review-guided Helpful Answer Identification in E-commerce [38.276241153439955]
Product-specific community question answering platforms can greatly help address the concerns of potential customers.
The user-provided answers on such platforms often vary widely in quality.
Helpfulness votes from the community can indicate the overall quality of the answer, but they are often missing.
arXiv Detail & Related papers (2020-03-13T11:34:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.