AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses
- URL: http://arxiv.org/abs/2410.01246v1
- Date: Wed, 2 Oct 2024 05:22:07 GMT
- Title: AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses
- Authors: Xiaotian Lu, Jiyi Li, Koh Takeuchi, Hisashi Kashima
- Abstract summary: We propose a method that leverages large language models to evaluate answers to open-ended questions.
We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4.
Our results indicate that our approach aligns more closely with human judgment than the four baselines.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Question answering (QA) tasks have been extensively studied in the field of natural language processing (NLP). Answers to open-ended questions are highly diverse and difficult to quantify, and cannot simply be evaluated as correct or incorrect, unlike closed-ended questions with definitive answers. While large language models (LLMs) have demonstrated strong capabilities across various tasks, they exhibit relatively weaker performance in evaluating answers to open-ended questions. In this study, we propose a method that leverages LLMs and the analytic hierarchy process (AHP) to assess answers to open-ended questions. We utilized LLMs to generate multiple evaluation criteria for a question. Subsequently, answers were subjected to pairwise comparisons under each criterion with LLMs, and scores for each answer were calculated with AHP. We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4. Our results indicate that our approach aligns more closely with human judgment than the four baselines. Additionally, we explored the impact of the number of criteria, variations in models, and differences in datasets on the results.
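The abstract outlines a two-step pipeline: an LLM first generates evaluation criteria for a question, then answers are compared pairwise under each criterion and the comparisons are aggregated with AHP. The sketch below illustrates one plausible realization of the AHP scoring step, assuming a hypothetical `llm_compare` judge that returns Saaty-scale preferences, equal criterion weights, and the geometric-mean method for deriving priorities; the paper's actual prompts and aggregation details are not specified in the abstract.

```python
# Minimal sketch of AHP-style scoring of open-ended answers. The LLM judge
# (`llm_compare`) is a hypothetical placeholder, not the paper's prompt.
import numpy as np

def llm_compare(question: str, criterion: str, ans_i: str, ans_j: str) -> float:
    """Hypothetical LLM call: return a Saaty-scale value in [1/9, 9] expressing
    how strongly ans_i is preferred to ans_j under `criterion` (1 = equal)."""
    raise NotImplementedError("wrap your LLM API of choice here")

def priority_vector(A: np.ndarray) -> np.ndarray:
    """Derive AHP priorities from a reciprocal pairwise-comparison matrix using
    the geometric-mean (row) method, a common approximation of the principal
    eigenvector."""
    gm = np.prod(A, axis=1) ** (1.0 / A.shape[0])
    return gm / gm.sum()

def ahp_scores(question, answers, criteria, criterion_weights=None):
    """Score each answer as the criterion-weighted sum of its AHP priorities."""
    n, m = len(answers), len(criteria)
    if criterion_weights is None:
        criterion_weights = np.full(m, 1.0 / m)  # assume equal criterion weights
    scores = np.zeros(n)
    for w, crit in zip(criterion_weights, criteria):
        A = np.ones((n, n))  # pairwise-comparison matrix for this criterion
        for i in range(n):
            for j in range(i + 1, n):
                a_ij = llm_compare(question, crit, answers[i], answers[j])
                A[i, j], A[j, i] = a_ij, 1.0 / a_ij  # enforce reciprocity
        scores += w * priority_vector(A)
    return scores  # higher = better under the weighted criteria
```

In this setup each per-criterion comparison matrix yields a priority vector over the answers, and the final score of an answer is the criterion-weighted sum of those priorities; criterion weights could instead be derived from a criteria-level comparison matrix in the same way.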
Related papers
- Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions [10.783827859678892]
We introduce Compound Question Synthesis (CQ-Syn) to create the Compound-QA benchmark.
This benchmark is derived from existing QA datasets, annotated with proprietary large language models.
It evaluates LLM capabilities along three dimensions: understanding, reasoning, and knowledge.
arXiv Detail & Related papers (2024-11-15T13:12:29Z) - AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z) - CLR-Bench: Evaluating Large Language Models in College-level Reasoning [17.081788240112417]
Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks.
We present CLR-Bench to comprehensively evaluate the LLMs in complex college-level reasoning.
arXiv Detail & Related papers (2024-10-23T04:55:08Z) - LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to its diverse potential answers and the lack of an objective criterion.
Large Language Models (LLMs) have been adopted for NFQA evaluation owing to their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that utilizes LLMs to rank candidate answers against a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z) - Multi-LLM QA with Embodied Exploration [55.581423861790945]
We investigate the use of Multi-Embodied LLM Explorers (MELE) for question-answering in an unknown environment.
Multiple LLM-based agents independently explore and then answer queries about a household environment.
We analyze different aggregation methods to generate a single, final answer for each query.
arXiv Detail & Related papers (2024-06-16T12:46:40Z) - LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z) - Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering [67.94354589215637]
Large Language Models (LLMs) are widely used for knowledge-seeking yet suffer from hallucinations.
In this paper, we probe the LLMs' knowledge boundary (KB) with semi-open-ended questions (SoeQ).
We find that GPT-4 performs poorly on SoeQ and is often unaware of its KB.
Our auxiliary model, LLaMA-2-13B, is effective in discovering more ambiguous answers.
arXiv Detail & Related papers (2024-05-23T10:00:14Z) - SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs [85.54906813106683]
We propose a simple yet effective framework to enhance open-domain question answering (ODQA) with large language models (LLMs).
SuRe helps LLMs predict more accurate answers for a given question, answers that are well supported by the summarized retrievals (SuRe).
Experimental results on diverse ODQA benchmarks demonstrate the superiority of SuRe, with improvements of up to 4.6% in exact match (EM) and 4.0% in F1 score over standard prompting approaches.
arXiv Detail & Related papers (2024-04-17T01:15:54Z) - Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study [5.823006266363981]
Large language models (LLMs) have opened a promising avenue for patients to get their questions answered.
We generated responses to 53 questions from four LLMs: GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini.
We find that GPT-4's responses are more accurate, helpful, relevant, and safe.
arXiv Detail & Related papers (2024-01-23T22:03:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.