Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions
- URL: http://arxiv.org/abs/2411.10163v1
- Date: Fri, 15 Nov 2024 13:12:29 GMT
- Title: Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions
- Authors: Yutao Hou, Yajing Luo, Zhiwen Ruan, Hongru Wang, Weifeng Ge, Yun Chen, Guanhua Chen
- Abstract summary: We introduce Compound Question Synthesis (CQ-Syn) to create the Compound-QA benchmark.
This benchmark is derived from existing QA datasets, annotated with proprietary large language models.
It evaluates LLM capability along three dimensions: understanding, reasoning, and knowledge.
- Score: 10.783827859678892
- Abstract: Large language models (LLMs) demonstrate remarkable performance across various tasks, prompting researchers to develop diverse evaluation benchmarks. However, existing benchmarks typically measure the ability of LLMs to respond to individual questions, neglecting the complex interactions in real-world applications. In this paper, we introduce Compound Question Synthesis (CQ-Syn) to create the Compound-QA benchmark, focusing on compound questions with multiple sub-questions. This benchmark is derived from existing QA datasets, annotated with proprietary LLMs and verified by humans for accuracy. It encompasses five categories: Factual-Statement, Cause-and-Effect, Hypothetical-Analysis, Comparison-and-Selection, and Evaluation-and-Suggestion. It evaluates LLM capability along three dimensions: understanding, reasoning, and knowledge. Our assessment of eight open-source LLMs using Compound-QA reveals distinct patterns in their responses to compound questions, which are significantly poorer than their responses to non-compound questions. Additionally, we investigate various methods to enhance LLM performance on compound questions. The results indicate that these approaches significantly improve the models' comprehension and reasoning abilities on compound questions.
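As a rough illustration of the compound-vs-non-compound comparison described in the abstract, the sketch below shows one way a benchmark item and evaluation loop could be organized. The record fields, the prompt template, and the `ask_llm`/`score` callables are assumptions made for illustration; they are not the released Compound-QA format or the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List

# The five compound-question categories named in the abstract.
CATEGORIES = [
    "Factual-Statement",
    "Cause-and-Effect",
    "Hypothetical-Analysis",
    "Comparison-and-Selection",
    "Evaluation-and-Suggestion",
]


@dataclass
class CompoundQAItem:
    """Hypothetical record layout for one benchmark example (field names are assumptions)."""
    category: str                 # one of CATEGORIES
    sub_questions: List[str]      # the individual sub-questions
    reference_answers: List[str]  # one reference answer per sub-question

    def compound_prompt(self) -> str:
        """Join the sub-questions into a single compound question."""
        numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(self.sub_questions))
        return f"Answer every part of the following question:\n{numbered}"


def compare_settings(item: CompoundQAItem,
                     ask_llm: Callable[[str], str],
                     score: Callable[[str, str], float]) -> dict:
    """Score one model on the compound form vs. the same sub-questions asked one by one."""
    # Compound setting: one prompt covering all sub-questions.
    compound_answer = ask_llm(item.compound_prompt())
    compound_score = sum(score(compound_answer, ref)
                         for ref in item.reference_answers) / len(item.reference_answers)

    # Non-compound setting: each sub-question asked and scored separately.
    individual_scores = [score(ask_llm(q), ref)
                         for q, ref in zip(item.sub_questions, item.reference_answers)]
    non_compound_score = sum(individual_scores) / len(individual_scores)

    return {"compound": compound_score, "non_compound": non_compound_score}
```

With a real model behind `ask_llm` and any answer-similarity metric behind `score`, the gap between the two numbers is the kind of compound-vs-non-compound difference the abstract reports.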
Related papers
- AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z)
- AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses [26.850344968677582]
We propose a method that leverages large language models to evaluate answers to open-ended questions.
We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4.
Our results indicate that our approach more closely aligns with human judgment compared to the four baselines.
arXiv Detail & Related papers (2024-10-02T05:22:07Z)
- LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to diverse potential answers and no objective criterion.
Large Language Models (LLMs) have been adopted for NFQA evaluation due to their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that uses LLMs to rank candidate answers within a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
- MMRel: A Relation Understanding Benchmark in the MLLM Era [72.95901753186227]
Multi-Modal Relation Understanding (MMRel) is a benchmark that features large-scale, high-quality, and diverse data on inter-object relations.
MMRel is ideal for evaluating MLLMs on relation understanding, as well as for fine-tuning MLLMs to enhance relation comprehension capability.
arXiv Detail & Related papers (2024-06-13T13:51:59Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- Can multiple-choice questions really be useful in detecting the abilities of LLMs? [15.756543037102256]
Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs).
The misalignment between the task and the evaluation method demands a thoughtful analysis of MCQ's efficacy.
We evaluate nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English.
arXiv Detail & Related papers (2024-03-26T14:43:48Z)
- Let LLMs Take on the Latest Challenges! A Chinese Dynamic Question Answering Benchmark [69.3415799675046]
We introduce CDQA, a Chinese Dynamic QA benchmark containing question-answer pairs related to the latest news on the Chinese Internet.
We obtain high-quality data through a pipeline that combines humans and models.
We have also evaluated and analyzed mainstream and advanced Chinese LLMs on CDQA.
arXiv Detail & Related papers (2024-02-29T15:22:13Z)
- Cofca: A Step-Wise Counterfactual Multi-hop QA benchmark [39.64489055580211]
We introduce a Step-wise Counterfactual benchmark (CofCA), a novel evaluation benchmark consisting of factual data and counterfactual data.
Our experimental results reveal a significant performance gap between Wikipedia-based factual data and counterfactual data, pointing to data contamination issues in existing benchmarks.
arXiv Detail & Related papers (2024-02-19T08:12:30Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)