ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented Argumentation with LLM Judges
- URL: http://arxiv.org/abs/2412.05206v1
- Date: Fri, 06 Dec 2024 17:35:52 GMT
- Title: ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented Argumentation with LLM Judges
- Authors: Kaustubh D. Dhole, Kai Shu, Eugene Agichtein
- Abstract summary: Computational argumentation has become increasingly important in today's polarized environment.
We propose a new benchmark featuring long and complex human-authored arguments on debated topics, grounded in real-world websites.
Our proposed LLM Judges and the ConQRet benchmark can enable rapid progress in computational argumentation.
- Score: 23.179246872272362
- Abstract: Computational argumentation, which involves generating answers or summaries for controversial topics like abortion bans and vaccination, has become increasingly important in today's polarized environment. Sophisticated LLM capabilities offer the potential to provide nuanced, evidence-based answers to such questions through Retrieval-Augmented Argumentation (RAArg), leveraging real-world evidence for high-quality, grounded arguments. However, evaluating RAArg remains challenging, as human evaluation is costly and difficult for complex, lengthy answers on complicated topics. At the same time, re-using existing argumentation datasets is no longer sufficient, as they lack long, complex arguments and realistic evidence from potentially misleading sources, limiting holistic evaluation of retrieval effectiveness and argument quality. To address these gaps, we investigate automated evaluation methods using multiple fine-grained LLM judges, providing better and more interpretable assessments than traditional single-score metrics and even previously reported human crowdsourcing. To validate the proposed techniques, we introduce ConQRet, a new benchmark featuring long and complex human-authored arguments on debated topics, grounded in real-world websites, allowing an exhaustive evaluation across retrieval effectiveness, argument quality, and groundedness. We validate our LLM Judges on a prior dataset and the new ConQRet benchmark. Our proposed LLM Judges and the ConQRet benchmark can enable rapid progress in computational argumentation and can be naturally extended to other complex retrieval-augmented generation tasks.
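The fine-grained judging described in the abstract can be approximated by running one focused LLM judge per evaluation dimension instead of asking for a single holistic score. The sketch below is a minimal illustration under that reading, not the authors' released implementation; `call_llm` is a hypothetical stand-in for any chat-completion client, and the dimension prompts are paraphrased from the three axes named above (retrieval effectiveness, argument quality, groundedness).

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call; returns raw model text."""
    raise NotImplementedError("plug in your LLM client here")

# One focused judge per dimension, instead of a single holistic score.
DIMENSIONS = {
    "retrieval_effectiveness": "Do the retrieved documents contain evidence relevant to the debated topic?",
    "argument_quality": "Is the argument coherent, well-structured, and persuasive?",
    "groundedness": "Is every claim in the argument supported by the cited documents?",
}

def judge(topic: str, argument: str, documents: list[str]) -> dict:
    """Run one LLM judge per dimension and collect a score plus rationale for each."""
    docs_block = "\n---\n".join(documents)
    results = {}
    for name, question in DIMENSIONS.items():
        prompt = (
            f"Topic: {topic}\n"
            f"Retrieved documents:\n{docs_block}\n"
            f"Argument:\n{argument}\n\n"
            f"{question}\n"
            'Answer as JSON: {"score": <integer 1-5>, "rationale": "<short explanation>"}'
        )
        results[name] = json.loads(call_llm(prompt))
    return results
```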
Related papers
- Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings [14.355271969637139]
This work lifts several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation paradigm.
Our goal is to benchmark, under more realistic scenarios, RAG-based methods for the generation of verdicts.
arXiv Detail & Related papers (2024-12-19T18:57:11Z)
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
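JudgeRank's agentic reranking is only summarized above; the sketch below shows the generic pattern of pointwise, zero-shot LLM reranking over first-stage candidates, with reasoning before a final grade. The prompt wording, grading scale, and `call_llm` helper are illustrative assumptions, not JudgeRank's actual prompts or pipeline.

```python
import re

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; returns the model's raw text reply."""
    raise NotImplementedError("plug in your LLM client here")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    """Pointwise zero-shot reranking: the LLM reasons about each candidate,
    then emits a relevance grade used to reorder the list."""
    scored = []
    for doc in candidates:
        reply = call_llm(
            f"Query: {query}\n\nDocument: {doc}\n\n"
            "Reason step by step about whether the document helps answer the query, "
            "then finish with a line of the form 'Grade: <0-3>'."
        )
        match = re.search(r"Grade:\s*(\d)", reply)
        grade = int(match.group(1)) if match else 0  # default to 0 if the parse fails
        scored.append((grade, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```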
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
- Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
Large Language Models (LLMs) are capable of generating human-like conversations.
Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs.
We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges.
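The reference-guided, multi-judge idea above can be sketched as a simple majority vote across several judge models comparing a candidate answer with a reference. The model names and prompt below are placeholders; this is an assumed simplification, not the paper's exact aggregation scheme.

```python
from collections import Counter

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical client that routes the prompt to the named judge model."""
    raise NotImplementedError

JUDGE_MODELS = ["judge-model-a", "judge-model-b", "judge-model-c"]  # placeholder names

def reference_guided_verdict(question: str, candidate: str, reference: str) -> str:
    """Each judge compares the candidate against the reference answer; majority vote wins."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Is the candidate consistent with the reference? Reply with exactly one word: "
        "'correct' or 'incorrect'."
    )
    votes = [call_llm(model, prompt).strip().lower() for model in JUDGE_MODELS]
    return Counter(votes).most_common(1)[0][0]
```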
arXiv Detail & Related papers (2024-08-17T16:01:45Z)
- Evaluating the Retrieval Component in LLM-Based Question Answering Systems [1.7013938542585922]
This study proposes a baseline for evaluating retrievers in Retrieval-Augmented Generation (RAG)-based chatbots.
Our findings demonstrate that this evaluation framework provides a clearer picture of how the retriever performs.
Our method accounts for LLMs' ability to ignore irrelevant contexts, as well as for potential errors and hallucinations in their responses.
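The summary above does not spell out the exact protocol, so the sketch below shows one plausible end-to-end reading: credit the retriever only when the retrieved context turns a wrong closed-book answer into a correct one, which implicitly accounts for the LLM's ability to answer without (or despite) the context. Treat it as an assumption, not the paper's method; `retriever` and `answers_match` are hypothetical callables.

```python
from typing import Callable

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def retrieval_helpfulness(
    qa_pairs: list[tuple[str, str]],
    retriever: Callable[[str], list[str]],
    answers_match: Callable[[str, str], bool],
) -> float:
    """Credit the retriever only when the retrieved context turns a wrong
    closed-book answer into a correct grounded one."""
    helped = 0
    for question, gold in qa_pairs:
        context = "\n".join(retriever(question))
        with_ctx = call_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer briefly:")
        without_ctx = call_llm(f"Question: {question}\nAnswer briefly:")
        if answers_match(with_ctx, gold) and not answers_match(without_ctx, gold):
            helped += 1
    return helped / len(qa_pairs)
```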
arXiv Detail & Related papers (2024-06-10T16:46:22Z)
- Are Large Language Models Reliable Argument Quality Annotators? [7.966402845339264]
We study the potential of using state-of-the-art large language models (LLMs) as proxies for argument quality annotators.
Our findings highlight that LLMs can produce consistent annotations, with a moderately high agreement with human experts.
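Agreement between LLM and expert annotations of the kind reported above is commonly quantified with chance-corrected statistics; the snippet below computes quadratic-weighted Cohen's kappa on made-up labels purely to illustrate the computation, and says nothing about the metric or data actually used in that paper.

```python
from sklearn.metrics import cohen_kappa_score

# Made-up 1-5 argument quality labels, purely to show the computation.
expert_labels = [4, 3, 4, 2, 5, 4, 2, 3]
llm_labels    = [4, 3, 5, 2, 4, 4, 1, 3]

kappa = cohen_kappa_score(expert_labels, llm_labels, weights="quadratic")
print(f"Quadratic-weighted Cohen's kappa: {kappa:.2f}")
```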
arXiv Detail & Related papers (2024-04-15T11:54:27Z)
- Argument Quality Assessment in the Age of Instruction-Following Large Language Models [45.832808321166844]
A critical task in applications of computational argumentation is the assessment of an argument's quality.
We identify the diversity of quality notions and the subjectiveness of their perception as the main hurdles towards substantial progress on argument quality assessment.
We argue that the capabilities of instruction-following large language models (LLMs) to leverage knowledge across contexts enable a much more reliable assessment.
arXiv Detail & Related papers (2024-03-24T10:43:21Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
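The proxy-question protocol can be sketched as follows: the evaluator answers each pre-annotated proxy question using only the generated text, and the generation is scored by how many answers come out right. The naive substring check and `call_llm` helper are simplifications for illustration, not ProxyQA's actual answer-checking procedure.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical evaluator-LLM client."""
    raise NotImplementedError

def proxy_question_score(generated_text: str, proxy_qas: list[tuple[str, str]]) -> float:
    """Score a long-form generation by how many pre-annotated proxy questions
    the evaluator answers correctly using only that generation as context."""
    correct = 0
    for question, gold_answer in proxy_qas:
        reply = call_llm(
            "Answer the question using only the following text.\n\n"
            f"Text:\n{generated_text}\n\n"
            f"Question: {question}\nAnswer:"
        )
        # Naive check: the gold answer string must appear in the evaluator's reply.
        if gold_answer.strip().lower() in reply.lower():
            correct += 1
    return correct / len(proxy_qas)
```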
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- Argue with Me Tersely: Towards Sentence-Level Counter-Argument Generation [62.069374456021016]
We present the ArgTersely benchmark for sentence-level counter-argument generation.
We also propose Arg-LlaMA for generating high-quality counter-arguments.
arXiv Detail & Related papers (2023-12-21T06:51:34Z)
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG).
Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection.
It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
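Self-RAG trains the model itself to emit retrieval and critique (reflection) decisions; the sketch below only approximates that control flow with plain prompting of an off-the-shelf model (draft, self-critique, retrieve on demand, revise), and is not the released Self-RAG model or training recipe.

```python
from typing import Callable

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def self_reflective_generate(
    question: str,
    retriever: Callable[[str], list[str]],
    max_rounds: int = 2,
) -> str:
    """Prompted approximation of a draft -> self-critique -> retrieve -> revise loop."""
    answer = call_llm(f"Question: {question}\nAnswer:")
    for _ in range(max_rounds):
        critique = call_llm(
            f"Question: {question}\nDraft answer: {answer}\n"
            "Does the draft contain claims that need supporting evidence? "
            "Reply with exactly one word: 'retrieve' or 'ok'."
        )
        if "retrieve" not in critique.lower():
            break
        evidence = "\n".join(retriever(question))
        answer = call_llm(
            f"Question: {question}\nEvidence:\n{evidence}\n"
            "Rewrite the answer so every claim is supported by the evidence, citing it where used."
        )
    return answer
```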
arXiv Detail & Related papers (2023-10-17T18:18:32Z)
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
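The multi-agent referee idea can be sketched as several persona-prompted agents taking turns over a shared transcript before a final verdict is requested. The personas, round count, and prompts below are illustrative assumptions rather than ChatEval's actual configuration.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

# Placeholder referee personas; the real system's roles may differ.
PERSONAS = ["a strict fact-checker", "a style and clarity critic", "a neutral moderator"]

def debate_evaluate(task: str, response: str, rounds: int = 2) -> str:
    """Persona-prompted agents discuss the response over a shared transcript,
    then a final verdict is distilled from the panel's discussion."""
    transcript = ""
    for _ in range(rounds):
        for persona in PERSONAS:
            turn = call_llm(
                f"You are {persona} on an evaluation panel.\n"
                f"Task: {task}\nResponse under review: {response}\n"
                f"Discussion so far:{transcript}\n"
                "Add your assessment in two or three sentences."
            )
            transcript += f"\n[{persona}] {turn}"
    return call_llm(
        f"Panel discussion:{transcript}\n"
        "Summarize the panel's view and give a final score from 1 to 10."
    )
```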
arXiv Detail & Related papers (2023-08-14T15:13:04Z)
- Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential for developing more powerful conversational recommender systems (CRSs).
In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive evaluation approach based on LLMs, named iEvaLM, which harnesses LLM-based user simulators.
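Interactive evaluation with an LLM user simulator, as described above, can be sketched as a loop in which a simulated user with a hidden target converses with the recommender and success is recorded if the target is suggested within a turn budget. The `recommender` callable, prompts, and success criterion are assumptions for illustration, not the iEvaLM protocol.

```python
from typing import Callable

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client playing the simulated user."""
    raise NotImplementedError

def simulate_session(
    recommender: Callable[[list[tuple[str, str]]], str],  # system under evaluation
    target_item: str,
    max_turns: int = 5,
) -> bool:
    """A simulated user with a hidden target chats with the recommender;
    the session counts as a success if the target is suggested within the budget."""
    history: list[tuple[str, str]] = []
    user_msg = call_llm(
        f"You secretly want a recommendation like '{target_item}'. "
        "Open the conversation by describing your preferences without naming it."
    )
    for _ in range(max_turns):
        history.append(("user", user_msg))
        recommendation = recommender(history)
        history.append(("system", recommendation))
        if target_item.lower() in recommendation.lower():
            return True
        user_msg = call_llm(
            f"The system suggested: {recommendation}. "
            f"As the user seeking something like '{target_item}', reply with brief feedback."
        )
    return False
```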
arXiv Detail & Related papers (2023-05-22T15:12:43Z)