Related papers: Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models

Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models

URL: http://arxiv.org/abs/2406.13232v1
Date: Wed, 19 Jun 2024 05:43:02 GMT
Title: Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models
Authors: Akchay Srivastava, Atif Memon,
Abstract summary: Open Domain Question Answering (ODQA) within natural language processing involves building systems that answer factual questions using large-scale knowledge corpora. High-quality datasets are used to train models on realistic scenarios. Standardized metrics facilitate comparisons between different ODQA systems.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Open Domain Question Answering (ODQA) within natural language processing involves building systems that answer factual questions using large-scale knowledge corpora. Recent advances stem from the confluence of several factors, such as large-scale training datasets, deep learning techniques, and the rise of large language models. High-quality datasets are used to train models on realistic scenarios and enable the evaluation of the system on potentially unseen data. Standardized metrics facilitate comparisons between different ODQA systems, allowing researchers to objectively track advancements in the field. Our study presents a thorough examination of the current landscape of ODQA benchmarking by reviewing 52 datasets and 20 evaluation techniques across textual and multimodal modalities. We introduce a novel taxonomy for ODQA datasets that incorporates both the modality and difficulty of the question types. Additionally, we present a structured organization of ODQA evaluation metrics along with a critical analysis of their inherent trade-offs. Our study aims to empower researchers by providing a framework for the robust evaluation of modern question-answering systems. We conclude by identifying the current challenges and outlining promising avenues for future research and development.

Related papers

Anomaly Detection and Generation with Diffusion Models: A Survey [51.61574868316922]
Anomaly detection (AD) plays a pivotal role across diverse domains, including cybersecurity, finance, healthcare, and industrial manufacturing.<n>Recent advancements in deep learning, specifically diffusion models (DMs), have sparked significant interest.<n>This survey aims to guide researchers and practitioners in leveraging DMs for innovative AD solutions across diverse applications.
arXiv Detail & Related papers (2025-06-11T03:29:18Z)
Teaching Smaller Language Models To Generalise To Unseen Compositional Questions (Full Thesis) [0.0]
We train our models to answer diverse questions by instilling an ability to reason over a retrieved context. We acquire context from two knowledge sources; a Wikipedia corpus queried using a multi-hop dense retrieval system with novel extensions, and from rationales generated from a larger Language Model optimised to run in a lower resource environment.
arXiv Detail & Related papers (2024-11-25T23:25:34Z)
Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored. We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
A Survey on Multimodal Benchmarks: In the Era of Large AI Models [13.299775710527962]
Multimodal Large Language Models (MLLMs) have brought substantial advancements in artificial intelligence. This survey systematically reviews 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application.
arXiv Detail & Related papers (2024-09-21T15:22:26Z)
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers [43.18330795060871]
SPIQA is a dataset specifically designed to interpret complex figures and tables within the context of scientific research articles. We employ automatic and manual curation to create the dataset. SPIQA comprises 270K questions divided into training, validation, and three different evaluation splits.
arXiv Detail & Related papers (2024-07-12T16:37:59Z)
Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models [36.22392593103493]
Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets. Existing surveys overlook an in-depth exploration of the fine-tuning phase. We introduce a novel three-stage scheme - comprising feature extraction, criteria design, and selector evaluation - to systematically categorize and evaluate these methods.
arXiv Detail & Related papers (2024-06-20T08:58:58Z)
A Comprehensive Survey on Underwater Image Enhancement Based on Deep Learning [51.7818820745221]
Underwater image enhancement (UIE) presents a significant challenge within computer vision research. Despite the development of numerous UIE algorithms, a thorough and systematic review is still absent.
arXiv Detail & Related papers (2024-05-30T04:46:40Z)
Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs) We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing. We then unify the literature by proposing three intuitive, two for bias evaluation, and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework [51.44863255495668]
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence. We present Multi-Modal Reasoning(COCO-MMR) dataset, a novel dataset that encompasses an extensive collection of open-ended questions. We propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders.
arXiv Detail & Related papers (2023-07-24T08:58:25Z)
Benchmarking Foundation Models with Language-Model-as-an-Examiner [47.345760054595246]
We propose a novel benchmarking framework, Language-Model-as-an-Examiner. The LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner.
arXiv Detail & Related papers (2023-06-07T06:29:58Z)
Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs) We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date. We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, author, and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)
Conversational Question Answering: A Survey [18.447856993867788]
This survey is an effort to present a comprehensive review of the state-of-the-art research trends of Conversational Question Answering (CQA) Our findings show that there has been a trend shift from single-turn to multi-turn QA which empowers the field of Conversational AI from different perspectives.
arXiv Detail & Related papers (2021-06-02T01:06:34Z)
Retrieving and Reading: A Comprehensive Survey on Open-domain Question Answering [62.88322725956294]
We review the latest research trends in OpenQA, with particular attention to systems that incorporate neural MRC techniques. We introduce modern OpenQA architecture named Retriever-Reader'' and analyze the various systems that follow this architecture. We then discuss key challenges to developing OpenQA systems and offer an analysis of benchmarks that are commonly used.
arXiv Detail & Related papers (2021-01-04T04:47:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.