Related papers: AMAQA: A Metadata-based QA Dataset for RAG Systems

AMAQA: A Metadata-based QA Dataset for RAG Systems

URL: http://arxiv.org/abs/2505.13557v1
Date: Mon, 19 May 2025 08:59:08 GMT
Title: AMAQA: A Metadata-based QA Dataset for RAG Systems
Authors: Davide Bruni, Marco Avvenuti, Nicola Tonellotto, Maurizio Tesconi,
Abstract summary: We present AMAQA, a new open-access QA dataset designed to evaluate tasks combining text and metadata.<n>AMAQA includes about 1.1 million English messages collected from 26 public Telegram groups.<n>We show that leveraging metadata boosts accuracy from 0.12 to 0.61, highlighting the value of structured context.
Score: 7.882922366782987
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieval-augmented generation (RAG) systems are widely used in question-answering (QA) tasks, but current benchmarks lack metadata integration, hindering evaluation in scenarios requiring both textual data and external information. To address this, we present AMAQA, a new open-access QA dataset designed to evaluate tasks combining text and metadata. The integration of metadata is especially important in fields that require rapid analysis of large volumes of data, such as cybersecurity and intelligence, where timely access to relevant information is critical. AMAQA includes about 1.1 million English messages collected from 26 public Telegram groups, enriched with metadata such as timestamps, topics, emotional tones, and toxicity indicators, which enable precise and contextualized queries by filtering documents based on specific criteria. It also includes 450 high-quality QA pairs, making it a valuable resource for advancing research on metadata-driven QA and RAG systems. To the best of our knowledge, AMAQA is the first single-hop QA benchmark to incorporate metadata and labels such as topics covered in the messages. We conduct extensive tests on the benchmark, establishing a new standard for future research. We show that leveraging metadata boosts accuracy from 0.12 to 0.61, highlighting the value of structured context. Building on this, we explore several strategies to refine the LLM input by iterating over provided context and enriching it with noisy documents, achieving a further 3-point gain over the best baseline and a 14-point improvement over simple metadata filtering. The dataset is available at https://anonymous.4open.science/r/AMAQA-5D0D/

Related papers

PeerQA: A Scientific Question Answering Dataset from Peer Reviews [51.95579001315713]
We present PeerQA, a real-world, scientific, document-level Question Answering dataset.<n>The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP.<n>We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks.
arXiv Detail & Related papers (2025-02-19T12:24:46Z)
QuIM-RAG: Advancing Retrieval-Augmented Generation with Inverted Question Matching for Enhanced QA Performance [1.433758865948252]
This work presents a novel architecture for building Retrieval-Augmented Generation (RAG) systems.<n>RAG architecture is constructed to generate responses from the target document.<n>We introduce QuIM-RAG, a novel approach for the retrieval mechanism in our system.
arXiv Detail & Related papers (2025-01-06T01:07:59Z)
AmazonQAC: A Large-Scale, Naturalistic Query Autocomplete Dataset [14.544120039123934]
We introduce AmazonQAC, a new QAC dataset sourced from Amazon Search logs, comprising 395M samples. The dataset includes actual sequences of user-typed prefixes leading to final search terms, as well as session IDs and timestamps. We assess Prefix Trees, semantic retrieval, and Large Language Models (LLMs) with and without finetuning.
arXiv Detail & Related papers (2024-10-22T21:11:34Z)
Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
LMGQS: A Large-scale Dataset for Query-focused Summarization [77.6179359525065]
We convert four generic summarization benchmarks into a new QFS benchmark dataset, LMGQS. We establish baselines with state-of-the-art summarization models. We achieve state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks.
arXiv Detail & Related papers (2023-05-22T14:53:45Z)
Self-Prompting Large Language Models for Zero-Shot Open-Domain QA [67.08732962244301]
Open-Domain Question Answering (ODQA) aims to answer questions without explicitly providing background documents. This task becomes notably challenging in a zero-shot setting where no data is available to train tailored retrieval-reader models. We propose a Self-Prompting framework to explicitly utilize the massive knowledge encoded in the parameters of Large Language Models.
arXiv Detail & Related papers (2022-12-16T18:23:43Z)
Detect, Retrieve, Comprehend: A Flexible Framework for Zero-Shot Document-Level Question Answering [6.224211330728391]
Researchers produce thousands of scholarly documents containing valuable technical knowledge. Document-level question answering (QA) offers a flexible framework where human-posed questions can be adapted to extract diverse knowledge. We present a three-stage document QA approach: text extraction from PDF; evidence retrieval from extracted texts to form well-posed contexts; and QA to extract knowledge from contexts to return high-quality answers.
arXiv Detail & Related papers (2022-10-04T23:33:52Z)
Intermediate Training on Question Answering Datasets Improves Generative Data Augmentation [32.83012699501051]
We improve generative data augmentation by formulating the data generation as context generation task. We cast downstream tasks into question answering format and adapt the fine-tuned context generators to the target task domain. We demonstrate substantial improvements in performance in few-shot, zero-shot settings.
arXiv Detail & Related papers (2022-05-25T09:28:21Z)
Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies [74.01792675564218]
We develop a data augmentation framework based on ensembling retriever models that captures relevant text segments from unlabeled policy documents. To improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%.
arXiv Detail & Related papers (2022-04-19T15:45:23Z)
HeteroQA: Learning towards Question-and-Answering through Multiple Information Sources via Heterogeneous Graph Modeling [50.39787601462344]
Community Question Answering (CQA) is a well-defined task that can be used in many scenarios, such as E-Commerce and online user community for special interests. Most of the CQA methods only incorporate articles or Wikipedia to extract knowledge and answer the user's question. We propose a question-aware heterogeneous graph transformer to incorporate the multiple information sources (MIS) in the user community to automatically generate the answer.
arXiv Detail & Related papers (2021-12-27T10:16:43Z)
More Than Reading Comprehension: A Survey on Datasets and Metrics of Textual Question Answering [7.776227936353711]
Textual Question Answering (QA) aims to provide precise answers to user's questions in natural language using unstructured data. In this paper, we survey 47 recent textual QA benchmark datasets and propose a new taxonomy from an application point of view.
arXiv Detail & Related papers (2021-09-25T02:36:53Z)
Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts. Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.