Related papers: ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

URL: http://arxiv.org/abs/2403.17859v2
Date: Fri, 10 May 2024 17:15:24 GMT
Title: ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages
Authors: Bhawna Piryani, Jamshid Mozafari, Adam Jatowt,
Abstract summary: We introduce ChroniclingAmericaQA, a large-scale temporal QA dataset with 487K question-answer pairs. Our dataset is constructed from a subset of the Chronicling America newspaper collection spanning 120 years.
Score: 16.23717285493886
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Question answering (QA) and Machine Reading Comprehension (MRC) tasks have significantly advanced in recent years due to the rapid development of deep learning techniques and, more recently, large language models. At the same time, many benchmark datasets have become available for QA and MRC tasks. However, most existing large-scale benchmark datasets have been created predominantly using synchronous document collections like Wikipedia or the Web. Archival document collections, such as historical newspapers, contain valuable information from the past that is still not widely used to train large language models. To further contribute to advancing QA and MRC tasks and to overcome the limitation of previous datasets, we introduce ChroniclingAmericaQA, a large-scale temporal QA dataset with 487K question-answer pairs created based on the historical newspaper collection Chronicling America. Our dataset is constructed from a subset of the Chronicling America newspaper collection spanning 120 years. One of the significant challenges for utilizing digitized historical newspaper collections is the low quality of OCR text. Therefore, to enable realistic testing of QA models, our dataset can be used in three different ways: answering questions from raw and noisy content, answering questions from cleaner, corrected version of the content, as well as answering questions from scanned images of newspaper pages. This and the fact that ChroniclingAmericaQA spans the longest time period among available QA datasets make it quite a unique and useful resource.

Related papers

QuAnTS: Question Answering on Time Series [50.91478742616324]
We propose a novel time series QA dataset, QuAnTS, for Question Answering on Time Series data.<n>We pose a wide variety of questions and answers about human motion in the form of tracked skeleton trajectories.<n>We verify that the large-scale QuAnTS dataset is well-formed and comprehensive through extensive experiments.
arXiv Detail & Related papers (2025-11-07T10:07:03Z)
Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement [55.2439260314328]
Time Series Multi-Task Question Answering (Time-MQA) is a unified framework that enables natural language queries across multiple time series tasks. Central to Time-MQA is the TSQA dataset, a large-scale dataset containing $sim $200k question-answer pairs.
arXiv Detail & Related papers (2025-02-26T13:47:13Z)
PeerQA: A Scientific Question Answering Dataset from Peer Reviews [51.95579001315713]
We present PeerQA, a real-world, scientific, document-level Question Answering dataset. The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP. We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks.
arXiv Detail & Related papers (2025-02-19T12:24:46Z)
Automatic Question-Answer Generation for Long-Tail Knowledge [65.11554185687258]
We propose an automatic approach to generate specialized QA datasets for tail entities. We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets.
arXiv Detail & Related papers (2024-03-03T03:06:31Z)
NewsQs: Multi-Source Question Generation for the Inquiring Mind [59.79288644158271]
We present NewsQs, a dataset that provides question-answer pairs for multiple news documents. To create NewsQs, we augment a traditional multi-document summarization dataset with questions automatically generated by a T5-Large model fine-tuned on FAQ-style news articles.
arXiv Detail & Related papers (2024-02-28T16:59:35Z)
A Dataset of Open-Domain Question Answering with Multiple-Span Answers [11.291635421662338]
Multi-span answer extraction, also known as the task of multi-span question answering (MSQA), is critical for real-world applications. There is a notable lack of publicly available MSQA benchmark in Chinese. We present CLEAN, a comprehensive Chinese multi-span question answering dataset.
arXiv Detail & Related papers (2024-02-15T13:03:57Z)
QASnowball: An Iterative Bootstrapping Framework for High-Quality Question-Answering Data Generation [67.27999343730224]
We introduce an iterative bootstrapping framework for QA data augmentation (named QASnowball) QASnowball can iteratively generate large-scale high-quality QA data based on a seed set of supervised examples. We conduct experiments in the high-resource English scenario and the medium-resource Chinese scenario, and the experimental results show that the data generated by QASnowball can facilitate QA models.
arXiv Detail & Related papers (2023-09-19T05:20:36Z)
American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers [7.161822501147275]
This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images. It applies it to the nearly 20 million scans in Library of Congress's public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes.
arXiv Detail & Related papers (2023-08-24T00:24:42Z)
PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages. We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts. We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language. We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs. We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
A Chinese Multi-type Complex Questions Answering Dataset over Wikidata [45.31495982252219]
Complex Knowledge Base Question Answering is a popular area of research in the past decade. Recent public datasets have led to encouraging results in this field, but are mostly limited to English. Few state-of-the-art KBQA models are trained on Wikidata, one of the most popular real-world knowledge bases. We propose CLC-QuAD, the first large scale complex Chinese semantic parsing dataset over Wikidata to address these challenges.
arXiv Detail & Related papers (2021-11-11T07:39:16Z)
ArchivalQA: A Large-scale Benchmark Dataset for Open Domain Question Answering over Archival News Collections [20.07130742712862]
We present ArchivalQA, a large question answering dataset consisting of 1,067,056 question-answer pairs. We create four subparts of our dataset based on the question difficulty levels and the containment of temporal expressions.
arXiv Detail & Related papers (2021-09-08T05:21:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.