American Stories: A Large-Scale Structured Text Dataset of Historical
U.S. Newspapers
- URL: http://arxiv.org/abs/2308.12477v1
- Date: Thu, 24 Aug 2023 00:24:42 GMT
- Title: American Stories: A Large-Scale Structured Text Dataset of Historical
U.S. Newspapers
- Authors: Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora,
Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring
- Abstract summary: This study develops a novel deep learning pipeline for extracting full article texts from newspaper images.
The pipeline is applied to the nearly 20 million scans in the Library of Congress's public domain Chronicling America collection.
The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes.
- Score: 7.161822501147275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing full text datasets of U.S. public domain newspapers do not recognize
the often complex layouts of newspaper scans, and as a result the digitized
content scrambles texts from articles, headlines, captions, advertisements, and
other layout regions. OCR quality can also be low. This study develops a novel
deep learning pipeline for extracting full article texts from newspaper images
and applies it to the nearly 20 million scans in the Library of Congress's public
domain Chronicling America collection. The pipeline includes layout detection,
legibility classification, custom OCR, and association of article texts
spanning multiple bounding boxes. To achieve high scalability, it is built with
efficient architectures designed for mobile phones. The resulting American
Stories dataset provides high quality data that could be used for pre-training
a large language model to achieve better understanding of historical English
and historical world knowledge. The dataset could also be added to the external
database of a retrieval-augmented language model to make historical information
- ranging from interpretations of political events to minutiae about the lives
of people's ancestors - more widely accessible. Furthermore, structured article
texts facilitate using transformer-based methods for popular social science
applications like topic classification, detection of reproduced content, and
news story clustering. Finally, American Stories provides a massive silver
quality dataset for innovating multimodal layout analysis models and other
multimodal applications.
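The four pipeline stages named in the abstract (layout detection, legibility classification, custom OCR, and association of article texts across bounding boxes) can be sketched as a minimal, hypothetical Python skeleton. All names, data structures, and the column-ordering heuristic below are illustrative assumptions, not the paper's actual code:

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A detected layout region on a newspaper scan (hypothetical structure)."""
    kind: str        # e.g. "article", "headline", "caption", "ad"
    bbox: tuple      # (x0, y0, x1, y1) in page coordinates
    text: str = ""
    legible: bool = True


def detect_layout(scan):
    """Stage 1: layout detection (stand-in for the paper's efficient detector)."""
    # A real implementation would run an object-detection model over the scan.
    return [Region("headline", (0, 0, 100, 10)),
            Region("article", (0, 10, 50, 90)),
            Region("article", (50, 10, 100, 90))]


def classify_legibility(region):
    """Stage 2: keep only regions whose text is legible enough to OCR."""
    return region.legible


def run_ocr(region):
    """Stage 3: OCR over a single bounding box (placeholder output)."""
    region.text = f"<ocr text for {region.kind} at {region.bbox}>"
    return region


def associate_articles(regions):
    """Stage 4: join article text spanning multiple bounding boxes.

    This toy rule reads article boxes in column order (left-to-right,
    then top-to-bottom); the paper's association step is learned."""
    articles = sorted((r for r in regions if r.kind == "article"),
                      key=lambda r: (r.bbox[0], r.bbox[1]))
    return " ".join(r.text for r in articles)


def extract_articles(scan):
    """Run all four stages on one scan and return the associated article text."""
    regions = [run_ocr(r) for r in detect_layout(scan)
               if classify_legibility(r)]
    return associate_articles(regions)
```

Headlines and illegible regions are filtered out before association, mirroring how the abstract describes keeping article text separate from other layout regions.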
Related papers
- A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts [8.405938712823563]
This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents.
The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian.
It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era.
arXiv Detail & Related papers (2024-07-21T12:14:45Z)
- Newswire: A Large-Scale Structured Database of a Century of Historical News [3.562368079040469]
Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world.
We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers.
The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977.
arXiv Detail & Related papers (2024-06-13T16:20:05Z)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger than comparable datasets while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide.
It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
arXiv Detail & Related papers (2024-03-28T12:08:39Z)
- A New Korean Text Classification Benchmark for Recognizing the Political Intents in Online Newspapers [6.633601941627045]
We present a novel Korean text classification dataset.
It contains 12,000 news articles that may carry political intent, drawn from the politics sections of six of the most representative newspaper organizations in South Korea.
To the best of our knowledge, this is the largest-scale Korean news dataset containing long texts and addressing multi-task classification problems.
arXiv Detail & Related papers (2023-11-03T04:59:55Z)
- LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization [19.301567079372436]
Text Summarization is a popular task and an active area of research for the Natural Language Processing community.
All publicly available summarization datasets only provide plain text content.
We present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information.
arXiv Detail & Related papers (2023-01-26T18:50:54Z)
- Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection [83.3580786484122]
We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality.
We argue that privileging any corpus as high quality entails a language ideology.
arXiv Detail & Related papers (2022-01-25T17:20:04Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
- The Newspaper Navigator Dataset: Extracting And Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America [10.446473806802578]
We introduce a visual content recognition model trained on bounding box annotations of photographs, illustrations, maps, comics, and editorial cartoons.
We describe our pipeline that utilizes this deep learning model to extract 7 classes of visual content.
We report the results of running the pipeline on 16.3 million pages from the Chronicling America corpus.
arXiv Detail & Related papers (2020-05-04T15:51:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.