The Newspaper Navigator Dataset: Extracting And Analyzing Visual Content
from 16 Million Historic Newspaper Pages in Chronicling America
- URL: http://arxiv.org/abs/2005.01583v1
- Date: Mon, 4 May 2020 15:51:13 GMT
- Title: The Newspaper Navigator Dataset: Extracting And Analyzing Visual Content
from 16 Million Historic Newspaper Pages in Chronicling America
- Authors: Benjamin Charles Germain Lee, Jaime Mears, Eileen Jakeway, Meghan
Ferriter, Chris Adams, Nathan Yarasavage, Deborah Thomas, Kate Zwaard, Daniel
S. Weld
- Abstract summary: We introduce a visual content recognition model trained on bounding box annotations of photographs, illustrations, maps, comics, and editorial cartoons.
We describe our pipeline that utilizes this deep learning model to extract 7 classes of visual content.
We report the results of running the pipeline on 16.3 million pages from the Chronicling America corpus.
- Score: 10.446473806802578
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Chronicling America is a product of the National Digital Newspaper Program, a
partnership between the Library of Congress and the National Endowment for the
Humanities to digitize historic newspapers. Over 16 million pages of historic
American newspapers have been digitized for Chronicling America to date,
complete with high-resolution images and machine-readable METS/ALTO OCR. Of
considerable interest to Chronicling America users is a semantified corpus,
complete with extracted visual content and headlines. To accomplish this, we
introduce a visual content recognition model trained on bounding box
annotations of photographs, illustrations, maps, comics, and editorial cartoons
collected as part of the Library of Congress's Beyond Words crowdsourcing
initiative and augmented with additional annotations including those of
headlines and advertisements. We describe our pipeline that utilizes this deep
learning model to extract 7 classes of visual content: headlines, photographs,
illustrations, maps, comics, editorial cartoons, and advertisements, complete
with textual content such as captions derived from the METS/ALTO OCR, as well
as image embeddings for fast image similarity querying. We report the results
of running the pipeline on 16.3 million pages from the Chronicling America
corpus and describe the resulting Newspaper Navigator dataset, the largest
dataset of extracted visual content from historic newspapers ever produced. The
Newspaper Navigator dataset, finetuned visual content recognition model, and
all source code are placed in the public domain for unrestricted re-use.
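
To make the similarity-querying workflow mentioned in the abstract concrete, the sketch below shows what nearest-neighbour search over embeddings of extracted visual-content crops might look like. The file paths, the use of an off-the-shelf ResNet-18 backbone, and the crop layout are illustrative assumptions for this sketch, not the authors' released implementation (the dataset itself ships precomputed embeddings).

```python
# Minimal sketch of embedding-based similarity search over extracted
# newspaper visual content. Paths, the embedding backbone, and the file
# layout are illustrative assumptions, not the Newspaper Navigator code.
import numpy as np
from PIL import Image
import torch
import torchvision.transforms as T
from torchvision.models import resnet18

# Build a generic image-embedding function from a pretrained ResNet-18;
# dropping the classification head keeps the 512-d pooled features.
model = resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()
model.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> np.ndarray:
    """Return an L2-normalized embedding for one cropped visual-content image."""
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        vec = model(preprocess(img).unsqueeze(0)).squeeze(0).numpy()
    return vec / np.linalg.norm(vec)

# Hypothetical crops extracted by the pipeline (photographs, maps, etc.).
crop_paths = ["crops/photo_0001.jpg", "crops/map_0042.jpg", "crops/cartoon_0007.jpg"]
index = np.stack([embed(p) for p in crop_paths])   # (N, 512) matrix

# On normalized vectors, cosine similarity is a dot product, so a
# nearest-neighbour query is a single matrix-vector product.
query = embed("crops/photo_0001.jpg")
scores = index @ query
for path, score in sorted(zip(crop_paths, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```
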
Related papers
- Temporal Image Caption Retrieval Competition -- Description and Results [0.9999629695552195]
This paper addresses the multimodal challenge of Text-Image retrieval and introduces a novel task that extends the modalities to include temporal data.
The Temporal Image Caption Retrieval Competition (TICRC) presented in this paper is based on the Chronicling America and Challenging America projects, which offer access to an extensive collection of digitized historic American newspapers spanning 274 years.
arXiv Detail & Related papers (2024-10-08T19:45:53Z)
- 3DLNews: A Three-decade Dataset of US Local News Articles [49.1574468325115]
3DLNews is a novel dataset with local news articles from the United States spanning the period from 1996 to 2024.
It contains almost 1 million URLs (with HTML text) from over 14,000 local newspapers, TV, and radio stations across all 50 states.
arXiv Detail & Related papers (2024-08-08T18:33:37Z)
- Newswire: A Large-Scale Structured Database of a Century of Historical News [3.562368079040469]
Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world.
We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers.
The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977.
arXiv Detail & Related papers (2024-06-13T16:20:05Z)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
It is 15 times larger than comparable image-text interleaved datasets while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers [7.161822501147275]
This study develops a novel deep learning pipeline for extracting full article texts from newspaper images.
It applies the pipeline to the nearly 20 million scans in the Library of Congress's public domain Chronicling America collection.
The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes.
arXiv Detail & Related papers (2023-08-24T00:24:42Z)
- OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents [122.55393759474181]
We introduce OBELICS, an open web-scale filtered dataset of interleaved image-text documents.
We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content.
We train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks.
arXiv Detail & Related papers (2023-06-21T14:01:01Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- DocBed: A Multi-Stage OCR Solution for Documents with Complex Layouts [2.885058600042882]
This work releases a dataset of 3000 fully-annotated, real-world newspaper images from 21 different U.S. states.
It proposes layout segmentation as a precursor to existing optical character recognition (OCR) engines.
It provides a thorough and structured evaluation protocol for isolated layout segmentation and end-to-end OCR.
arXiv Detail & Related papers (2022-02-03T05:21:31Z)
- Navigating the Mise-en-Page: Interpretive Machine Learning Approaches to the Visual Layouts of Multi-Ethnic Periodicals [0.19116784879310028]
Our method combines Chronicling America's MARC data and the Newspaper Navigator machine learning dataset to identify the visual patterns of newspaper page layouts.
By analyzing high-dimensional visual similarity, we aim to better understand how editors spoke and protested through the layout of their papers.
arXiv Detail & Related papers (2021-09-03T21:10:38Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
- Transform and Tell: Entity-Aware News Image Captioning [77.4898875082832]
We propose an end-to-end model which generates captions for images embedded in news articles.
We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism.
We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts.
arXiv Detail & Related papers (2020-04-17T05:44:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.