A Massive Scale Semantic Similarity Dataset of Historical English
- URL: http://arxiv.org/abs/2306.17810v2
- Date: Thu, 24 Aug 2023 01:22:36 GMT
- Title: A Massive Scale Semantic Similarity Dataset of Historical English
- Authors: Emily Silcock, Melissa Dell
- Abstract summary: This study utilizes a novel source, newly digitized articles from off-copyright, local U.S. newspapers, to assemble a massive-scale semantic similarity dataset spanning 70 years from 1920 to 1989.
We associate articles and their headlines by exploiting document layouts and language understanding. We then use deep neural methods to detect which articles are from the same underlying source, in the presence of substantial noise and abridgement.
The HEADLINES dataset is significantly larger than most existing semantic similarity datasets and covers a much longer span of time.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A diversity of tasks use language models trained on semantic similarity data.
While there are a variety of datasets that capture semantic similarity, they
are either constructed from modern web data or are relatively small datasets
created in the past decade by human annotators. This study utilizes a novel
source, newly digitized articles from off-copyright, local U.S. newspapers, to
assemble a massive-scale semantic similarity dataset spanning 70 years from
1920 to 1989 and containing nearly 400M positive semantic similarity pairs.
Historically, around half of articles in U.S. local newspapers came from
newswires like the Associated Press. While local papers reproduced articles
from the newswire, they wrote their own headlines, which form abstractive
summaries of the associated articles. We associate articles and their headlines
by exploiting document layouts and language understanding. We then use deep
neural methods to detect which articles are from the same underlying source, in
the presence of substantial noise and abridgement. The headlines of reproduced
articles form positive semantic similarity pairs. The resulting publicly
available HEADLINES dataset is significantly larger than most existing semantic
similarity datasets and covers a much longer span of time. It will facilitate
the application of contrastively trained semantic similarity models to a
variety of tasks, including the study of semantic change across space and time.
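The pairing idea above can be sketched in a toy example. The snippet below uses bag-of-words embeddings and cosine similarity purely for illustration; the actual HEADLINES pipeline uses neural methods, and the headline texts here are invented, not from the dataset:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; the paper uses a neural encoder instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two local papers' headlines for the same reproduced newswire article
# form a positive pair; a headline for an unrelated article is a negative.
positive = ("mayor signs new water treaty", "new water treaty signed by mayor")
negative = ("mayor signs new water treaty", "county fair opens next week")

sim_pos = cosine(embed(positive[0]), embed(positive[1]))
sim_neg = cosine(embed(negative[0]), embed(negative[1]))
assert sim_pos > sim_neg
```

A contrastively trained model is optimized so that positive pairs like these score higher than negatives, which is what makes the nearly 400M headline pairs useful as training signal.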
Related papers
- Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models [88.07940818022468]
We take an initial step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs).
We form "semantic tokens" by merging semantically similar subwords and their embeddings.
Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities.
arXiv Detail & Related papers (2024-11-07T08:38:32Z)
- News Deja Vu: Connecting Past and Present with Semantic Search [2.446672595462589]
News Deja Vu is a novel semantic search tool for historical news articles.
We show how it can be deployed to a massive-scale corpus of historical, open-source news articles.
arXiv Detail & Related papers (2024-06-21T18:50:57Z)
- Newswire: A Large-Scale Structured Database of a Century of Historical News [3.562368079040469]
Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world.
We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers.
The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977.
arXiv Detail & Related papers (2024-06-13T16:20:05Z)
- American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers [7.161822501147275]
This study develops a novel deep learning pipeline for extracting full article texts from newspaper images.
It applies the pipeline to the nearly 20 million scans in the Library of Congress's public domain Chronicling America collection.
The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes.
arXiv Detail & Related papers (2023-08-24T00:24:42Z)
- PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity [5.439505575097552]
Cross-lingual semantic similarity models often rely on machine translation because of the unavailability of cross-lingual semantic similarity datasets.
For Persian, a low-resource language, the need for a model that can understand context across both languages is especially acute.
In this article, a corpus of semantic similarity between Persian and English sentences is produced for the first time with the help of linguistic experts.
arXiv Detail & Related papers (2023-05-13T11:02:50Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Learning language variations in news corpora through differential embeddings [0.0]
We show that a model with a central word representation and a slice-dependent contribution can learn word embeddings from different corpora simultaneously.
We show that it can capture both temporal dynamics in the yearly slices of each corpus, and language variations between US and UK English in a curated multi-source corpus.
arXiv Detail & Related papers (2020-11-13T14:50:08Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)