A ripple in time: a discontinuity in American history
- URL: http://arxiv.org/abs/2312.01185v4
- Date: Sat, 4 May 2024 09:15:43 GMT
- Title: A ripple in time: a discontinuity in American history
- Authors: Alexander Kolpakov, Igor Rivin
- Abstract summary: In this note we use the State of the Union Address dataset from Kaggle to make some surprising observations.
Our main approach is using vector embeddings, such as BERT (DistilBERT) and GPT-2.
In our case, no model fine-tuning is required, and the pre-trained out-of-the-box GPT-2 model is enough.
- Score: 49.84018914962972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this note we use the State of the Union Address (SOTU) dataset from Kaggle to make some surprising (and some not so surprising) observations pertaining to the general timeline of American history, and to the character and nature of the addresses themselves. Our main approach is using vector embeddings, such as BERT (DistilBERT) and GPT-2. While it is widely believed that BERT (and its variations) is most suitable for NLP classification tasks, we find that GPT-2, in conjunction with nonlinear dimension reduction methods such as UMAP, provides better separation and stronger clustering. This makes GPT-2 + UMAP an interesting alternative. In our case, no model fine-tuning is required, and the pre-trained out-of-the-box GPT-2 model is enough. We also used a fine-tuned DistilBERT model to classify which president delivered which address, with very good results (accuracy of 93%-95%, depending on the run). An analogous task was performed to determine the year of writing, which we were able to pin down to within about 4 years (a single presidential term). It is worth noting that SOTU addresses provide relatively small writing samples (about 8,000 words on average, varying widely from under 2,000 words to more than 20,000), and that the number of authors is relatively large (we used the SOTU addresses of 42 US presidents). This shows that the techniques employed are rather efficient, while all the computations described in this note can be performed on a single GPU instance of Google Colab. The accompanying code is available on GitHub.
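As a rough illustration of the embedding pipeline the abstract describes, here is a minimal sketch (assumptions, not the authors' code: the file name `sotu.csv`, the column name `text`, mean-pooling over token states, and the UMAP settings are all hypothetical) of computing out-of-the-box GPT-2 document vectors and projecting them with UMAP:

```python
# Minimal sketch: pre-trained GPT-2 document embeddings + UMAP projection.
# Assumed input: a CSV with one SOTU address per row in a "text" column
# (the actual Kaggle dataset layout may differ).
import pandas as pd
import torch
import umap  # pip install umap-learn
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

def embed(text: str) -> torch.Tensor:
    # GPT-2 has a 1024-token context window, so longer addresses are truncated.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    # Mean-pool the token states into a single 768-dim document vector.
    return hidden.mean(dim=1).squeeze(0)

df = pd.read_csv("sotu.csv")  # hypothetical file name
vectors = torch.stack([embed(t) for t in df["text"]]).numpy()

# Nonlinear dimension reduction to 2-D for visual cluster inspection.
proj = umap.UMAP(n_components=2, metric="cosine").fit_transform(vectors)
```

Coloring the resulting 2-D projection by year or by president would then make any separation visible; the fine-tuned DistilBERT classifiers mentioned above would be a separate supervised step on top of the same dataset.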
Related papers
- Hypergraph Enhanced Knowledge Tree Prompt Learning for Next-Basket Recommendation [50.55786122323965]
Next-basket recommendation (NBR) aims to infer the items in the next basket given the corresponding basket sequence.
HEKP4NBR transforms the knowledge graph (KG) into prompts, namely the Knowledge Tree Prompt (KTP), to help the PLM encode the Out-Of-Vocabulary (OOV) item IDs.
A hypergraph convolutional module is designed to build a hypergraph based on item similarities measured by an MoE model from multiple aspects.
arXiv Detail & Related papers (2023-12-26T02:12:21Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize *multiple references* to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - Boosting classification reliability of NLP transformer models in the long run [0.0]
This paper compares different approaches to fine-tune a BERT model for a long-running classification task.
Our corpus contains over 8 million comments on COVID-19 vaccination in Hungary posted between September 2020 and December 2021.
arXiv Detail & Related papers (2023-02-20T14:46:54Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - Trustable Co-label Learning from Multiple Noisy Annotators [68.59187658490804]
Supervised deep learning depends on massive accurately annotated examples.
A typical alternative is learning from multiple noisy annotators.
This paper proposes a data-efficient approach, called *Trustable Co-label Learning* (TCL).
arXiv Detail & Related papers (2022-03-08T16:57:00Z) - Offensive Language Detection with BERT-based models, By Customizing Attention Probabilities [0.0]
We suggest a methodology to enhance the performance of BERT-based models on the 'Offensive Language Detection' task.
We customize attention probabilities by changing the 'Attention Mask' input to create more efficacious word embeddings.
The most improvement was 2% and 10% for English and Persian languages, respectively.
arXiv Detail & Related papers (2021-10-11T10:23:44Z) - Automatic Face Understanding: Recognizing Families in Photos [6.131589026706621]
We build the largest database for kinship recognition.
Video dynamics, audio, and text captions can be used in the decision making of kinship recognition systems.
arXiv Detail & Related papers (2021-01-10T22:37:25Z) - Improving Semi-supervised Federated Learning by Reducing the Gradient Diversity of Models [67.66144604972052]
Federated learning (FL) is a promising way to use the computing power of mobile devices while maintaining privacy of users.
We show that a critical issue that affects the test accuracy is the large gradient diversity of the models from different users.
We propose a novel grouping-based model averaging method to replace the FedAvg averaging method.
arXiv Detail & Related papers (2020-08-26T03:36:07Z) - Deep Contextual Embeddings for Address Classification in E-commerce [0.03222802562733786]
E-commerce customers in developing nations like India tend to follow no fixed format while entering shipping addresses.
It is imperative to understand the language of addresses, so that shipments can be routed without delays.
We propose a novel approach towards understanding customer addresses by deriving motivation from recent advances in Natural Language Processing (NLP).
arXiv Detail & Related papers (2020-07-06T19:06:34Z) - G2MF-WA: Geometric Multi-Model Fitting with Weakly Annotated Data [15.499276649167975]
In weak annotation, most of the manual annotations are assumed to be correct, yet they are inevitably mixed with incorrect ones.
We propose a novel method to make full use of the WA data to boost the multi-model fitting performance.
arXiv Detail & Related papers (2020-01-20T04:22:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.