Decoding the Past: Explainable Machine Learning Models for Dating Historical Texts
- URL: http://arxiv.org/abs/2511.23056v1
- Date: Fri, 28 Nov 2025 10:27:48 GMT
- Title: Decoding the Past: Explainable Machine Learning Models for Dating Historical Texts
- Authors: Paulo J. N. Pinto, Armando J. Pinho, Diogo Pratas
- Abstract summary: This article addresses temporal text classification using interpretable, feature-engineered tree-based machine learning models. We integrate five feature categories - compression-based, lexical structure, readability, neologism detection, and distance features - to predict the temporal origin of English texts spanning five centuries. On a large-scale corpus, we achieve 76.7% accuracy for century-scale prediction and 26.1% for decade-scale classification, substantially above random baselines.
- Score: 0.08749675983608168
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Accurately dating historical texts is essential for organizing and interpreting cultural heritage collections. This article addresses temporal text classification using interpretable, feature-engineered tree-based machine learning models. We integrate five feature categories - compression-based, lexical structure, readability, neologism detection, and distance features - to predict the temporal origin of English texts spanning five centuries. Comparative analysis shows that these feature domains provide complementary temporal signals, with combined models outperforming any individual feature set. On a large-scale corpus, we achieve 76.7% accuracy for century-scale prediction and 26.1% for decade-scale classification, substantially above random baselines (20% and 2.3%). Under relaxed temporal precision, performance increases to 96.0% top-2 accuracy for centuries and 85.8% top-10 accuracy for decades. The final model exhibits strong ranking capabilities with AUCROC up to 94.8% and AUPRC up to 83.3%, and maintains controlled errors with mean absolute deviations of 27 years and 30 years, respectively. For authentication-style tasks, binary models around key thresholds (e.g., 1850-1900) reach 85-98% accuracy. Feature importance analysis identifies distance features and lexical structure as most informative, with compression-based features providing complementary signals. SHAP explainability reveals systematic linguistic evolution patterns, with the 19th century emerging as a pivot point across feature domains. Cross-dataset evaluation on Project Gutenberg highlights domain adaptation challenges, with accuracy dropping by 26.4 percentage points, yet the computational efficiency and interpretability of tree-based models still offer a scalable, explainable alternative to neural architectures.
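The compression-based feature family described in the abstract can be illustrated with a minimal sketch. Note that the paper's exact compressors and feature definitions are not specified here; the use of `zlib` and the `compression_ratio` helper below are illustrative assumptions, not the authors' implementation.

```python
import zlib


def compression_ratio(text: str, level: int = 9) -> float:
    """Compressed-to-original size ratio: a crude proxy for how
    repetitive or predictable a text's character statistics are.
    (Illustrative stand-in for the paper's compression-based features.)"""
    data = text.encode("utf-8")
    return len(zlib.compress(data, level)) / len(data)


# Highly repetitive text compresses far better than varied prose,
# so its ratio is lower.
repetitive = "thee and thou " * 200
varied = ("Accurately dating historical texts is essential for "
          "organizing and interpreting cultural heritage collections.")
assert compression_ratio(repetitive) < compression_ratio(varied)
```

In a feature-engineered pipeline like the one described, such ratios (possibly from several compressors) would be concatenated with the lexical, readability, neologism, and distance features before being fed to a tree-based classifier.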
Related papers
- Large language models for automated PRISMA 2020 adherence checking [0.01588808390680495]
We constructed a copyright-aware benchmark of 108 Creative Commons-licensed systematic reviews. We evaluated ten large language models (LLMs) across five input formats.
arXiv Detail & Related papers (2025-11-20T02:08:13Z) - Balanced Multi-Task Attention for Satellite Image Classification: A Systematic Approach to Achieving 97.23% Accuracy on EuroSAT Without Pre-Training [0.0]
This work presents a systematic investigation of custom convolutional neural network architectures for satellite land use classification. We achieve 97.23% test accuracy on the EuroSAT dataset without reliance on pre-trained models. Our approach achieves performance within 1.34% of fine-tuned ResNet-50 (98.57%) while requiring no external data.
arXiv Detail & Related papers (2025-10-17T10:59:24Z) - Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures [87.75098311090642]
Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres.
arXiv Detail & Related papers (2025-10-16T12:23:13Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight the reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - Instruct-Tuning Pretrained Causal Language Models for Ancient Greek Papyrology and Epigraphy [0.0]
This article presents an experiment in fine-tuning a pretrained causal language model to restore missing or illegible characters in ancient Greek inscriptions and documentary papyri.
Benchmarked against the state-of-the-art model (Ithaca), the instruction-tuned models excelled in text restoration.
Results suggest that fine-tuning larger pretrained causal language models using instruction templates for emendations and conjectures holds promise.
arXiv Detail & Related papers (2024-09-20T19:49:45Z) - Text Sentiment Analysis and Classification Based on Bidirectional Gated Recurrent Units (GRUs) Model [6.096738978232722]
This paper explores the importance of text sentiment analysis and classification in the field of natural language processing.
It proposes a new approach to sentiment analysis and classification based on the bidirectional gated recurrent units (GRUs) model.
arXiv Detail & Related papers (2024-04-26T02:40:03Z) - Text Classification via Large Language Models [63.1874290788797]
We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification.
Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks.
More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups.
arXiv Detail & Related papers (2023-05-15T06:24:45Z) - Retrieval-based Disentangled Representation Learning with Natural Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish intrinsic dimensions that capture characteristics within data through their natural language counterparts, thus achieving disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z) - Learning to Decompose Visual Features with Latent Textual Prompts [140.2117637223449]
We propose Decomposed Feature Prompting (DeFo) to improve vision-language models.
Our empirical study shows DeFo's significance in improving the vision-language models.
arXiv Detail & Related papers (2022-10-09T15:40:13Z) - Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing [52.24507547010127]
Cross-domain context-dependent semantic parsing is a new focus of research.
We present a dynamic graph framework that effectively models contextual utterances, tokens, database schemas, and their complicated interactions as the conversation proceeds.
The proposed framework outperforms all existing models by large margins, achieving new state-of-the-art performance on two large-scale benchmarks.
arXiv Detail & Related papers (2021-01-05T18:11:29Z) - Scoring Graspability based on Grasp Regression for Better Grasp Prediction [2.835565391455372]
Current state-of-the-art methods rely on deep neural networks trained to jointly predict a graspability score together with a regression of an offset with respect to grasp reference parameters.
In this paper, we extend a state-of-the-art neural network with a scorer that evaluates the graspability of a given position, and introduce a novel loss function which correlates regression of grasp parameters with graspability score.
arXiv Detail & Related papers (2020-02-03T16:40:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.