Is text normalization relevant for classifying medieval charters?
- URL: http://arxiv.org/abs/2408.16446v1
- Date: Thu, 29 Aug 2024 11:19:57 GMT
- Title: Is text normalization relevant for classifying medieval charters?
- Authors: Florian Atzenhofer-Baumgartner, Tamás Kovács, et al.
- Abstract summary: This study examines the impact of historical text normalization on the classification of medieval charters.
Our results indicate that the given normalization minimally improves locating tasks but reduces accuracy for dating.
Results suggest a selective approach to historical text normalization, emphasizing the significance of preserving some textual characteristics.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This study examines the impact of historical text normalization on the classification of medieval charters, specifically focusing on document dating and locating. Using a data set of Middle High German charters from a digital archive, we evaluate various classifiers, including traditional and transformer-based models, with and without normalization. Our results indicate that the given normalization minimally improves locating tasks but reduces accuracy for dating, implying that original texts contain crucial features that normalization may obscure. We find that support vector machines and gradient boosting outperform other models, questioning the efficiency of transformers for this use case. Results suggest a selective approach to historical text normalization, emphasizing the significance of preserving some textual characteristics that are critical for classification tasks in document analysis.
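The comparison above can be made concrete with a small sketch: character n-gram TF-IDF features feed a support vector machine and a gradient-boosting classifier, once on original spellings and once on a stand-in normalization. The charter snippets, period labels, and `normalize` function below are invented placeholders, not the paper's corpus or pipeline.

```python
# Minimal sketch, with placeholder data: compare classifiers on original
# vs. crudely "normalized" charter text for a dating-style task.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical Middle High German snippets with century labels (dating task).
charters = [
    "Wir Ruodolf von gotes gnaden ...",
    "Ich Chunrat der schriber tuon kunt ...",
    "Wir Otte von gotes gnaden ...",
    "Ich Heinrich der richter tuon kunt ...",
]
labels = ["13th", "14th", "13th", "14th"]

def normalize(text: str) -> str:
    # Stand-in for historical normalization; real systems map spelling
    # variants to a canonical form and may erase dating-relevant cues.
    return text.lower().replace("uo", "u").replace("sch", "s")

for variant, texts in [("original", charters),
                       ("normalized", [normalize(t) for t in charters])]:
    for clf in (LinearSVC(), GradientBoostingClassifier()):
        # Character n-grams retain orthographic features such as scribal habits.
        pipe = make_pipeline(
            TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)), clf)
        pipe.fit(texts, labels)
        # Training accuracy only; a real study would use held-out charters.
        print(variant, type(clf).__name__, pipe.score(texts, labels))
```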
Related papers
- Historical German Text Normalization Using Type- and Token-Based Language Modeling [0.0]
This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus.
The proposed system takes a machine learning approach based on Transformer language models, combining an encoder-decoder model that normalizes individual word types with a pre-trained causal language model that adjusts these normalizations within their context.
An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger, fully end-to-end sentence-based normalization system built by fine-tuning a pre-trained Transformer large language model.
arXiv Detail & Related papers (2024-09-04T16:14:05Z)
- ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws [67.59263833387536]
ScalingFilter is a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data.
To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to measure the variety of semantic representations.
arXiv Detail & Related papers (2024-08-15T17:59:30Z)
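A minimal sketch of the perplexity-difference idea behind ScalingFilter: score a text by how much a larger language model's log-perplexity drops relative to a smaller one. Two off-the-shelf GPT-2 checkpoints stand in for the paper's pair of models trained on identical data; this illustrates the mechanism, not the authors' exact recipe.

```python
# Quality scoring via the perplexity gap between a small and a large LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_perplexity(model, tokenizer, text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # labels=input_ids yields mean token cross-entropy as .loss
        loss = model(**enc, labels=enc["input_ids"]).loss
    return loss.item()  # log-perplexity = mean negative log-likelihood

tok = AutoTokenizer.from_pretrained("gpt2")
small = AutoModelForCausalLM.from_pretrained("gpt2")
large = AutoModelForCausalLM.from_pretrained("gpt2-medium")

text = "The abbey's charter was issued in the year of our lord 1273."
# Higher score ~ the larger model explains the text much better.
score = log_perplexity(small, tok, text) - log_perplexity(large, tok, text)
print(f"quality score (log-ppl difference): {score:.3f}")
```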
- Adapting PromptORE for Modern History: Information Extraction from Hispanic Monarchy Documents of the XVIth Century [2.490441444378203]
We introduce an adaptation of PromptORE to extract relations from specialized documents, namely digital transcripts of trials from the Spanish Inquisition.
Our approach involves fine-tuning transformer models with their pretraining objective on the data on which they will later perform inference.
Our results show a substantial improvement in accuracy, up to 50%, with our Biased PromptORE models.
arXiv Detail & Related papers (2024-05-24T13:39:47Z)
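A hedged sketch of the adaptation step summarized above: continue a masked language model's pretraining objective on the very documents it will later analyze. The model name, placeholder texts, and hyperparameters are illustrative assumptions, not the Biased PromptORE configuration.

```python
# Domain-adaptive pretraining: run the MLM objective on the target documents.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Placeholder for the target-domain transcripts (e.g., trial records).
docs = Dataset.from_dict({"text": ["En el nombre de Dios ...", "..."]})
tokenized = docs.map(lambda ex: tokenizer(ex["text"], truncation=True),
                     remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapted", num_train_epochs=1),
    train_dataset=tokenized,
    # Randomly masks 15% of tokens, reproducing the MLM pretraining objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # the adapted encoder is then used for relation extraction
```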
- On the Efficacy of Sampling Adapters [82.5941326570812]
We propose a unified framework for understanding sampling adapters.
We argue that the shift they enforce can be viewed as a trade-off between precision and recall.
We find that several precision-emphasizing measures indeed indicate that sampling adapters can lead to probability distributions more aligned with the true distribution.
arXiv Detail & Related papers (2023-07-07T17:59:12Z)
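Nucleus (top-p) truncation is one concrete sampling adapter; the sketch below shows how it reshapes a next-token distribution, trading tail coverage (recall) for concentration on likely tokens (precision). The distribution is a toy example.

```python
# Nucleus (top-p) truncation as a sampling adapter over a distribution.
import torch

def nucleus_adapter(probs: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Zero out the tail of a distribution, keeping the smallest set of
    tokens whose cumulative probability reaches p, then renormalize."""
    sorted_probs, order = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep = cumulative - sorted_probs < p  # always keeps the top token
    truncated = torch.zeros_like(probs)
    truncated[order[keep]] = sorted_probs[keep]
    return truncated / truncated.sum()

probs = torch.tensor([0.5, 0.2, 0.15, 0.1, 0.05])
print(nucleus_adapter(probs, p=0.8))  # tail tokens get zero probability
```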
- HanoiT: Enhancing Context-aware Translation via Selective Context [95.93730812799798]
Context-aware neural machine translation aims to use the document-level context to improve translation quality.
Irrelevant or trivial words may introduce noise and distract the model from learning the relationship between the current sentence and its auxiliary context.
We propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context.
arXiv Detail & Related papers (2023-01-17T12:07:13Z)
- Transformers are Short Text Classifiers: A Study of Inductive Short Text Classifiers on Benchmarks and Real-world Datasets [2.9443230571766854]
Short text classification is a crucial and challenging aspect of Natural Language Processing.
Recent short text research has left state-of-the-art (SOTA) methods for traditional text classification largely unexploited.
Our experiments unambiguously demonstrate that Transformers achieve SOTA accuracy on short text classification tasks.
arXiv Detail & Related papers (2022-11-30T10:25:24Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
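A simplified sketch in the spirit of SMART: sentences are the matching units, each candidate sentence is soft-matched to its best reference sentence and vice versa, and the two directions combine into an F-measure. The token-overlap similarity here is a crude stand-in for the paper's matching functions.

```python
# Sentence-level soft matching: best-match precision/recall over sentences.
def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def smart_like_score(candidate: list[str], reference: list[str]) -> float:
    # Precision: how well each candidate sentence is covered by the reference.
    precision = sum(max(similarity(c, r) for r in reference)
                    for c in candidate) / len(candidate)
    # Recall: how well each reference sentence is covered by the candidate.
    recall = sum(max(similarity(r, c) for c in candidate)
                 for r in reference) / len(reference)
    return 2 * precision * recall / (precision + recall or 1)

cand = ["The charter was issued in 1273.", "It grants fishing rights."]
ref = ["In 1273 the charter was issued.", "Fishing rights are granted."]
print(smart_like_score(cand, ref))
```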
- Conical Classification For Computationally Efficient One-Class Topic Determination [0.0]
We propose a Conical classification approach to identify documents that relate to a particular topic.
We show in our analysis that our approach has higher predictive power on our datasets, and is also faster to compute.
arXiv Detail & Related papers (2021-10-31T01:27:12Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Effect of Post-processing on Contextualized Word Representations [20.856802441794162]
Post-processing of static embeddings has been shown to improve their performance on both lexical and sequence-level tasks.
We question the usefulness of post-processing for contextualized embeddings obtained from different layers of pre-trained language models.
arXiv Detail & Related papers (2021-04-15T13:40:42Z)
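A sketch of one widely used post-processing recipe (mean-centering plus removal of the top principal components, as in "all-but-the-top"), applied to a matrix of vectors. Whether such steps help contextualized embeddings is precisely what the paper above questions; the data below is random and purely illustrative.

```python
# Mean-center embeddings and remove their dominant principal directions.
import numpy as np

def postprocess(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    centered = embeddings - embeddings.mean(axis=0)
    # Top principal directions via SVD of the centered matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]  # shape: (n_components, dim)
    # Subtract each vector's projection onto the dominant directions.
    return centered - centered @ top.T @ top

vectors = np.random.randn(100, 768)  # stand-in for layer activations
processed = postprocess(vectors)
print(processed.shape, np.abs(processed.mean(axis=0)).max())
```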
- Predicting the Humorousness of Tweets Using Gaussian Process Preference Learning [56.18809963342249]
We present a probabilistic approach that learns to rank and rate the humorousness of short texts by exploiting human preference judgments and automatically sourced linguistic annotations.
We report system performance for the campaign's two subtasks, humour detection and funniness score prediction, and discuss some issues arising from the conversion between the numeric scores used in the HAHA@IberLEF 2019 data and the pairwise judgment annotations required for our method.
arXiv Detail & Related papers (2020-08-03T13:05:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.