NER4all or Context is All You Need: Using LLMs for low-effort, high-performance NER on historical texts. A humanities informed approach
- URL: http://arxiv.org/abs/2502.04351v1
- Date: Tue, 04 Feb 2025 16:54:23 GMT
- Title: NER4all or Context is All You Need: Using LLMs for low-effort, high-performance NER on historical texts. A humanities informed approach
- Authors: Torsten Hiltmann, Martin Dröge, Nicole Dresselhaus, Till Grallert, Melanie Althage, Paul Bayer, Sophie Eckenstaler, Koray Mendi, Jascha Marijn Schmitz, Philipp Schneider, Wiebke Sczeponik, Anica Skibba
- Abstract summary: We show how readily-available, state-of-the-art LLMs significantly outperform two leading NLP frameworks for NER in historical documents.
Our approach democratises access to NER for all historians by removing the barrier of scripting languages and computational skills required for established NLP tools.
- Score: 0.03187482513047917
- Abstract: Named entity recognition (NER) is a core task for historical research in automatically establishing all references to people, places, events and the like. Yet, due to the high linguistic and genre diversity of sources, the limited canonisation of spellings, the level of required historical domain knowledge, and the scarcity of annotated training data, established approaches to natural language processing (NLP) have been extremely expensive and have yielded only unsatisfactory results in terms of recall and precision. Our paper introduces a new approach. We demonstrate how readily-available, state-of-the-art LLMs significantly outperform two leading NLP frameworks, spaCy and flair, for NER in historical documents, with seven to twenty-two percent higher F1 scores. Our ablation study shows that providing historical context to the task and a bit of persona modelling that turns the focus away from a purely linguistic approach are core to a successful prompting strategy. We also demonstrate that, contrary to our expectations, providing increasing numbers of examples in few-shot approaches does not improve recall or precision below a threshold of 16 examples. In consequence, our approach democratises access to NER for all historians by removing the barrier of scripting languages and computational skills required for established NLP tools, instead leveraging natural language prompts and consumer-grade tools and frontends.
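The prompting strategy the abstract describes (persona, historical context, and the annotation task, all phrased in natural language) lends itself to a compact illustration. The following is a minimal Python sketch assuming an OpenAI-style chat API; the persona wording, the inline tag scheme, and the model name are illustrative assumptions, not the authors' verbatim setup.

```python
# Minimal sketch of prompt-based NER on a historical text.
# Assumptions: an OpenAI-style chat API; persona and context wording
# are illustrative, not the paper's verbatim prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ner_prompt(text: str, context: str) -> str:
    return (
        "You are a historian specialising in early modern Europe. "  # persona
        f"The following passage comes from {context}. "              # historical context
        "Mark all named entities inline: wrap persons in [PER]...[/PER], "
        "places in [LOC]...[/LOC], and organisations in [ORG]...[/ORG]. "
        "Return only the annotated passage.\n\n"
        + text
    )


passage = "Anno 1624 reiste Herzog Wilhelm von Weimar nach Dresden."
response = client.chat.completions.create(
    model="gpt-4o",  # any capable instruction-tuned LLM would do
    messages=[{"role": "user",
               "content": ner_prompt(passage, "a 17th-century German travel diary")}],
    temperature=0,
)
print(response.choices[0].message.content)
```

No scripting pipeline, tokenizer, or model training is involved: the whole "configuration" is the natural-language prompt, which is the point of the paper's low-effort claim.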
Related papers
- Open or Closed LLM for Lesser-Resourced Languages? Lessons from Greek [2.3499129784547663]
We evaluate the performance of open-source (Llama-70b) and closed-source (GPT-4o mini) large language models on seven core NLP tasks for which datasets are available.
Second, we expand the scope of Greek NLP by reframing Authorship Attribution as a tool to assess potential data usage by LLMs in pre-training.
Third, we showcase a legal NLP case study, where a Summarize, Translate, and Embed (STE) methodology outperforms the traditional TF-IDF approach for clustering long legal texts.
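To make the STE idea concrete, here is a hedged Python sketch of the Summarize, Translate, and Embed pipeline followed by clustering; the summarize and translate stand-ins, the embedding model, and the cluster count are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of a Summarize-Translate-Embed (STE) pipeline for
# clustering long legal texts, in contrast to clustering raw TF-IDF
# vectors. The two helper functions are stand-ins for LLM calls.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans


def summarize(doc: str) -> str:
    # Stand-in for an LLM call that condenses a long Greek legal text.
    return doc[:500]


def translate(summary: str) -> str:
    # Stand-in for an LLM call that renders the summary in English.
    return summary


def ste_cluster(docs: list[str], k: int = 5):
    # 1. Summarize, 2. Translate, 3. Embed, then cluster the embeddings.
    english_summaries = [translate(summarize(d)) for d in docs]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
    vectors = embedder.encode(english_summaries)
    return KMeans(n_clusters=k, n_init="auto").fit_predict(vectors)
```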
arXiv Detail & Related papers (2025-01-22T12:06:16Z) - A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution [57.309390098903]
Authorship attribution aims to identify the origin or author of a document.
Large Language Models (LLMs) with their deep reasoning capabilities and ability to maintain long-range textual associations offer a promising alternative.
Our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors.
arXiv Detail & Related papers (2024-10-29T04:14:23Z) - A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages.
This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows CoAnnotating to be an effective means of allocating work across different datasets, with up to 21% performance improvement over a random baseline.
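The allocation idea can be illustrated with a hedged sketch: sample several LLM annotations per item, treat disagreement (for example, label entropy) as uncertainty, and route uncertain items to humans. The sampling count and threshold below are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of uncertainty-guided human/LLM work allocation in the
# spirit of CoAnnotating: items where repeated LLM annotations disagree
# (high label entropy) are routed to human annotators.
import math
from collections import Counter


def label_entropy(labels: list[str]) -> float:
    """Shannon entropy (bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def allocate(texts: list[str], llm_annotate, n_samples: int = 5,
             threshold: float = 0.8) -> dict[str, str]:
    """Route each text to 'llm' or 'human' based on LLM self-disagreement."""
    routing = {}
    for text in texts:
        samples = [llm_annotate(text) for _ in range(n_samples)]
        routing[text] = "human" if label_entropy(samples) > threshold else "llm"
    return routing
```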
arXiv Detail & Related papers (2023-10-24T08:56:49Z) - Efficient Entity Candidate Generation for Low-Resource Languages [13.789451365205665]
Candidate generation is a crucial module in entity linking.
It plays a key role in multiple NLP tasks that have been proven to beneficially leverage knowledge bases.
This paper constitutes an in-depth analysis of the candidate generation problem in the context of cross-lingual entity linking.
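As a toy illustration of what a candidate generation module does, the sketch below looks mentions up in an alias table with a fuzzy-matching fallback; the alias table and the fallback choice are illustrative assumptions, not the paper's method.

```python
# Toy sketch of candidate generation for entity linking: exact alias
# lookup with a fuzzy fallback for noisy or low-resource spellings.
# The alias table is a stand-in for a real knowledge-base index.
from difflib import get_close_matches

ALIAS_TABLE = {
    "new york": "Q60", "nyc": "Q60", "big apple": "Q60",
    "paris": "Q90",
}


def candidates(mention: str, k: int = 3) -> list[str]:
    key = mention.lower()
    if key in ALIAS_TABLE:  # exact alias hit
        return [ALIAS_TABLE[key]]
    close = get_close_matches(key, list(ALIAS_TABLE), n=k)  # fuzzy fallback
    return [ALIAS_TABLE[alias] for alias in close]


print(candidates("New Yorck"))  # misspelled mention -> ['Q60']
```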
arXiv Detail & Related papers (2022-06-30T09:49:53Z) - The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [0.0]
Natural Language Processing models have achieved phenomenal success in linguistic and semantic tasks.
Recent NLP architectures have utilized concepts of transfer learning, pruning, quantization, and knowledge distillation to achieve moderate model sizes.
Knowledge retrievers have been built to extract relevant documents from large corpora with greater efficiency and accuracy.
arXiv Detail & Related papers (2021-03-23T22:38:20Z) - Making Pre-trained Language Models Better Few-shot Learners [11.90626040104822]
The recent GPT-3 model achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context.
Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient.
We present LM-BFF (better few-shot fine-tuning of language models), a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples.
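One of LM-BFF's ingredients is prompt-based fine-tuning with demonstrations: the input is recast as a cloze template for a masked language model, and labelled examples are concatenated as context. A hedged sketch of that input format follows; the template and label words are illustrative choices, not the paper's searched templates.

```python
# Hedged sketch of LM-BFF-style input construction for sentiment
# classification with a masked LM: a cloze template plus in-context
# demonstrations whose [MASK] slots are filled with label words.
TEMPLATE = "{text} It was [MASK]."                          # cloze template
VERBALIZER = {"positive": "great", "negative": "terrible"}  # label words


def build_input(query: str, demos: list[tuple[str, str]]) -> str:
    """Concatenate filled-in demonstrations, then the query with [MASK]."""
    parts = [
        TEMPLATE.replace("[MASK]", VERBALIZER[label]).format(text=text)
        for text, label in demos
    ]
    parts.append(TEMPLATE.format(text=query))
    return " ".join(parts)


print(build_input(
    "The plot was gripping from start to finish.",
    demos=[("A waste of two hours.", "negative"),
           ("An instant classic.", "positive")],
))
# -> "A waste of two hours. It was terrible. An instant classic.
#     It was great. The plot was gripping from start to finish. It was [MASK]."
```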
arXiv Detail & Related papers (2020-12-31T17:21:26Z) - Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the challenges of low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z) - Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [64.22926988297685]
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
In this paper, we explore the landscape of introducing transfer learning techniques for NLP by a unified framework that converts all text-based language problems into a text-to-text format.
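In the text-to-text format, every task, classification included, becomes string-in/string-out, distinguished only by a task prefix. A minimal sketch with a Hugging Face T5 checkpoint follows; the prefixes match those used in the T5 paper, while the checkpoint choice and generation settings are ours.

```python
# Minimal sketch of the text-to-text format: one seq2seq model handles
# translation and summarization, selected only by a task prefix.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")


def run(task_prefix: str, text: str) -> str:
    inputs = tokenizer(task_prefix + text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


print(run("translate English to German: ", "The house is wonderful."))
print(run("summarize: ", "Transfer learning, where a model is first "
          "pre-trained on a data-rich task, has emerged as a powerful "
          "technique in natural language processing."))
```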
arXiv Detail & Related papers (2019-10-23T17:37:36Z)