Vision-Enabled LLMs in Historical Lexicography: Digitising and Enriching Estonian-German Dictionaries from the 17th and 18th Centuries
- URL: http://arxiv.org/abs/2510.07931v1
- Date: Thu, 09 Oct 2025 08:29:22 GMT
- Title: Vision-Enabled LLMs in Historical Lexicography: Digitising and Enriching Estonian-German Dictionaries from the 17th and 18th Centuries
- Authors: Madis Jürviste, Joonatan Jakobson,
- Abstract summary: This article presents research conducted at the Institute of the Estonian Language between 2022 and 2025 on the application of large language models (LLMs) to the study of 17th and 18th century Estonian dictionaries. The authors address three main areas: enriching historical dictionaries with modern word forms and meanings; using vision-enabled LLMs to perform text recognition on sources printed in Gothic script (Fraktur); and preparing for the creation of a unified, cross-source dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This article presents research conducted at the Institute of the Estonian Language between 2022 and 2025 on the application of large language models (LLMs) to the study of 17th and 18th century Estonian dictionaries. The authors address three main areas: enriching historical dictionaries with modern word forms and meanings; using vision-enabled LLMs to perform text recognition on sources printed in Gothic script (Fraktur); and preparing for the creation of a unified, cross-source dataset. Initial experiments with J. Gutslaff's 1648 dictionary indicate that LLMs have significant potential for semi-automatic enrichment of dictionary information. When provided with sufficient context, Claude 3.7 Sonnet accurately provided meanings and modern equivalents for 81% of headword entries. In a text recognition experiment with A. T. Helle's 1732 dictionary, a zero-shot method successfully identified and structured 41% of headword entries into error-free JSON-formatted output. For digitising the Estonian-German dictionary section of A. W. Hupel's 1780 grammar, overlapping tiling of scanned image files is employed, with one LLM being used for text recognition and a second for merging the structured output. These findings demonstrate that even for minor languages LLMs have a significant potential for saving time and financial resources.
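The overlapping tiling mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give tile sizes, overlap widths, or tooling, so the function names and the numbers in the example below are assumptions for illustration only, not the authors' implementation. The sketch only computes crop boxes; the text recognition and merging steps performed by the two LLMs are not shown.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels
Box = Tuple[int, int, int, int]


def tile_starts(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets along one axis so that neighbouring tiles share
    at least `overlap` pixels and the last tile ends at the edge."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    if length <= tile:
        return [0]
    step = tile - overlap
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)  # final tile sits flush with the edge
    return starts


def overlapping_tiles(width: int, height: int,
                      tile_w: int, tile_h: int, overlap: int) -> List[Box]:
    """Crop boxes covering a width x height scan, row by row."""
    return [
        (x, y, min(x + tile_w, width), min(y + tile_h, height))
        for y in tile_starts(height, tile_h, overlap)
        for x in tile_starts(width, tile_w, overlap)
    ]


# Hypothetical numbers: a 1000x1000 scan, 400x400 tiles, 100 px overlap.
boxes = overlapping_tiles(1000, 1000, 400, 400, 100)
```

Because adjacent tiles share a strip of pixels, a dictionary entry cut by one tile boundary appears whole in a neighbouring tile, which is what makes a later merging step possible.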
Related papers
- Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect [17.504351782064113]
Meenzerisch is the dialect spoken in the German city of Mainz. Meenzerisch is on the verge of dying out, a fate it shares with many other German dialects. This work presents the first research in the field of NLP that is explicitly focused on Meenzerisch.
arXiv Detail & Related papers (2026-02-18T20:29:02Z) - SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work [87.9341538630949]
The first Sign Language Production Challenge was held as part of the third SLRTP Workshop at CVPR 2025. The competition's aims are to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses. This paper presents the challenge design and the winning methodologies.
arXiv Detail & Related papers (2025-08-09T11:57:33Z) - Are Lexicon-Based Tools Still the Gold Standard for Valence Analysis in Low-Resource Flemish? [0.0]
Traditional lexicon-based tools such as LIWC and Pattern have long served as foundational instruments in this domain. We first conducted a study involving approximately 25,000 textual responses from 102 Dutch-speaking participants. We assessed the performance of three Dutch-specific LLMs in predicting these valence scores, and compared their outputs to those generated by LIWC and Pattern. This study underscores the imperative for developing culturally and linguistically tailored models/tools that can adeptly handle the complexities of natural language use.
arXiv Detail & Related papers (2025-06-04T16:31:37Z) - Named Entity Recognition in Historical Italian: The Case of Giacomo Leopardi's Zibaldone [4.795582035438343]
There is an urgent need for computational techniques able to adapt to the challenges of historical texts. The rise of large language models (LLMs) has revolutionized natural language processing. No thorough evaluation has been proposed for Italian texts.
arXiv Detail & Related papers (2025-05-26T15:16:48Z) - A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950 [0.0]
This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER). Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes.
arXiv Detail & Related papers (2025-03-25T17:07:21Z) - Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models [52.00446751692225]
We present a novel and simple yet effective method called Dictionary Insertion Prompting (DIP).
When providing a non-English prompt, DIP looks up a word dictionary and inserts words' English counterparts into the prompt for LLMs.
It then enables better translation into English and better English model thinking steps which leads to obviously better results.
arXiv Detail & Related papers (2024-11-02T05:10:50Z) - Are BabyLMs Second Language Learners? [48.85680614529188]
This paper describes a linguistically-motivated approach to the 2024 edition of the BabyLM Challenge.
Rather than pursuing a first language learning (L1) paradigm, we approach the challenge from a second language (L2) learning perspective.
arXiv Detail & Related papers (2024-10-28T17:52:15Z) - AutoLLM-CARD: Towards a Description and Landscape of Large Language Models [11.72819342209987]
Large Language Models (LLMs) continue to emerge for diverse NLP tasks.
As more papers are published, researchers and developers face the challenge of information overload.
We propose a method for automatically generating LLM model cards from scientific publications.
arXiv Detail & Related papers (2024-09-25T15:15:57Z) - LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z) - Large Language Models for Stemming: Promises, Pitfalls and Failures [34.91311006478368]
We investigate the promising idea of using large language models (LLMs) to stem words by leveraging their capability of context understanding.
We compare the use of LLMs for stemming with that of traditional lexical stemmers such as Porter and Krovetz for English text.
arXiv Detail & Related papers (2024-02-19T01:11:44Z) - Native Language Identification with Large Language Models [60.80452362519818]
We show that GPT models are proficient at NLI classification, with GPT-4 setting a new performance record of 91.7% on the benchmark TOEFL11 test set in a zero-shot setting.
We also show that unlike previous fully-supervised settings, LLMs can perform NLI without being limited to a set of known classes.
arXiv Detail & Related papers (2023-12-13T00:52:15Z) - Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.