Multilingual Event Extraction from Historical Newspaper Adverts
- URL: http://arxiv.org/abs/2305.10928v1
- Date: Thu, 18 May 2023 12:40:41 GMT
- Title: Multilingual Event Extraction from Historical Newspaper Adverts
- Authors: Nadav Borenstein and Natalia da Silva Perez and Isabelle Augenstein
- Abstract summary: This paper focuses on the under-explored task of event extraction from a novel domain of historical texts.
We introduce a new multilingual dataset in English, French, and Dutch composed of newspaper ads from the early modern colonial period.
We find that even with scarce annotated data, it is possible to achieve surprisingly good results by formulating the problem as an extractive QA task.
- Score: 42.987470570997694
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: NLP methods can aid historians in analyzing textual materials in greater
volumes than manually feasible. Developing such methods poses substantial
challenges though. First, acquiring large, annotated historical datasets is
difficult, as only domain experts can reliably label them. Second, most
available off-the-shelf NLP models are trained on modern language texts,
rendering them significantly less effective when applied to historical corpora.
This is particularly problematic for less well studied tasks, and for languages
other than English. This paper addresses these challenges while focusing on the
under-explored task of event extraction from a novel domain of historical
texts. We introduce a new multilingual dataset in English, French, and Dutch
composed of newspaper ads from the early modern colonial period reporting on
enslaved people who liberated themselves from enslavement. We find that: 1)
even with scarce annotated data, it is possible to achieve surprisingly good
results by formulating the problem as an extractive QA task and leveraging
existing datasets and models for modern languages; and 2) cross-lingual
low-resource learning for historical languages is highly challenging, and
machine translation of the historical datasets to the considered target
languages is, in practice, often the best-performing solution.
Related papers
- A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts [8.405938712823563]
This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents.
The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian.
It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era.
arXiv Detail & Related papers (2024-07-21T12:14:45Z) - Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large
Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - hmBERT: Historical Multilingual Language Models for Named Entity
Recognition [0.6226609932118123]
We tackle NER for identifying persons, locations, and organizations in historical texts.
In this work, we tackle NER for historical German, English, French, Swedish, and Finnish by training large historical language models.
arXiv Detail & Related papers (2022-05-31T07:30:33Z) - From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early
Modern French [57.886210204774834]
We present our efforts to develop NLP tools for Early Modern French (historical French from the 16$textth$ to the 18$textth$ centuries).
We present the $textFreEM_textmax$ corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on $textFreEM_textmax$.
arXiv Detail & Related papers (2022-02-18T22:17:22Z) - Restoring and Mining the Records of the Joseon Dynasty via Neural
Language Modeling and Machine Translation [20.497110880878544]
We present a multi-task learning approach to restore and translate historical documents based on a self-attention mechanism.
Our approach significantly improves the accuracy of the translation task than baselines without multi-task learning.
arXiv Detail & Related papers (2021-04-13T06:40:25Z) - Summarising Historical Text in Modern Languages [13.886432536330805]
We introduce the task of historical text summarisation, where documents in historical forms of a language are summarised in the corresponding modern language.
This is a fundamentally important routine to historians and digital humanities researchers but has never been automated.
We compile a high-quality gold-standard text summarisation dataset, which consists of historical German and Chinese news from hundreds of years ago summarised in modern German or Chinese.
arXiv Detail & Related papers (2021-01-26T13:00:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.