Summarising Historical Text in Modern Languages
- URL: http://arxiv.org/abs/2101.10759v2
- Date: Wed, 27 Jan 2021 04:17:02 GMT
- Title: Summarising Historical Text in Modern Languages
- Authors: Xutan Peng, Yi Zheng, Chenghua Lin, Advaith Siddharthan
- Abstract summary: We introduce the task of historical text summarisation, where documents in historical forms of a language are summarised in the corresponding modern language.
This routine is fundamentally important for historians and digital humanities researchers but has never been automated.
We compile a high-quality gold-standard text summarisation dataset, which consists of historical German and Chinese news from hundreds of years ago summarised in modern German or Chinese.
- Score: 13.886432536330805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the task of historical text summarisation, where documents in
historical forms of a language are summarised in the corresponding modern
language. This task is fundamentally important for historians and digital
humanities researchers but has never been automated. We compile a high-quality
gold-standard text summarisation dataset, which consists of historical German
and Chinese news from hundreds of years ago summarised in modern German or
Chinese. Based on cross-lingual transfer learning techniques, we propose a
summarisation model that can be trained even with no cross-lingual (historical
to modern) parallel data, and further benchmark it against state-of-the-art
algorithms. We report automatic and human evaluations that distinguish the
historical-to-modern language summarisation task from standard cross-lingual
summarisation (i.e., modern-to-modern language), highlight the distinctness and
value of our dataset, and demonstrate that our transfer learning approach
outperforms standard cross-lingual benchmarks on this task.
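One way to picture the transfer setting the abstract describes (summarising historical text with a system built only from modern-language resources) is a pivot pipeline: first normalise the historical surface form toward the modern language, then summarise with a modern-language summariser. The sketch below is a deliberately minimal, dependency-free illustration of that idea; the orthography rules and the frequency-based extractive scorer are illustrative assumptions, not the paper's neural model.

```python
# Toy pivot pipeline for historical text summarisation:
# (1) normalise historical orthography toward the modern language,
# (2) apply a summariser designed for modern text (here, a trivial
#     frequency-based extractive scorer). All rules are illustrative.
import re
from collections import Counter

# Hypothetical orthography rules (early modern German -> modern German).
NORMALISATION_RULES = [
    ("ſ", "s"),    # long s: "ſeyn" -> "seyn"
    ("th", "t"),   # "thun" -> "tun"
    ("Th", "T"),   # "Thal" -> "Tal"
    ("ey", "ei"),  # "seyn" -> "sein"
]

def normalise(text: str) -> str:
    """Map historical spellings toward modern orthography."""
    for old, new in NORMALISATION_RULES:
        text = text.replace(old, new)
    return text

def extractive_summary(text: str, n_sentences: int = 1) -> str:
    """Pick the sentences whose words are most frequent in the document."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        toks = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    ranked = sorted(sentences, key=score, reverse=True)
    return " ".join(ranked[:n_sentences])

historical = "Das Thal war ſchön. Es ſoll ſchön ſeyn. Niemand war dort."
modern = normalise(historical)
print(modern)                               # Das Tal war schön. Es soll schön sein. Niemand war dort.
print(extractive_summary(modern, 1))        # Das Tal war schön.
```

In the paper's actual setting the normalisation and summarisation steps are learned (via cross-lingual transfer from modern-to-modern summarisation data) rather than rule-based, which is what allows training without historical-to-modern parallel data.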
Related papers
- Contrastive Entity Coreference and Disambiguation for Historical Texts [2.446672595462589]
Existing entity disambiguation methods often fall short in accuracy for historical documents, which are replete with individuals not remembered in contemporary knowledge bases.
This study makes three key contributions to improve cross-document coreference resolution and disambiguation in historical texts.
arXiv Detail & Related papers (2024-06-21T18:22:14Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- Transfer Learning across Several Centuries: Machine and Historian Integrated Method to Decipher Royal Secretary's Diary [1.105375732595832]
NER in historical text faces challenges such as the scarcity of annotated corpora, multilingual variety, various kinds of noise, and conventions far removed from those assumed by contemporary language models.
This paper introduces a Korean historical corpus (the diary of the Royal Secretariat, named SeungJeongWon), recorded over several centuries and recently augmented with named entity information as well as phrase markers carefully annotated by historians.
arXiv Detail & Related papers (2023-06-26T11:00:35Z)
- Multilingual Event Extraction from Historical Newspaper Adverts [42.987470570997694]
This paper focuses on the under-explored task of event extraction from a novel domain of historical texts.
We introduce a new multilingual dataset in English, French, and Dutch composed of newspaper ads from the early modern colonial period.
We find that even with scarce annotated data, it is possible to achieve surprisingly good results by formulating the problem as an extractive QA task.
arXiv Detail & Related papers (2023-05-18T12:40:41Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and article bodies from language-aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Deep Learning for Text Style Transfer: A Survey [71.8870854396927]
Text style transfer is an important task in natural language generation, which aims to control certain attributes in the generated text.
We present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017.
We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data.
arXiv Detail & Related papers (2020-11-01T04:04:43Z)
- WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization [41.578594261746055]
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors.
We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.
arXiv Detail & Related papers (2020-10-07T00:28:05Z)
- Russian Natural Language Generation: Creation of a Language Modelling Dataset and Evaluation with Modern Neural Architectures [0.0]
We provide a novel reference dataset for Russian language modeling.
We experiment with popular modern methods for text generation, namely variational autoencoders and generative adversarial networks.
We evaluate the generated text regarding metrics such as perplexity, grammatical correctness and lexical diversity.
arXiv Detail & Related papers (2020-05-05T20:20:25Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [64.22926988297685]
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
In this paper, we explore the landscape of introducing transfer learning techniques for NLP by a unified framework that converts all text-based language problems into a text-to-text format.
arXiv Detail & Related papers (2019-10-23T17:37:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.