Large Language Models for Summarizing Czech Historical Documents and Beyond
- URL: http://arxiv.org/abs/2508.10368v1
- Date: Thu, 14 Aug 2025 06:07:49 GMT
- Title: Large Language Models for Summarizing Czech Historical Documents and Beyond
- Authors: Václav Tran, Jakub Šmíd, Jiří Martínek, Ladislav Lenc, Pavel Král
- Abstract summary: Text summarization is the task of shortening a larger body of text into a concise version while retaining its essential meaning and key information. We employ large language models such as Mistral and mT5 to achieve new state-of-the-art results on the modern Czech summarization dataset SumeCzech. We also introduce a novel dataset called Posel od Čerchova for the summarization of historical Czech documents, with baseline results.
- Score: 1.4680035572775534
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text summarization is the task of shortening a larger body of text into a concise version while retaining its essential meaning and key information. While summarization has been significantly explored in English and other high-resource languages, Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexities and a scarcity of annotated datasets. Large language models such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and languages. Therefore, we employ these models for Czech summarization, resulting in two key contributions: (1) achieving new state-of-the-art results on the modern Czech summarization dataset SumeCzech using these advanced models, and (2) introducing a novel dataset called Posel od Čerchova for summarization of historical Czech documents with baseline results. Together, these contributions hold great potential for advancing Czech text summarization and open new avenues for research in Czech historical text processing.
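Summarization quality on datasets such as SumeCzech is typically reported with ROUGE-style n-gram overlap between generated and reference summaries. As a minimal illustration only (not the paper's actual evaluation code, and simplified to whitespace tokenization without stemming), ROUGE-1 F1 can be sketched in plain Python; the example strings are invented:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference.

    Simplified sketch: lowercased whitespace tokenization, no stemming,
    clipped counts as in standard ROUGE-1.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Identical texts score 1.0; disjoint texts score 0.0.
print(rouge1_f1("praha je hlavni mesto", "praha je hlavni mesto"))  # 1.0
print(rouge1_f1("praha je hlavni mesto", "brno lezi na morave"))    # 0.0
```

Real evaluations add ROUGE-2 and ROUGE-L, proper tokenization, and multi-reference handling, but the clipped-overlap precision/recall structure is the same.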
Related papers
- Large Language Models for the Summarization of Czech Documents: From History to the Present [2.124799222903955]
Text summarization is the task of automatically condensing longer texts into shorter, coherent summaries while preserving the original meaning and key information. Czech summarization remains underexplored, largely due to the inherent linguistic complexity of Czech and the lack of high-quality annotated datasets. We address this gap by leveraging the capabilities of Large Language Models (LLMs), specifically Mistral and mT5. We also propose a translation-based approach that first translates Czech texts into English, summarizes them using an English-language model, and then translates the summaries back into Czech.
arXiv Detail & Related papers (2025-11-24T07:40:31Z) - ARLED: Leveraging LED-based ARMAN Model for Abstractive Summarization of Persian Long Documents [0.0]
The authors introduce a new dataset of 300,000 full-text Persian papers obtained from the Ensani website. They apply the ARMAN model, based on the Longformer architecture, to generate summaries. Results demonstrate promising performance in Persian text summarization.
arXiv Detail & Related papers (2025-03-13T10:16:46Z) - BenCzechMark : A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism [30.267465719961585]
BenCzechMark (BCM) is the first comprehensive Czech language benchmark designed for large language models. The benchmark encompasses 50 challenging tasks with corresponding test datasets, primarily in native Czech, 14 of them newly collected. These tasks span 8 categories and cover diverse domains, including historical Czech news, essays from pupils or language learners, and spoken word.
arXiv Detail & Related papers (2024-12-23T19:45:20Z) - RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation [2.3577273565334522]
RoLargeSum is a novel large-scale summarization dataset for the Romanian language. It was crawled from various publicly available news websites in Romania and the Republic of Moldova.
arXiv Detail & Related papers (2024-12-15T21:27:33Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z) - Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - CNewSum: A Large-scale Chinese News Summarization Dataset with Human-annotated Adequacy and Deducibility Level [15.969302324314516]
We present a large-scale Chinese news summarization dataset CNewSum.
It consists of 304,307 documents and human-written summaries for the news feed.
Its test set contains adequacy and deducibility annotations for the summaries.
arXiv Detail & Related papers (2021-10-21T03:37:46Z) - Fine-tuning GPT-3 for Russian Text Summarization [77.34726150561087]
This paper showcases ruGPT3 ability to summarize texts, fine-tuning it on the corpora of Russian news with their corresponding human-generated summaries.
We evaluate the resulting texts with a set of metrics, showing that our solution can surpass the state-of-the-art model's performance without additional changes in architecture or loss function.
arXiv Detail & Related papers (2021-08-07T19:01:40Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.