Large Language Models for the Summarization of Czech Documents: From History to the Present
- URL: http://arxiv.org/abs/2511.18848v1
- Date: Mon, 24 Nov 2025 07:40:31 GMT
- Title: Large Language Models for the Summarization of Czech Documents: From History to the Present
- Authors: Václav Tran, Jakub Šmíd, Ladislav Lenc, Jean-Pierre Salmon, Pavel Král
- Abstract summary: Text summarization is the task of automatically condensing longer texts into shorter, coherent summaries while preserving the original meaning and key information. Czech summarization remains underexplored, largely due to the inherent linguistic complexity of Czech and the lack of high-quality annotated datasets. We address this gap by leveraging the capabilities of Large Language Models (LLMs), specifically Mistral and mT5. We also propose a translation-based approach that first translates Czech texts into English, summarizes them using an English-language model, and then translates the summaries back into Czech.
- Score: 2.124799222903955
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text summarization is the task of automatically condensing longer texts into shorter, coherent summaries while preserving the original meaning and key information. Although this task has been extensively studied in English and other high-resource languages, Czech summarization, particularly in the context of historical documents, remains underexplored. This is largely due to the inherent linguistic complexity of Czech and the lack of high-quality annotated datasets. In this work, we address this gap by leveraging the capabilities of Large Language Models (LLMs), specifically Mistral and mT5, which have demonstrated strong performance across a wide range of natural language processing tasks and multilingual settings. In addition, we also propose a translation-based approach that first translates Czech texts into English, summarizes them using an English-language model, and then translates the summaries back into Czech. Our study makes the following main contributions: We demonstrate that LLMs achieve new state-of-the-art results on the SumeCzech dataset, a benchmark for modern Czech text summarization, showing the effectiveness of multilingual LLMs even for morphologically rich, medium-resource languages like Czech. We introduce a new dataset, Posel od Čerchova, designed for the summarization of historical Czech texts. This dataset is derived from digitized 19th-century publications and annotated for abstractive summarization. We provide initial baselines using modern LLMs to facilitate further research in this underrepresented area. By combining cutting-edge models with both modern and historical Czech datasets, our work lays the foundation for further progress in Czech summarization and contributes valuable resources for future research in Czech historical document processing and low-resource summarization more broadly.
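The translation-based approach described in the abstract is a three-stage pipeline: translate Czech to English, summarize in English, translate the summary back to Czech. The paper's own implementation is not shown here; the following is a minimal sketch of the composition only, with placeholder stage functions standing in for the real MT and summarization models:

```python
from typing import Callable

def translation_based_summarizer(
    cs_to_en: Callable[[str], str],
    summarize_en: Callable[[str], str],
    en_to_cs: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose the three stages: Czech -> English -> English summary -> Czech."""
    def summarize_cs(czech_text: str) -> str:
        english_text = cs_to_en(czech_text)          # stage 1: MT into English
        english_summary = summarize_en(english_text)  # stage 2: English summarizer
        return en_to_cs(english_summary)              # stage 3: MT back into Czech
    return summarize_cs

# Placeholder stages for illustration only; in practice each would wrap a
# translation or summarization model.
demo = translation_based_summarizer(
    cs_to_en=lambda t: f"EN({t})",
    summarize_en=lambda t: f"SUM({t})",
    en_to_cs=lambda t: f"CS({t})",
)
print(demo("český text"))  # CS(SUM(EN(český text)))
```

In a real setting, each stage would be swapped for an actual model call (e.g. an MT system for the two translation steps and an English-language summarizer for the middle step); the composition itself stays the same.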
Related papers
- Large Language Models for Summarizing Czech Historical Documents and Beyond [1.4680035572775534]
Text summarization is the task of shortening a larger body of text into a concise version while retaining its essential meaning and key information. We employ large language models such as Mistral and mT5 to achieve new state-of-the-art results on the modern Czech summarization dataset SumeCzech. We introduce a novel dataset called Posel od Čerchova for summarization of historical Czech documents with baseline results.
arXiv Detail & Related papers (2025-08-14T06:07:49Z) - skLEP: A Slovak General Language Understanding Benchmark [0.030113849517062304]
skLEP is the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models.
arXiv Detail & Related papers (2025-06-26T17:35:04Z) - BenCzechMark : A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism [30.267465719961585]
BenCzechMark (BCM) is the first comprehensive Czech language benchmark designed for large language models. Our benchmark encompasses 50 challenging tasks, with corresponding test datasets, primarily in native Czech, with 14 newly collected ones. These tasks span 8 categories and cover diverse domains, including historical Czech news, essays from pupils or language learners, and spoken word.
arXiv Detail & Related papers (2024-12-23T19:45:20Z) - RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation [2.3577273565334522]
RoLargeSum is a novel large-scale summarization dataset for the Romanian language. It was crawled from various publicly available news websites from Romania and the Republic of Moldova.
arXiv Detail & Related papers (2024-12-15T21:27:33Z) - Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z) - Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval [50.882816288076725]
Cross-lingual information retrieval is the task of searching documents in one language with queries in another.
We provide a conceptual framework for organizing different approaches to cross-lingual retrieval using multi-stage architectures for mono-lingual retrieval as a scaffold.
We implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese.
arXiv Detail & Related papers (2023-04-03T14:17:00Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z) - X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents [12.493662336994106]
We present an abstractive cross-lingual summarization dataset for four different languages in the scholarly domain.
We train and evaluate models that process English papers and generate summaries in German, Italian, Chinese and Japanese.
arXiv Detail & Related papers (2022-05-30T12:31:28Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.