CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents
- URL: http://arxiv.org/abs/2603.03884v1
- Date: Wed, 04 Mar 2026 09:35:47 GMT
- Title: CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents
- Authors: Martin Kostelník, Michal Hradiš, Martin Dočekal
- Abstract summary: We introduce a human-annotated benchmark based on Czech historical documents, containing human-defined topics together with manually annotated spans. We evaluate a diverse range of large language models alongside BERT-based models fine-tuned on a distilled development dataset. Results reveal substantial variability among LLMs, with performance ranging from near-human topic detection to pronounced failures in span localization.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic localization aims to identify spans of text that express a given topic defined by a name and description. To study this task, we introduce a human-annotated benchmark based on Czech historical documents, containing human-defined topics together with manually annotated spans and supporting evaluation at both document and word levels. Evaluation is performed relative to human agreement rather than a single reference annotation. We evaluate a diverse range of large language models alongside BERT-based models fine-tuned on a distilled development dataset. Results reveal substantial variability among LLMs, with performance ranging from near-human topic detection to pronounced failures in span localization. While the strongest models approach human agreement, the distilled token embedding models remain competitive despite their smaller scale. The dataset and evaluation framework are publicly available at: https://github.com/dcgm/czechtopic.
Related papers
- SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents [38.797311337915175]
SwissGov-RSD is the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It encompasses a total of 224 multi-parallel documents in English-German, English-French, and English-Italian. We evaluate a variety of open-source and closed-source large language models as well as encoder models across different fine-tuning settings on this new benchmark.
arXiv Detail & Related papers (2025-12-08T13:17:27Z)
- HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks [13.836108236883002]
We introduce HUME: Human Evaluation Framework for Text Embeddings. We measure human performance across 16 MTEB datasets spanning reranking, classification, clustering, and semantic textual similarity. Humans achieve an average performance of 77.6% compared to 80.1% for the best embedding model.
arXiv Detail & Related papers (2025-10-11T06:56:53Z) - Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages [17.028968054304947]
MSumBench is a multi-dimensional, multi-domain evaluation of summarization in English and Chinese.<n>By evaluating eight modern summarization models, we discover distinct performance patterns across domains and languages.
arXiv Detail & Related papers (2025-05-31T13:12:35Z)
- A Dataset and Strong Baselines for Classification of Czech News Texts [0.0]
We present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets.
We define four classification tasks: news source, news category, inferred author's gender, and day of the week.
We show that language-specific pre-trained encoder analysis outperforms selected commercially available large-scale generative language models.
arXiv Detail & Related papers (2023-07-20T07:47:08Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation [20.675242617417677]
Cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding.
This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation.
arXiv Detail & Related papers (2023-06-22T14:31:18Z)
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU and ROUGE, may not be able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make a clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.