Related papers: GerAV: Towards New Heights in German Authorship Verification using Fine-Tuned LLMs on a New Benchmark

URL: http://arxiv.org/abs/2601.13711v1
Date: Tue, 20 Jan 2026 08:08:18 GMT
Title: GerAV: Towards New Heights in German Authorship Verification using Fine-Tuned LLMs on a New Benchmark
Authors: Lotta Kiefer, Christoph Leiter, Sotaro Takeshita, Elena Schmidt, Steffen Eger
Abstract summary: Authorship verification (AV) is the task of determining whether two texts were written by the same author. GerAV is a comprehensive benchmark for German AV comprising over 600k labeled text pairs.
Score: 20.533795195003286
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Authorship verification (AV) is the task of determining whether two texts were written by the same author and has been studied extensively, predominantly for English data. In contrast, large-scale benchmarks and systematic evaluations for other languages remain scarce. We address this gap by introducing GerAV, a comprehensive benchmark for German AV comprising over 600k labeled text pairs. GerAV is built from Twitter and Reddit data, with the Reddit part further divided into in-domain and cross-domain message-based subsets, as well as a profile-based subset. This design enables controlled analysis of the effects of data source, topical domain, and text length. Using the provided training splits, we conduct a systematic evaluation of strong baselines and state-of-the-art models and find that our best approach, a fine-tuned large language model, outperforms recent baselines by up to 0.09 absolute F1 score and surpasses GPT-5 in a zero-shot setting by 0.08. We further observe a trade-off between specialization and generalization: models trained on specific data types perform best under matching conditions but generalize less well across data regimes, a limitation that can be mitigated by combining training sources. Overall, GerAV provides a challenging and versatile benchmark for advancing research on German and cross-domain AV.

Related papers

Large-Scale Aspect-Based Sentiment Analysis with Reasoning-Infused LLMs [1.4732811715354455]
Arctic-ABSA is a collection of powerful models for real-life aspect-based sentiment analysis (ABSA). Our models are tailored to commercial needs, trained on a large corpus of public data alongside carefully generated synthetic data, resulting in a dataset 20 times larger than SemEval14. A single multilingual model maintains 87-91% accuracy across six languages without degrading English performance.
arXiv Detail & Related papers (2026-01-07T13:58:29Z)
Technical Report on the Pangram AI-Generated Text Classifier [0.14732811715354457]
We present Pangram Text, a transformer-based neural network trained to distinguish text written by large language models from text written by humans. We show that Pangram Text is not biased against nonnative English speakers and generalizes to domains and models unseen during training.
arXiv Detail & Related papers (2024-02-21T17:13:41Z)
Paloma: A Benchmark for Evaluating Language Model Fit [112.481957296585]
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training. We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains.
arXiv Detail & Related papers (2023-12-16T19:12:45Z)
Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data. We design a simple but effective ensemble-based framework that combines various transfer learning techniques. We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
Rationale-Guided Few-Shot Classification to Detect Abusive Language [5.977278650516324]
We propose RGFS (Rationale-Guided Few-Shot Classification) for abusive language detection. We introduce two rationale-integrated BERT-based architectures (the RGFS models) and evaluate our systems over five different abusive language datasets.
arXiv Detail & Related papers (2022-11-30T14:47:14Z)
Improving Retrieval Augmented Neural Machine Translation by Controlling Source and Fuzzy-Match Interactions [15.845071122977158]
We build on the idea of Retrieval Augmented Translation (RAT) where top-k in-domain fuzzy matches are found for the source sentence. We propose a novel architecture to control interactions between a source sentence and the top-k fuzzy target-language matches.
arXiv Detail & Related papers (2022-10-10T23:33:15Z)
FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation. The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish. The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering. We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks. When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result. We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z)
Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make clear use of the global context. We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.