Leveraging Multilingual Training for Authorship Representation: Enhancing Generalization across Languages and Domains
- URL: http://arxiv.org/abs/2509.16531v1
- Date: Sat, 20 Sep 2025 04:43:24 GMT
- Title: Leveraging Multilingual Training for Authorship Representation: Enhancing Generalization across Languages and Domains
- Authors: Junghwan Kim, Haotian Zhang, David Jurgens
- Abstract summary: Authorship representation (AR) learning has demonstrated strong performance in authorship attribution tasks. We introduce a novel method for multilingual AR learning that incorporates two key innovations. Our model is trained on over 4.5 million authors across 36 languages and 13 domains.
- Score: 41.44674318564781
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Authorship representation (AR) learning, which models an author's unique writing style, has demonstrated strong performance in authorship attribution tasks. However, prior research has primarily focused on monolingual settings, mostly in English, leaving the potential benefits of multilingual AR models underexplored. We introduce a novel method for multilingual AR learning that incorporates two key innovations: probabilistic content masking, which encourages the model to focus on stylistically indicative words rather than content-specific words, and language-aware batching, which improves contrastive learning by reducing cross-lingual interference. Our model is trained on over 4.5 million authors across 36 languages and 13 domains. It consistently outperforms monolingual baselines in 21 out of 22 non-English languages, achieving an average Recall@8 improvement of 4.85%, with a maximum gain of 15.91% in a single language. Furthermore, it exhibits stronger cross-lingual and cross-domain generalization compared to a monolingual model trained solely on English. Our analysis confirms the effectiveness of both proposed techniques, highlighting their critical roles in the model's improved performance.
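The abstract describes the intuition behind the two techniques but not their implementation. As a rough illustration only, the Python sketch below approximates probabilistic content masking with a document-frequency heuristic (rare tokens are treated as content words and masked more aggressively) and implements language-aware batching by bucketing examples per language. The helper names, the `doc_freq` signal, and the 15% masking budget are assumptions for the sketch, not details taken from the paper.

```python
import random
from collections import defaultdict

MASK = "[MASK]"

def probabilistic_content_mask(tokens, doc_freq, mask_budget=0.15, rng=random):
    """Mask content-heavy tokens with higher probability.

    Hypothetical heuristic: tokens with low document frequency are treated
    as content words and masked more often, so the surviving signal skews
    toward style markers such as function words and punctuation.
    doc_freq maps a lowercased token to a frequency in [0, 1].
    """
    out = []
    for tok in tokens:
        content_score = 1.0 - doc_freq.get(tok.lower(), 0.0)  # rarer -> closer to 1
        out.append(MASK if rng.random() < mask_budget * content_score else tok)
    return out

def language_aware_batches(examples, batch_size, rng=random):
    """Yield batches whose examples all share one language, so every
    in-batch contrastive negative is a same-language author."""
    by_lang = defaultdict(list)
    for ex in examples:  # ex = {"lang": ..., "tokens": ..., "author_id": ...}
        by_lang[ex["lang"]].append(ex)
    batches = []
    for exs in by_lang.values():
        rng.shuffle(exs)
        batches += [exs[i:i + batch_size] for i in range(0, len(exs), batch_size)]
    rng.shuffle(batches)  # interleave languages across training steps
    yield from batches
```

When every negative in a batch shares the anchor's language, the contrastive objective can no longer be satisfied by telling languages apart, so the embedding has to separate authors instead, which is plausibly what "reducing cross-lingual interference" refers to.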
Related papers
- ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset [16.940142252787144]
We study multilingual data curation across thirteen languages. In controlled bilingual experiments, improving data quality for any single language benefits others. We operationalize this approach within an effort that produced a 20T-token pretraining corpus.
arXiv Detail & Related papers (2026-02-16T21:40:03Z)
- UniBERT: Adversarial Training for Language-Universal Representations [2.294953003828613]
UniBERT is a compact multilingual language model that uses an innovative training framework integrating three components: masked language modeling, adversarial training, and knowledge distillation. UniBERT is designed to reduce the computational demands of large-scale models while maintaining competitive performance across various natural language processing tasks.
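The summary names three training signals but not how they are combined. A common pattern is a weighted sum with a soft-target distillation term, sketched below in PyTorch; the temperature and loss weights are generic illustrations, not UniBERT's actual configuration.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target knowledge distillation: KL divergence between
    temperature-smoothed student and teacher distributions (the T*T
    factor keeps gradient magnitudes comparable across temperatures)."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def combined_loss(mlm_loss, adv_loss, kd_loss, w_adv=0.1, w_kd=1.0):
    """Weighted sum of the three components named in the summary; the
    weights here are illustrative placeholders."""
    return mlm_loss + w_adv * adv_loss + w_kd * kd_loss
```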
arXiv Detail & Related papers (2025-03-16T18:44:06Z)
- RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs [13.563021984882704]
We introduce a novel, scalable method for generating high-quality multilingual feedback data.
Our preference-trained model achieves a 54.4% win-rate against Aya 23 8B.
As a result of our study, we expand the frontier of alignment techniques to 23 languages covering half of the world's population.
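The summary reports preference training but does not name the objective. As one widely used instantiation, the sketch below shows a DPO-style loss over (chosen, rejected) response pairs; treat the choice of DPO, the `beta` value, and the pair construction as assumptions rather than the paper's method.

```python
import torch.nn.functional as F

def dpo_style_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct-preference-optimization-style objective.

    Each tensor holds the summed log-probability of a response under the
    trainable policy or the frozen reference model. The loss rewards the
    policy for widening the chosen-vs-rejected margin relative to the
    reference.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```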
arXiv Detail & Related papers (2024-07-02T17:42:30Z)
- Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment [42.624862172666624]
We propose a simple yet effective cross-lingual alignment framework exploiting pairs of translation sentences.
It aligns the internal sentence representations across different languages via multilingual contrastive learning.
Experimental results show that even with less than 0.1‰ of the pre-training tokens, our alignment framework significantly boosts the cross-lingual abilities of generative language models.
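In generic form, contrastive alignment over translation pairs is a symmetric InfoNCE loss on the two sides of each pair; the sketch below is an illustration under our own choices of temperature, normalization, and symmetry, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def translation_pair_infonce(src_emb, tgt_emb, temperature=0.05):
    """src_emb, tgt_emb: (B, d) sentence representations of B translation
    pairs. Each source is pulled toward its own translation and pushed
    away from the other B-1 targets in the batch, and vice versa."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature            # (B, B) similarities
    labels = torch.arange(src.size(0), device=src.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```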
arXiv Detail & Related papers (2023-11-14T11:24:08Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
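The summary fixes only the endpoints of the curriculum (30% non-English rising to 60%). A linear interpolation between them is one plausible reading, sketched below; the schedule shape is our assumption.

```python
def non_english_ratio(step, total_steps, start=0.30, end=0.60):
    """Share of non-English data to sample at a given pre-training step.
    Linear ramp between the two proportions reported in the summary;
    the actual schedule shape is not specified there."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac
```

A data sampler would then draw a non-English document with probability `non_english_ratio(step, total_steps)` at each step.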
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Sabiá: Portuguese Large Language Models [14.801853435122908]
We show that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora.
Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin.
arXiv Detail & Related papers (2023-04-16T20:11:19Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets a new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
- Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
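Operationally, mixed-lingual pre-training interleaves batches drawn from cross-lingual and monolingual objectives. The sketch below shows one way to sample the next task; the task names and mixing weights are chosen for illustration, not taken from the paper.

```python
import random

# Illustrative task mixture; the paper's actual objectives and weights differ.
TASK_WEIGHTS = {"translation": 0.3, "masked_lm": 0.5, "denoising": 0.2}

def sample_pretraining_task(rng=random):
    """Pick which objective the next batch serves, interleaving a
    cross-lingual task (translation) with monolingual ones (masked LM,
    denoising)."""
    tasks = list(TASK_WEIGHTS)
    return rng.choices(tasks, weights=list(TASK_WEIGHTS.values()), k=1)[0]
```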
arXiv Detail & Related papers (2020-10-18T00:21:53Z)