AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts
- URL: http://arxiv.org/abs/2509.22996v1
- Date: Fri, 26 Sep 2025 23:11:17 GMT
- Title: AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts
- Authors: Jiří Milička, Anna Marklová, Václav Cvrček
- Abstract summary: This article presents two corpora of English and Czech texts generated with large language models (LLMs). The motivation is to create a resource for comparing human-written texts with LLM-generated text linguistically. Emphasis was placed on ensuring these resources are multi-genre and rich in terms of topics, authors, and text types.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This article presents two corpora of English and Czech texts generated with large language models (LLMs). The motivation is to create a resource for comparing human-written texts with LLM-generated text linguistically. Emphasis was placed on ensuring these resources are multi-genre and rich in terms of topics, authors, and text types, while maintaining comparability with existing human-created corpora. These generated corpora replicate reference human corpora: BE21 by Paul Baker, which is a modern version of the original Brown Corpus, and the Koditex corpus, which also follows the Brown Corpus tradition but in Czech. The new corpora were generated using models from OpenAI, Anthropic, Alphabet, Meta, and DeepSeek, ranging from GPT-3 (davinci-002) to GPT-4.5, and are tagged according to the Universal Dependencies standard (i.e., they are tokenized, lemmatized, and morphologically and syntactically annotated). The subcorpus size varies according to the model used (the English part contains on average 864k tokens per model, 27M tokens altogether; the Czech part contains on average 768k tokens per model, 21.5M tokens altogether). The corpora are freely available for download under the CC BY 4.0 license (the annotated data are under the CC BY-NC-SA 4.0 license) and are also accessible through the search interface of the Czech National Corpus.
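The abstract notes that the corpora are annotated per the Universal Dependencies standard. UD-annotated corpora are typically distributed in the CoNLL-U format; the sketch below shows how such annotations can be read with a minimal stdlib-only parser. The exact download format of AI Brown / AI Koditex is an assumption here, and the sample sentence is invented for illustration.

```python
# Minimal sketch: reading UD annotations in the CoNLL-U format
# (10 tab-separated columns per token; "#" lines are metadata;
# blank lines separate sentences).

SAMPLE = """\
# text = The corpora are freely available .
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tcorpora\tcorpus\tNOUN\t_\t_\t5\tnsubj\t_\t_
3\tare\tbe\tAUX\t_\t_\t5\tcop\t_\t_
4\tfreely\tfreely\tADV\t_\t_\t5\tadvmod\t_\t_
5\tavailable\tavailable\tADJ\t_\t_\t0\troot\t_\t_
6\t.\t.\tPUNCT\t_\t_\t5\tpunct\t_\t_
"""

def parse_conllu(text):
    """Yield one sentence at a time as a list of
    (form, lemma, upos, head, deprel) tuples."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:                 # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):     # comment / metadata line
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue                 # skip multiword-token and empty-node lines
        form, lemma, upos = cols[1], cols[2], cols[3]
        head, deprel = int(cols[6]), cols[7]
        sentence.append((form, lemma, upos, head, deprel))
    if sentence:                     # flush a final sentence with no trailing blank
        yield sentence

for sent in parse_conllu(SAMPLE):
    print([tok[1] for tok in sent])  # ['the', 'corpus', 'be', 'freely', 'available', '.']
```

Lemma and dependency columns like these are what make the corpora directly comparable across models and with the human-written reference corpora.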
Related papers
- DIETA: A Decoder-only transformer-based model for Italian-English machine TrAnslation [74.85762984118024]
DIETA is a small, decoder-only Transformer model with 0.5 billion parameters. We collect and curate a large parallel corpus consisting of approximately 207 million Italian-English sentence pairs. We release a new small-scale evaluation set, consisting of 450 sentences, based on 2025 WikiNews articles.
arXiv Detail & Related papers (2026-01-25T13:08:43Z)
- The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models [41.865590656976316]
The German Commons is the largest collection of openly licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text.
arXiv Detail & Related papers (2025-10-15T18:24:26Z)
- Benchmark of stylistic variation in LLM-generated texts [0.0]
This study investigates register variation in texts written by humans and comparable texts produced by large language models (LLMs). A similar analysis is replicated on Czech using the AI-Koditex corpus and a Czech multidimensional model.
arXiv Detail & Related papers (2025-09-12T12:12:20Z)
- Dialectal and Low-Resource Machine Translation for Aromanian [44.99833362998488]
This paper presents the process of building a neural machine translation system with support for English, Romanian, and Aromanian. The primary contribution is the creation of the most extensive Aromanian-Romanian parallel corpus to date, consisting of 79,000 sentence pairs. To accomplish this, we introduce a suite of auxiliary tools, including a language-agnostic sentence embedding model for text mining and automated evaluation.
arXiv Detail & Related papers (2024-10-23T10:00:23Z)
- KazParC: Kazakh Parallel Corpus for Machine Translation [3.1119394814248253]
We introduce KazParC, a parallel corpus designed for machine translation across Kazakh, English, Russian, and Turkish.
Our research efforts also extend to the development of a neural machine translation model nicknamed Tilmash.
arXiv Detail & Related papers (2024-03-28T13:19:16Z)
- What's In My Big Data? [67.04525616289949]
We propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora.
WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node.
Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content.
arXiv Detail & Related papers (2023-10-31T17:59:38Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
- Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information [0.629199190108771]
Carolina is a large open corpus of Brazilian Portuguese texts under construction using web-as-corpus methodology.
Carolina's first public version has 653,322,577 tokens, distributed over 7 broad types.
arXiv Detail & Related papers (2023-03-28T16:09:40Z)
- A Systematic Evaluation of Large Language Models of Code [88.34057460577957]
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions.
Although Codex is not open-source, we find that existing open-source models do achieve close results in some programming languages.
We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine.
arXiv Detail & Related papers (2022-02-26T15:53:55Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under the CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.