Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research
- URL: http://arxiv.org/abs/2602.14819v1
- Date: Mon, 16 Feb 2026 15:12:46 GMT
- Title: Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research
- Authors: Matteo Rinaldi, Rossella Varvara, Viviana Patti
- Abstract summary: We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models' pre-training. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction.
- Score: 2.609902663466295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models' pre-training. Furthermore, discussion board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.
Related papers
- MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games [70.37904949359938]
We evaluate language models in multi-turn interactions using a suite of collaborative games that require effective communication about private information.
We find that language models are unable to use interactive collaboration to improve over the non-interactive baseline scenario.
We analyze the linguistic features of these dialogues, assessing the roles of sycophancy, information density, and discourse coherence.
arXiv Detail & Related papers (2026-02-27T17:13:20Z)
- Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus [38.671466605067835]
We track the research trends of the Italian CL and NLP community through an analysis of the contributions to CLiC-it.
We compile the proceedings from the first 10 editions of the CLiC-it conference into the CLiC-it Corpus.
Our goal is to provide the Italian and international research communities with valuable insights into emerging trends and key developments over time.
arXiv Detail & Related papers (2025-09-23T14:06:09Z)
- Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG [55.258582772528506]
We investigate whether the mixture of different document languages impacts generation and citation in unintended ways.
Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English.
We find that models sometimes trade off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone.
arXiv Detail & Related papers (2025-09-17T12:58:18Z)
- BTPD: A Multilingual Hand-curated Dataset of Bengali Transnational Political Discourse Across Online Communities [25.55378198149251]
We present a multilingual dataset of Bengali political discourse (BTPD) collected from three online platforms.
This paper also provides a general overview of its topics and multilingual content.
arXiv Detail & Related papers (2025-06-07T14:43:35Z)
- ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts [0.0]
We present the development and deployment of a linguistic corpus from Twitter posts in English.
The main goal was to create a fully annotated English corpus for linguistic analysis.
We include information on morphology and syntax, as well as NLP features such as tokenization, lemmas, and n-grams.
arXiv Detail & Related papers (2024-07-22T04:48:04Z)
- Neural Conversation Models and How to Rein Them in: A Survey of Failures and Fixes [17.489075240435348]
Recent conditional language models are able to continue any kind of text source in an often seemingly fluent way.
From a linguistic perspective, the complexity of contributing to a conversation is high.
Recent approaches try to tame the underlying language models at various intervention points.
arXiv Detail & Related papers (2023-08-11T12:07:45Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS) [3.57486761615991]
Training multilingual Neural Text-To-Speech (NTTS) models using only monolingual corpora has emerged as a popular way for building voice cloning based Polyglot NTTS systems.
It is essential to understand how the composition of the training corpora affects the quality of multilingual speech synthesis.
arXiv Detail & Related papers (2022-07-04T15:23:06Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)