Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus
- URL: http://arxiv.org/abs/2509.19033v2
- Date: Wed, 24 Sep 2025 07:17:32 GMT
- Title: Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus
- Authors: Chiara Alzetta, Serena Auriemma, Alessandro Bondielli, Luca Dini, Chiara Fazzone, Alessio Miaschi, Martina Miliani, Marta Sartor,
- Abstract summary: We track the research trends of the Italian CL and NLP community through an analysis of the contributions to CLiC-it. We compile the proceedings from the first 10 editions of the CLiC-it conference into the CLiC-it Corpus. Our goal is to provide the Italian and international research communities with valuable insights into emerging trends and key developments over time.
- Score: 38.671466605067835
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Over the past decade, Computational Linguistics (CL) and Natural Language Processing (NLP) have evolved rapidly, especially with the advent of Transformer-based Large Language Models (LLMs). This shift has transformed research goals and priorities, from Lexical and Semantic Resources to Language Modelling and Multimodality. In this study, we track the research trends of the Italian CL and NLP community through an analysis of the contributions to CLiC-it, arguably the leading Italian conference in the field. We compile the proceedings from the first 10 editions of the CLiC-it conference (from 2014 to 2024) into the CLiC-it Corpus, providing a comprehensive analysis of both its metadata, including author provenance, gender, affiliations, and more, as well as the content of the papers themselves, which address various topics. Our goal is to provide the Italian and international research communities with valuable insights into emerging trends and key developments over time, supporting informed decisions and future directions in the field.
Related papers
- Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research [2.609902663466295]
We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models' pre-training. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction.
arXiv Detail & Related papers (2026-02-16T15:12:46Z) - Large-Scale Multidimensional Knowledge Profiling of Scientific Literature [46.15403461273178]
We compile a unified corpus of more than 100,000 papers from 22 major conferences between 2020 and 2025. Our analysis highlights several notable shifts, including the growth of safety, multimodal reasoning, and agent-oriented studies. These findings provide an evidence-based view of how AI research is evolving and offer a resource for understanding broader trends and identifying emerging directions.
arXiv Detail & Related papers (2026-01-21T16:47:05Z) - Challenging the Abilities of Large Language Models in Italian: a Community Initiative [63.94242079171895]
"Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian. It federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities.
arXiv Detail & Related papers (2025-12-04T12:50:29Z) - PLLuM: A Family of Polish Large Language Models [91.61661675434216]
We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants.
arXiv Detail & Related papers (2025-11-05T19:41:49Z) - What fifty-one years of Linguistics and Artificial Intelligence research tell us about their correlation: A scientometric analysis [0.0]
This study provides a thorough scientometric analysis of this correlation, synthesizing the intellectual production over 51 years, from 1974 to 2024. The results indicate that in the 1980s and 1990s, linguistics and AI (AIL) research was not robust, characterized by unstable publication over time. It concludes that linguistics and AI correlation is established at several levels, research centers, journals, and countries shaping AIL knowledge production and reshaping its future frontiers.
arXiv Detail & Related papers (2024-11-29T17:12:06Z) - Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z) - Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning [58.92843729869586]
Vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, but their mastery in a few languages like English restricts their applicability in broader communities.
We propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF).
We construct a CLL benchmark covering 36 languages based on MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance.
arXiv Detail & Related papers (2024-01-30T17:14:05Z) - CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization [25.182666420286132]
Given the rareness of naturally occurring CLS resources, the majority of datasets are forced to rely on translation.
This restricts our ability to observe naturally occurring CLS pairs that capture organic diction, including instances of code-switching.
We introduce CroCoSum, a dataset of cross-lingual code-switched summarization of technology news.
arXiv Detail & Related papers (2023-03-07T17:52:51Z) - A Survey of Code-switching: Linguistic and Social Perspectives for Language Technologies [8.202739294785086]
We offer a survey of code-switching (C-S) covering the literature in linguistics with a reflection on the key issues in language technologies.
From the linguistic perspective, we provide an overview of structural and functional patterns of C-S focusing on the literature from European and Indian contexts.
From the language technologies perspective, we discuss how massive language models fail to represent diverse C-S types due to lack of appropriate training data.
arXiv Detail & Related papers (2023-01-05T09:08:04Z) - A Survey on In-context Learning [77.78614055956365]
In-context learning (ICL) has emerged as a new paradigm for natural language processing (NLP).
We first present a formal definition of ICL and clarify its correlation to related studies.
We then organize and discuss advanced techniques, including training strategies, prompt designing strategies, and related analysis.
arXiv Detail & Related papers (2022-12-31T15:57:09Z) - Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z) - The State and Fate of Linguistic Diversity and Inclusion in the NLP World [12.936270946393483]
Language technologies contribute to promoting multilingualism and linguistic diversity around the world.
Only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications.
arXiv Detail & Related papers (2020-04-20T07:19:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.