A Survey on Spoken Italian Datasets and Corpora
- URL: http://arxiv.org/abs/2501.06557v1
- Date: Sat, 11 Jan 2025 14:33:57 GMT
- Title: A Survey on Spoken Italian Datasets and Corpora
- Authors: Marco Giordano, Claudia Rinaldi,
- Abstract summary: This survey provides a comprehensive analysis of 66 spoken Italian datasets.
The datasets are categorized by speech type, source and context, and demographic and linguistic features.
Challenges related to dataset scarcity, representativeness, and accessibility are discussed.
- Score: 0.3222802562733787
- License:
- Abstract: Spoken language datasets are vital for advancing linguistic research, Natural Language Processing, and speech technology. However, resources dedicated to Italian, a linguistically rich and diverse Romance language, remain underexplored compared to major languages like English or Mandarin. This survey provides a comprehensive analysis of 66 spoken Italian datasets, highlighting their characteristics, methodologies, and applications. The datasets are categorized by speech type, source and context, and demographic and linguistic features, with a focus on their utility in fields such as Automatic Speech Recognition, emotion detection, and education. Challenges related to dataset scarcity, representativeness, and accessibility are discussed alongside recommendations for enhancing dataset creation and utilization. The full dataset inventory is publicly accessible via GitHub and archived on Zenodo, serving as a valuable resource for researchers and developers. By addressing current gaps and proposing future directions, this work aims to support the advancement of Italian speech technologies and linguistic research.
Related papers
- WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages [62.1053122134059]
The paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages.
We have developed a systematic data processing framework tailored for low-resource languages.
arXiv Detail & Related papers (2025-01-24T14:06:29Z) - Monolingual and Multilingual Misinformation Detection for Low-Resource Languages: A Comprehensive Survey [2.5459710368096586]
This survey provides a comprehensive overview of the current research on low-resource language misinformation detection.
We review the existing datasets, methodologies, and tools used in these domains, identifying key challenges related to: data resources, model development, cultural and linguistic context, real-world applications, and research efforts.
Our findings underscore the need for robust, inclusive systems capable of addressing misinformation across diverse linguistic and cultural contexts.
arXiv Detail & Related papers (2024-10-24T03:02:03Z) - Recent Advancements and Challenges of Turkic Central Asian Language Processing [4.189204855014775]
Research in NLP for Central Asian Turkic languages faces typical low-resource language challenges.
Recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks.
arXiv Detail & Related papers (2024-07-06T08:58:26Z) - Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z) - Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z) - SER_AMPEL: a multi-source dataset for speech emotion recognition of
Italian older adults [58.49386651361823]
SER_AMPEL is a multi-source dataset for speech emotion recognition (SER)
It is collected with the aim of providing a reference for speech emotion recognition in case of Italian older adults.
The evidence of the need for such a dataset emerges from the analysis of the state of the art.
arXiv Detail & Related papers (2023-11-24T13:47:25Z) - Multimodal Modeling For Spoken Language Identification [57.94119986116947]
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance.
We propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification.
arXiv Detail & Related papers (2023-09-19T12:21:39Z) - Beyond Counting Datasets: A Survey of Multilingual Dataset Construction
and Necessary Resources [38.814057529254846]
We examine the characteristics of 156 publicly available NLP datasets.
We survey language-proficient NLP researchers and crowd workers per language.
We identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform.
arXiv Detail & Related papers (2022-11-28T18:54:33Z) - Making a MIRACL: Multilingual Information Retrieval Across a Continuum
of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z) - Google Crowdsourced Speech Corpora and Related Open-Source Resources for
Low-Resource Languages and Dialects: An Overview [43.92114369646489]
We have released 38 datasets for building text-to-speech and automatic speech recognition applications.
The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.
arXiv Detail & Related papers (2020-10-14T02:24:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.