Open the Data! Chuvash Datasets
- URL: http://arxiv.org/abs/2407.11982v1
- Date: Fri, 31 May 2024 07:51:19 GMT
- Title: Open the Data! Chuvash Datasets
- Authors: Nikolay Plotnikov, Alexander Antonov,
- Abstract summary: We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
- Score: 50.59120569845975
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In this paper, we introduce four comprehensive datasets for the Chuvash language, aiming to support and enhance linguistic research and technological development for this underrepresented language. These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset. Each dataset is meticulously curated to serve various applications such as machine translation, linguistic analysis, and speech recognition, providing valuable resources for scholars and developers working with the Chuvash language. Together, these datasets represent a significant step towards preserving and promoting the Chuvash language in the digital age.
Related papers
- A multilingual dataset for offensive language and hate speech detection for hausa, yoruba and igbo languages [0.0]
This study addresses the challenge by developing and introducing novel datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo.
We collected data from Twitter and manually annotated it to create datasets for each of the three languages, using native speakers.
We used pre-trained language models to evaluate their efficacy in detecting offensive language in our datasets. The best-performing model achieved an accuracy of 90%.
arXiv Detail & Related papers (2024-06-04T09:58:29Z) - The Evolution of Darija Open Dataset: Introducing Version 2 [0.0]
DODa stands as the largest collaborative project of its kind for Darija-English translation.
This paper explores the strategic importance of DODa, its current achievements, and the envisioned future enhancements.
arXiv Detail & Related papers (2024-05-14T15:08:32Z) - Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z) - Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding.
COD enables dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z) - Documenting Geographically and Contextually Diverse Data Sources: The
BigScience Catalogue of Language Data and Resources [17.69148305999049]
We present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative.
We identify a geographically diverse set of target language groups for which to collect metadata on potential data sources.
To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons.
arXiv Detail & Related papers (2022-01-25T03:05:23Z) - Automatic Speech Recognition Datasets in Cantonese Language: A Survey
and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It combines philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z) - GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented
Dialogue Systems [66.92182084456809]
We introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset from an English ToD dataset.
Our method is based on translating dialogue templates and filling them with local entities in the target-language countries.
We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.
arXiv Detail & Related papers (2021-10-14T19:33:04Z) - Content4All Open Research Sign Language Translation Datasets [27.36513138911057]
We release six datasets comprised of 190 hours of footage on the larger domain of news.
From this, 20 hours of footage have been annotated by Deaf experts and interpreters and is made publicly available for research purposes.
arXiv Detail & Related papers (2021-05-05T22:14:53Z) - The Tatoeba Translation Challenge -- Realistic Data Sets for Low
Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z) - CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under CC0 license and free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.