Related papers: Open the Data! Chuvash Datasets

Related papers

iLSU-T: an Open Dataset for Uruguayan Sign Language Translation [2.0272430076690027]
iLSU-T is an open dataset of interpreted Uruguayan Sign Language RGB videos with audio and text transcriptions. This type of multimodal and curated data is paramount for developing novel approaches to understand or generate tools for sign language processing.
arXiv Detail & Related papers (2025-07-07T18:11:21Z)
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation [28.456351723077088]
This dataset is handcrafted in non-English languages first. Each of these source languages is represented among the 23 languages commonly used by half of the world's population.
arXiv Detail & Related papers (2025-02-06T18:56:37Z)
WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages [62.1053122134059]
The paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages. We have developed a systematic data processing framework tailored for low-resource languages.
arXiv Detail & Related papers (2025-01-24T14:06:29Z)
A Survey on Spoken Italian Datasets and Corpora [0.3222802562733787]
This survey provides a comprehensive analysis of 66 spoken Italian datasets. The datasets are categorized by speech type, source and context, and demographic and linguistic features. Challenges related to dataset scarcity, representativeness, and accessibility are discussed.
arXiv Detail & Related papers (2025-01-11T14:33:57Z)
AzSLD: Azerbaijani Sign Language Dataset for Fingerspelling, Word, and Sentence Translation with Baseline Software [0.0]
The dataset was created within the framework of a vision-based AzSL translation project. AzSLD contains 30,000 videos, each carefully annotated with accurate sign labels and corresponding linguistic translations.
arXiv Detail & Related papers (2024-11-19T21:15:47Z)
A multilingual dataset for offensive language and hate speech detection for Hausa, Yoruba and Igbo languages [0.0]
This study addresses the challenge by developing and introducing novel datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo. We collected data from Twitter and manually annotated it to create datasets for each of the three languages, using native speakers. We used pre-trained language models to evaluate their efficacy in detecting offensive language in our datasets. The best-performing model achieved an accuracy of 90%.
arXiv Detail & Related papers (2024-06-04T09:58:29Z)
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language. We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions. We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources [17.69148305999049]
We present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identify a geographically diverse set of target language groups for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons.
arXiv Detail & Related papers (2022-01-25T03:05:23Z)
Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It combines philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems [66.92182084456809]
We introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset from an English ToD dataset. Our method is based on translating dialogue templates and filling them with local entities in the target-language countries. We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.
arXiv Detail & Related papers (2021-10-14T19:33:04Z)
Content4All Open Research Sign Language Translation Datasets [27.36513138911057]
We release six datasets comprising 190 hours of footage in the larger domain of news. Of this, 20 hours of footage have been annotated by Deaf experts and interpreters and are made publicly available for research purposes.
arXiv Detail & Related papers (2021-05-05T22:14:53Z)
The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs. The main goal is to trigger the development of open translation tools and models with a much broader coverage of the world's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)
CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English. It is diversified, with over 11,000 speakers and over 60 accents. CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.