ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and
Development
- URL: http://arxiv.org/abs/2307.08720v1
- Date: Mon, 17 Jul 2023 04:19:30 GMT
- Title: ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and
Development
- Authors: Yanir Marmor, Kinneret Misgav and Yair Lifshitz
- Abstract summary: ivrit.ai offers a substantial compilation of Hebrew speech across various contexts.
The dataset stands out for its legal accessibility, permitting use at no cost.
Future efforts aim to expand ivrit.ai further, thereby advancing Hebrew's standing in AI research and technology.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce "ivrit.ai", a comprehensive Hebrew speech dataset, addressing
the distinct lack of extensive, high-quality resources for advancing Automated
Speech Recognition (ASR) technology in Hebrew. With over 3,300 speech hours and
more than a thousand diverse speakers, ivrit.ai offers a substantial compilation
of Hebrew speech across various contexts. It is delivered in three forms to
cater to varying research needs: raw unprocessed audio, data post-Voice
Activity Detection (VAD), and partially transcribed data. The dataset stands out for
its legal accessibility, permitting use at no cost, thereby serving as a
crucial resource for researchers, developers, and commercial entities. ivrit.ai
opens up numerous applications, offering vast potential to enhance AI
capabilities in Hebrew. Future efforts aim to expand ivrit.ai further, thereby
advancing Hebrew's standing in AI research and technology.
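Because the corpus ships in three forms (raw audio, post-VAD segments, and a partially transcribed subset), a natural first experiment is to pull the post-VAD split and run it through an off-the-shelf multilingual recognizer. The sketch below assumes the data is published on the Hugging Face Hub under a hypothetical ivrit-ai/audio-vad identifier with "audio" and "text" columns, and uses OpenAI's Whisper purely as a stand-in Hebrew-capable ASR model; neither the repository name, the column names, nor the model choice is specified in the paper.

```python
# Minimal sketch: stream a post-VAD split of ivrit.ai and transcribe a few
# segments with an off-the-shelf multilingual ASR model.
# NOTE: the dataset identifier "ivrit-ai/audio-vad" and the column name
# "audio" are assumptions for illustration, not taken from the paper;
# adjust them to whatever the released corpus actually uses.
from datasets import Audio, load_dataset
from transformers import pipeline

# Stream the data so the full ~3,300 hours are never downloaded at once.
ds = load_dataset("ivrit-ai/audio-vad", split="train", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # Whisper expects 16 kHz

# Whisper serves here only as a convenient baseline recognizer for Hebrew.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

for i, example in enumerate(ds):
    audio = example["audio"]
    result = asr(
        {"array": audio["array"], "sampling_rate": audio["sampling_rate"]},
        generate_kwargs={"language": "hebrew", "task": "transcribe"},
    )
    print(result["text"])
    if i == 4:  # transcribe only the first five VAD segments
        break
```

The partially transcribed subset could be used the same way, with its reference transcripts serving as targets for fine-tuning or for measuring word error rate.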
Related papers
- The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings [0.0]
We present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages.
We focus on three Slavic languages, namely Croatian, Polish, and Serbian.
The results of this pilot run are three high-quality datasets that span more than 5,000 hours of speech and accompanying text transcripts.
arXiv Detail & Related papers (2024-09-23T10:12:18Z) - Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction.
Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese.
We propose an approach to enhance SER performance in languages with limited SER resources by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z) - ViSpeR: Multilingual Audio-Visual Speech Recognition [9.40993779729177]
This work presents an extensive and detailed study on Audio-Visual Speech Recognition for five widely spoken languages.
We have collected large-scale datasets for each language except English and trained supervised learning models on them.
Our model, ViSpeR, is trained in a multi-lingual setting, resulting in competitive performance on newly established benchmarks for each language.
arXiv Detail & Related papers (2024-05-27T14:48:51Z) - Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z) - Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z) - Discovering Phonetic Inventories with Crosslingual Automatic Speech
Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z) - Automatic Speech Recognition Datasets in Cantonese Language: A Survey
and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It spans the philosophy, politics, education, culture, lifestyle, and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z) - ParaShoot: A Hebrew Question Answering Dataset [22.55706811131828]
ParaShoot is the first question-answering dataset in modern Hebrew.
We provide the first baseline results using recently-released BERT-style models for Hebrew.
arXiv Detail & Related papers (2021-09-23T11:59:38Z) - AI4D -- African Language Program [0.21960481478626018]
This work details the AI4D - African Language Program, a 3-part project that incentivised the crowd-sourcing, collection and curation of language datasets.
Key outcomes of the work so far include 1) the creation of 9+ open source, African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets.
arXiv Detail & Related papers (2021-04-06T13:51:16Z) - LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)