A Corpus for Large-Scale Phonetic Typology
- URL: http://arxiv.org/abs/2005.13962v1
- Date: Thu, 28 May 2020 13:03:51 GMT
- Title: A Corpus for Large-Scale Phonetic Typology
- Authors: Elizabeth Salesky, Eleanor Chodroff, Tiago Pimentel, Matthew Wiesner,
Ryan Cotterell, Alan W Black and Jason Eisner
- Abstract summary: We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
It provides aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
- Score: 112.19288631037055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A major hurdle in data-driven research on typology is having sufficient data
in many languages to draw meaningful conclusions. We present VoxClamantis v1.0,
the first large-scale corpus for phonetic typology, with aligned segments and
estimated phoneme-level labels in 690 readings spanning 635 languages, along
with acoustic-phonetic measures of vowels and sibilants. Access to such data
can greatly facilitate investigation of phonetic typology at a large scale and
across many languages. However, it is non-trivial and computationally intensive
to obtain such alignments for hundreds of languages, many of which have few to
no resources presently available. We describe the methodology to create our
corpus, discuss caveats with current methods and their impact on the utility of
this data, and illustrate possible research directions through a series of case
studies on the 48 highest-quality readings. Our corpus and scripts are publicly
available for non-commercial use at https://voxclamantisproject.github.io.
Related papers
- Phonetically rich corpus construction for a low-resourced language [0.0]
This paper proposes a novel approach to create a corpus with broad phonetic coverage for a low-resourced language.
Our methodology runs from text dataset collection through to a sentence selection algorithm based on triphone distribution.
Using our algorithm, we achieve a 55.8% higher percentage of distinct triphones for samples of similar size.
arXiv Detail & Related papers (2024-02-08T16:36:11Z)
- Massively Multilingual Corpus of Sentiment Datasets and Multi-faceted Sentiment Classification Benchmark [7.888702613862612]
This work presents the most extensive open massively multilingual corpus of datasets for training sentiment models.
The corpus consists of 79 manually selected datasets from over 350 datasets reported in the scientific literature.
We present a multi-faceted sentiment classification benchmark summarizing hundreds of experiments conducted on different base models, training objectives, dataset collections, and fine-tuning strategies.
arXiv Detail & Related papers (2023-06-13T16:54:13Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- A Multi-Purpose Audio-Visual Corpus for Multi-Modal Persian Speech Recognition: the Arman-AV Dataset [2.594602184695942]
This paper presents a new multipurpose audio-visual dataset for Persian.
It consists of almost 220 hours of videos with 1760 corresponding speakers.
The dataset is suitable for automatic speech recognition, audio-visual speech recognition, and speaker recognition.
arXiv Detail & Related papers (2023-01-21T05:13:30Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Simple or Complex? Learning to Predict Readability of Bengali Texts [6.860272388539321]
We present a readability analysis tool capable of analyzing text written in the Bengali language.
Despite being the 7th most spoken language in the world with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing.
arXiv Detail & Related papers (2020-12-09T01:41:35Z)
- Phonotactic Complexity and its Trade-offs [73.10961848460613]
A simple measure of phonotactic complexity, bits per phoneme, allows us to compare entropy across languages.
We demonstrate a very strong negative correlation of -0.74 between bits per phoneme and the average length of words.
arXiv Detail & Related papers (2020-05-07T21:36:59Z)
- Mapping Languages: The Corpus of Global Language Use [0.0]
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.
In total, the corpus contains 423 billion words representing 148 languages and 158 countries.
arXiv Detail & Related papers (2020-04-02T03:42:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.