Linguistic Resources for Bhojpuri, Magahi and Maithili: Statistics about
them, their Similarity Estimates, and Baselines for Three Applications
- URL: http://arxiv.org/abs/2004.13945v2
- Date: Tue, 17 Aug 2021 05:54:21 GMT
- Title: Linguistic Resources for Bhojpuri, Magahi and Maithili: Statistics about
them, their Similarity Estimates, and Baselines for Three Applications
- Authors: Rajesh Kumar Mundotiya, Manish Kumar Singh, Rahul Kapur, Swasti
Mishra, Anil Kumar Singh
- Abstract summary: Bhojpuri, Magahi, and Maithili are low-resource languages of the Purvanchal region of India.
We calculated some basic statistical measures for these corpora at character, word, syllable, and morpheme levels.
The results were compared with a standard Hindi corpus.
- Score: 0.6649753747542209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Corpus preparation for low-resource languages and for development of human
language technology to analyze or computationally process them is a laborious
task, primarily due to the unavailability of expert linguists who are native
speakers of these languages and also due to the time and resources required.
Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in
the north-eastern parts), are low-resource languages belonging to the
Indo-Aryan (or Indic) family. They are closely related to Hindi, a relatively
high-resource language, which is why we compare them with Hindi. We
collected corpora for these three languages from various sources and cleaned
them to the extent possible, without changing the data in them. The text
belongs to different domains and genres. We calculated some basic statistical
measures for these corpora at character, word, syllable, and morpheme levels.
These corpora were also annotated with parts-of-speech (POS) and chunk tags.
The basic statistical measures were both absolute and relative and were
expected to be indicative of linguistic properties such as morphological, lexical,
phonological, and syntactic complexities (or richness). The results were
compared with a standard Hindi corpus. For most of the measures, we tried to
keep the corpus size the same across the languages to avoid the effect of corpus
size, but in some cases it turned out that using the full corpus was better,
even if sizes were very different. Although the results are not very clear, we
try to draw some conclusions about the languages and the corpora. For POS
tagging and chunking, the BIS tagset was used to manually annotate the data.
The POS tagged data sizes are 16067, 14669 and 12310 sentences, respectively,
for Bhojpuri, Magahi and Maithili. The sizes for chunking are 9695 and 1954
sentences for Bhojpuri and Maithili, respectively.
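As a rough illustration of the kind of word- and character-level measures the abstract mentions, the following is a minimal sketch, not the authors' implementation. The function name `basic_stats`, the whitespace tokenization, and the `max_tokens` cut-off (used here to equalize corpus size across languages) are all assumptions made for this example. Syllable- and morpheme-level measures would need language-specific tools and are not covered.

```python
from collections import Counter


def basic_stats(text, max_tokens=None):
    """Compute simple absolute and relative corpus statistics.

    Hypothetical helper: tokens are whitespace-separated strings and
    characters are Unicode code points, so Devanagari combining marks
    (vowel signs, virama) are counted as separate characters.
    """
    tokens = text.split()
    if max_tokens is not None:
        # Truncate so corpora of different sizes can be compared fairly.
        tokens = tokens[:max_tokens]

    word_counts = Counter(tokens)
    char_counts = Counter(ch for tok in tokens for ch in tok)

    n_tokens = len(tokens)
    n_chars = sum(char_counts.values())
    return {
        "tokens": n_tokens,                      # absolute: running words
        "types": len(word_counts),               # absolute: distinct words
        "type_token_ratio": len(word_counts) / n_tokens if n_tokens else 0.0,
        "characters": n_chars,                   # absolute: non-space characters
        "distinct_characters": len(char_counts),
        "mean_word_length": n_chars / n_tokens if n_tokens else 0.0,
    }
```

Calling `basic_stats(corpus_text, max_tokens=n)` with the same `n` for every language gives size-matched figures, while omitting `max_tokens` uses the full corpus, which mirrors the two settings described in the abstract.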
Related papers
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- SAHAAYAK 2023 -- the Multi Domain Bilingual Parallel Corpus of Sanskrit
to Hindi for Machine Translation [0.0]
The corpus contains a total of 1.5M sentence pairs between Sanskrit and Hindi.
Data from multiple domains has been incorporated into the corpus, including News, Daily conversations, Politics, History, Sport, and Ancient Indian Literature.
arXiv Detail & Related papers (2023-06-27T11:06:44Z)
- Machine Translation by Projecting Text into the Same
Phonetic-Orthographic Space Using a Common Encoding [3.0422770070015295]
We propose an approach based on common multilingual Latin-based encodings (WX notation) that take advantage of language similarity.
We verify the proposed approach by demonstrating experiments on similar language pairs.
We also get an improvement of up to 1 BLEU point on distant and zero-shot language pairs.
arXiv Detail & Related papers (2023-05-21T06:46:33Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Annotated Speech Corpus for Low Resource Indian Languages: Awadhi,
Bhojpuri, Braj and Magahi [2.84214511742034]
We develop a speech corpus for four low-resource Indo-Aryan languages -- Awadhi, Bhojpuri, Braj and Magahi.
The total size of the corpus currently stands at approximately 18 hours.
We discuss our methodology for data collection in these languages, most of which was done in the middle of the COVID-19 pandemic.
arXiv Detail & Related papers (2022-06-26T17:28:38Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset [0.0]
This paper presents the first major publicly available Marathi Sentiment Analysis dataset - L3CubeMahaSent.
It is curated using tweets extracted from various Maharashtrian personalities' Twitter accounts.
Our dataset consists of 16,000 distinct tweets classified in three broad classes viz. positive, negative, and neutral.
arXiv Detail & Related papers (2021-03-21T14:22:13Z)
- A Multilingual Parallel Corpora Collection Effort for Indian Languages [43.62422999765863]
We present sentence aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English.
The corpora are compiled from online sources which have content shared across languages.
arXiv Detail & Related papers (2020-07-15T14:00:18Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
The corpus contains aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.