LSCP: Enhanced Large Scale Colloquial Persian Language Understanding
- URL: http://arxiv.org/abs/2003.06499v1
- Date: Fri, 13 Mar 2020 22:24:14 GMT
- Title: LSCP: Enhanced Large Scale Colloquial Persian Language Understanding
- Authors: Hadi Abdi Khojasteh, Ebrahim Ansari, Mahdi Bohlouli
- Abstract summary: The "Large Scale Colloquial Persian Dataset" (LSCP) aims to describe the colloquial language of low-resource languages such as Persian.
The proposed corpus consists of 120M sentences derived from 27M tweets, annotated with parse trees, part-of-speech tags, sentiment polarity, and translations into five different languages.
- Score: 2.7249643773851724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language recognition has been significantly advanced in recent years by means
of modern machine learning methods, such as deep learning, and benchmarks with
rich annotations. However, research is still limited even for the formal
registers of low-resource languages, which leaves a significant gap in
describing colloquial language, especially for low-resource languages such as
Persian. To target this gap, we propose the "Large Scale Colloquial Persian
Dataset" (LSCP). LSCP is hierarchically organized in a semantic taxonomy that
frames multi-task informal Persian language understanding as a comprehensive
problem. This encompasses the recognition of multiple semantic aspects in
human-level sentences, which are naturally captured from real-world text. We
believe that further investigation and processing, as well as the application
of novel algorithms and methods, can strengthen and enrich the computational
understanding and processing of low-resource languages. The proposed corpus
consists of 120M sentences derived from 27M tweets, annotated with parse
trees, part-of-speech tags, sentiment polarity, and translations into five
different languages.
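To make the annotation layout concrete, here is a minimal sketch of what a single LSCP-style record could look like and how it might be consumed. All field names and language codes below are illustrative assumptions, not the dataset's published schema.

```python
# A minimal sketch of one LSCP-style record; field names and language
# codes are assumptions, not the dataset's published schema.
from typing import Dict, List

# One colloquial Persian sentence with the four annotation layers the
# abstract describes: parse tree, POS tags, sentiment polarity, and
# translations into five languages.
record: Dict = {
    "sentence": "...",                 # raw colloquial Persian text
    "pos_tags": ["NOUN", "VERB"],      # one tag per token
    "parse_tree": "(ROOT (S ...))",    # bracketed parse of the sentence
    "sentiment": "positive",           # polarity label
    "translations": {                  # five target languages (assumed codes)
        "en": "...", "de": "...", "cs": "...", "it": "...", "hi": "...",
    },
}

def sentiment_counts(records: List[Dict]) -> Dict[str, int]:
    """Tally sentiment labels across a collection of records."""
    counts: Dict[str, int] = {}
    for r in records:
        counts[r["sentiment"]] = counts.get(r["sentiment"], 0) + 1
    return counts

print(sentiment_counts([record]))  # {'positive': 1}
```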
Related papers
- The Zeno's Paradox of `Low-Resource' Languages [20.559416975723142]
We show how several interacting axes contribute to the `low-resourcedness' of a language.
We hope our work elicits explicit definitions of the terminology when it is used in papers.
arXiv Detail & Related papers (2024-10-28T08:05:34Z)
- Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment [13.997006139875563]
Cross-lingual word representations for low-resource languages are notably under-aligned with those for high-resource languages in current models.
We introduce a novel framework that explicitly aligns words between English and eight low-resource languages, utilizing off-the-shelf word alignment models.
arXiv Detail & Related papers (2024-04-03T05:58:53Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time, starting from the resource-rich source and sequentially adding each language in the chain until we reach the target (a schematic sketch of this chain appears after this list).
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Progressive Sentiment Analysis for Code-Switched Text Data [26.71396390928905]
We focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data.
We propose a framework that takes the distinction between resource-rich and low-resource languages into account.
arXiv Detail & Related papers (2022-10-25T23:13:53Z)
- A simple language-agnostic yet very strong baseline system for hate speech and offensive content identification [0.0]
A system based on a classical supervised algorithm fed only with character n-grams, and thus completely language-agnostic, is proposed (a minimal sketch of such a baseline appears after this list).
It reaches a medium performance level in English, the language for which it is easiest to develop deep learning approaches.
It even ranks first when performance is averaged over the three tasks in these languages, outperforming many deep learning approaches.
arXiv Detail & Related papers (2022-02-05T08:09:09Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios [48.57072884674938]
We propose a method to analyze language similarity using deep learning.
Namely, we train a model on the Wilderness dataset and investigate how its latent space compares with classical language family findings.
arXiv Detail & Related papers (2020-12-01T22:44:42Z)
- Combining Pretrained High-Resource Embeddings and Subword Representations for Low-Resource Languages [24.775371434410328]
We explore techniques exploiting the qualities of morphologically rich languages (MRLs).
We show that a meta-embedding approach combining both pretrained and morphologically-informed word embeddings performs best in the downstream task of Xhosa-English translation (see the meta-embedding sketch after this list).
arXiv Detail & Related papers (2020-03-09T21:30:55Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages together.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
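For the chain-based multilingual word embedding entry above, the summary only describes the high-level loop: start from a resource-rich source space and sequentially map each language in the chain into it. The sketch below illustrates that loop with an orthogonal Procrustes mapping as a stand-in aligner; the mapping choice, row-aligned toy matrices, and language chain are our assumptions, not the paper's anchor-based details.

```python
# Schematic chain-based construction of multilingual word embeddings.
# The Procrustes aligner is a stand-in for whatever alignment the
# method actually uses; rows are assumed pre-aligned (e.g., via seed
# dictionaries), which is a simplifying assumption.
import numpy as np

def learn_mapping(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """Orthogonal Procrustes: rotation mapping tgt space into src space."""
    u, _, vt = np.linalg.svd(tgt_vecs.T @ src_vecs)
    return u @ vt

def build_mwes(chain, embeddings):
    """Add languages to the shared space one link of the chain at a time."""
    shared = {chain[0]: embeddings[chain[0]]}
    anchor = embeddings[chain[0]]          # resource-rich source space
    for lang in chain[1:]:
        w = learn_mapping(anchor, embeddings[lang])
        shared[lang] = embeddings[lang] @ w
        anchor = shared[lang]              # next link aligns to the previous one
    return shared

# Toy usage: a hypothetical chain from English through German to a
# low-resource target (random vectors standing in for real embeddings).
rng = np.random.default_rng(0)
embeddings = {lang: rng.normal(size=(100, 8)) for lang in ["en", "de", "yi"]}
spaces = build_mwes(["en", "de", "yi"], embeddings)
print(spaces["yi"].shape)  # (100, 8)
```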
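The language-agnostic hate speech baseline above is specified only as a classical supervised algorithm fed with character n-grams. Here is a minimal sketch of one plausible instantiation; the choice of TF-IDF features and a linear SVM is our assumption, not necessarily the authors' exact setup.

```python
# A plausible character n-gram baseline: TF-IDF over character n-grams
# plus a linear SVM (assumed components; the paper specifies only a
# classical supervised algorithm fed with character n-grams).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Character n-grams need no tokenizer or language-specific resources,
# which is what makes the baseline language-agnostic.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 5)),
    LinearSVC(),
)

# Toy labelled examples standing in for a real shared-task corpus.
texts = ["have a lovely day", "I hate you so much",
         "what a kind person", "go away, idiot"]
labels = ["not_offensive", "offensive", "not_offensive", "offensive"]

model.fit(texts, labels)
print(model.predict(["you are an idiot"]))  # expected: ['offensive']
```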
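Finally, for the meta-embedding entry above, a common recipe is to combine the pretrained and morphologically-informed vector spaces by concatenation (or averaging). The sketch below shows concatenation on toy vectors; it is a generic meta-embedding construction under our assumptions, not the authors' exact method.

```python
# Generic concatenation-based meta-embedding on toy vectors; the
# authors' exact combination method may differ.
import numpy as np

# Two views of each word: a pretrained distributional table and a
# morphologically-informed (e.g., subword-based) table. Words and
# vectors here are toy examples.
pretrained = {"umntu": np.array([0.2, 0.7]), "abantu": np.array([0.1, 0.9])}
morphological = {"umntu": np.array([0.5, 0.1, 0.3]), "abantu": np.array([0.6, 0.2, 0.2])}

def meta_embedding(word: str) -> np.ndarray:
    """Concatenate L2-normalized views of a word into one meta-vector."""
    a = pretrained[word] / np.linalg.norm(pretrained[word])
    b = morphological[word] / np.linalg.norm(morphological[word])
    return np.concatenate([a, b])

print(meta_embedding("umntu").shape)  # (5,)
```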
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.