Informal Persian Universal Dependency Treebank
- URL: http://arxiv.org/abs/2201.03679v1
- Date: Mon, 10 Jan 2022 22:33:07 GMT
- Title: Informal Persian Universal Dependency Treebank
- Authors: Roya Kabiri, Simin Karimi, Mihai Surdeanu
- Abstract summary: This paper presents the phonological, morphological, and syntactic distinctions between formal and informal Persian.
We develop the open-source Informal Persian Universal Dependency Treebank, a new treebank annotated within the Universal Dependencies scheme.
- Score: 19.359203472636835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the phonological, morphological, and syntactic
distinctions between formal and informal Persian, showing that these two
variants have fundamental differences that cannot be attributed solely to
pronunciation discrepancies. Given that informal Persian exhibits particular
characteristics, any computational model trained on formal Persian is unlikely
to transfer well to informal Persian, necessitating the creation of dedicated
treebanks for this variety. We thus detail the development of the open-source
Informal Persian Universal Dependency Treebank, a new treebank annotated within
the Universal Dependencies scheme. We then investigate the parsing of informal
Persian by training two dependency parsers on existing formal treebanks and
evaluating them on out-of-domain data, i.e. the development set of our informal
treebank. Our results show that parsers experience a substantial performance
drop when we move across the two domains, as they face more unknown tokens and
structures and fail to generalize well. Furthermore, the dependency relations
whose performance deteriorates the most represent the unique properties of the
informal variant. The ultimate, broader-impact goal of this study is to provide a
stepping stone toward revealing the significance of informal language varieties,
which have been widely overlooked in natural language processing tools across
languages.
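The abstract describes evaluating parsers trained on formal treebanks against out-of-domain informal data. A minimal sketch of how such an evaluation is typically scored, using the standard UAS/LAS attachment metrics over CoNLL-U output (the two-token fragments and all values below are invented illustrations, not the paper's data):

```python
# Hypothetical sketch: computing UAS/LAS, the standard dependency-parsing
# metrics, by comparing gold vs. predicted CoNLL-U annotations.

def read_conllu(text):
    """Return (head, deprel) pairs for each token line in a CoNLL-U string."""
    rows = []
    for line in text.strip().splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip sentence comments and blank separators
        cols = line.split("\t")
        rows.append((cols[6], cols[7]))  # HEAD and DEPREL columns
    return rows

def uas_las(gold, pred):
    """Unlabeled/labeled attachment scores over aligned token lists."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

# Invented two-token example: the predicted parse gets both heads right
# but mislabels the root relation.
gold = ("1\tketab\tketab\tNOUN\t_\t_\t2\tobj\t_\t_\n"
        "2\tkharid\tkharidan\tVERB\t_\t_\t0\troot\t_\t_")
pred = ("1\tketab\tketab\tNOUN\t_\t_\t2\tobj\t_\t_\n"
        "2\tkharid\tkharidan\tVERB\t_\t_\t0\tdep\t_\t_")

u, l = uas_las(read_conllu(gold), read_conllu(pred))
print(u, l)  # 1.0 0.5 -- heads all correct, one label wrong
```

The domain gap the paper reports would show up here as a drop in both scores when the gold file comes from the informal development set rather than a held-out formal section.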
Related papers
- MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD)
We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z)
- Dependency Annotation of Ottoman Turkish with Multilingual BERT [0.0]
This study introduces a pretrained large language model-based annotation methodology for the first dependency treebank in Ottoman Turkish.
The resulting treebank will facilitate automated analysis of Ottoman Turkish documents, unlocking the linguistic richness embedded in this historical heritage.
arXiv Detail & Related papers (2024-02-22T17:58:50Z) - Retrieval-based Disentangled Representation Learning with Natural
Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish the intrinsic dimensions that capture characteristics within data through their natural language counterparts, thus achieving disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z) - Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z) - LyS_ACoru\~na at SemEval-2022 Task 10: Repurposing Off-the-Shelf Tools
for Sentiment Analysis as Semantic Dependency Parsing [10.355938901584567]
This paper addresses the problem of structured sentiment analysis using a bi-affine semantic dependency parser.
For the monolingual setup, we considered: (i) training on a single treebank, and (ii) relaxing the setup by training on treebanks coming from different languages.
For the zero-shot setup and a given target treebank, we relied on: (i) a word-level translation of available treebanks in other languages to obtain noisy, likely ungrammatical, but annotated data.
In the post-evaluation phase, we also trained cross-lingual models that simply merged all the English treebanks.
arXiv Detail & Related papers (2022-04-27T10:21:28Z) - A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - The Persian Dependency Treebank Made Universal [3.4410212782758047]
This treebank contains 29,107 sentences.
Our data is more compatible with Universal Dependencies than the Persian Universal Dependency Treebank (Seraji et al., 2016)
Our delexicalized Persian-to-English transfer experiments show that a parsing model trained on our data is 2% more accurate than that of Seraji et al.
arXiv Detail & Related papers (2020-09-21T22:34:13Z) - I3rab: A New Arabic Dependency Treebank Based on Arabic Grammatical
Theory [0.0]
This paper constructs a new Arabic dependency treebank based on traditional Arabic grammatical theory and the characteristics of the Arabic language.
The proposed Arabic dependency treebank, called I3rab, contrasts with existing Arabic dependency treebanks in two main concepts.
arXiv Detail & Related papers (2020-07-11T13:34:44Z) - Discrete Variational Attention Models for Language Generation [51.88612022940496]
We propose a discrete variational attention model with categorical distribution over the attention mechanism owing to the discrete nature in languages.
Thanks to the property of discreteness, the training of our proposed approach does not suffer from posterior collapse.
arXiv Detail & Related papers (2020-04-21T05:49:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.