Informal Persian Universal Dependency Treebank
- URL: http://arxiv.org/abs/2201.03679v1
- Date: Mon, 10 Jan 2022 22:33:07 GMT
- Title: Informal Persian Universal Dependency Treebank
- Authors: Roya Kabiri, Simin Karimi, Mihai Surdeanu
- Abstract summary: This paper presents the phonological, morphological, and syntactic distinctions between formal and informal Persian.
We develop the open-source Informal Persian Universal Dependency Treebank, a new treebank annotated within the Universal Dependencies scheme.
- Score: 19.359203472636835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the phonological, morphological, and syntactic
distinctions between formal and informal Persian, showing that these two
variants have fundamental differences that cannot be attributed solely to
pronunciation discrepancies. Given that informal Persian exhibits particular
characteristics, any computational model trained on formal Persian is unlikely
to transfer well to informal Persian, necessitating the creation of dedicated
treebanks for this variety. We thus detail the development of the open-source
Informal Persian Universal Dependency Treebank, a new treebank annotated within
the Universal Dependencies scheme. We then investigate the parsing of informal
Persian by training two dependency parsers on existing formal treebanks and
evaluating them on out-of-domain data, i.e. the development set of our informal
treebank. Our results show that parsers experience a substantial performance
drop when we move across the two domains, as they face more unknown tokens and
structures and fail to generalize well. Furthermore, the dependency relations
whose performance deteriorates the most represent the unique properties of the
informal variant. The ultimate, broader goal of this study is to provide a
stepping stone toward revealing the significance of informal language
variants, which have been widely overlooked in natural language processing
tools across languages.
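As a concrete illustration of this evaluation protocol, the sketch below computes the standard attachment scores (UAS/LAS) for a parser's output against gold annotations. It is a minimal sketch, not the authors' evaluation code; it assumes the third-party `conllu` package, and the file names in the usage comment are hypothetical.

```python
# Minimal UAS/LAS scorer over two parallel CoNLL-U strings (gold vs. predicted).
# Assumes the `conllu` package (pip install conllu); not the authors' code.
from conllu import parse

def attachment_scores(gold_conllu: str, pred_conllu: str):
    correct_head = correct_labeled = total = 0
    for gold_sent, pred_sent in zip(parse(gold_conllu), parse(pred_conllu)):
        for gold_tok, pred_tok in zip(gold_sent, pred_sent):
            if not isinstance(gold_tok["id"], int):
                continue  # skip multiword-token and empty-node lines
            total += 1
            if gold_tok["head"] == pred_tok["head"]:
                correct_head += 1  # unlabeled attachment: head matches
                if gold_tok["deprel"] == pred_tok["deprel"]:
                    correct_labeled += 1  # labeled attachment: relation matches too
    return correct_head / total, correct_labeled / total  # (UAS, LAS)

# Hypothetical usage against the informal treebank's development split:
# uas, las = attachment_scores(open("fa-informal-dev.conllu").read(),
#                              open("parser-predictions.conllu").read())
```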
Related papers
- Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning [84.94709351266557]
We focus on the trustworthiness of language models with respect to retrieval augmentation.
We hold that retrieval-augmented language models have the inherent capability of supplying responses according to both contextual and parametric knowledge.
Inspired by aligning language models with human preferences, we take the first step towards aligning retrieval-augmented language models to a state where they respond relying solely on external evidence.
arXiv Detail & Related papers (2024-10-22T09:25:21Z)
- MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present the first multi-dialect Bavarian treebank (MaiBaam), manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD).
We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z)
- Principal Component Analysis as a Sanity Check for Bayesian Phylolinguistic Reconstruction [3.652806821280741]
The tree model assumes that languages descended from a common ancestor and underwent modifications over time.
This assumption can be violated to different extents due to contact and other factors.
We propose a simple sanity check: projecting a reconstructed tree onto a space generated by principal component analysis.
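As a rough, hypothetical illustration of this check (synthetic data, scikit-learn's PCA; not the authors' pipeline), one can project binary trait vectors for a set of languages into two principal components and ask whether the clades implied by the reconstructed tree show up as clusters:

```python
# Sanity-check sketch: project languages into PCA space and compare the
# resulting clusters with the clades of a reconstructed tree. The trait
# matrix here is random and purely illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(12, 200)).astype(float)  # 12 languages x 200 binary traits

coords = PCA(n_components=2).fit_transform(X)  # 2-D projection per language
for lang_idx, (pc1, pc2) in enumerate(coords):
    print(f"language {lang_idx}: PC1={pc1:+.2f}, PC2={pc2:+.2f}")
# Languages grouped in a clade by the tree should land near each other here;
# strong mismatches hint at contact or other violations of the tree assumption.
```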
arXiv Detail & Related papers (2024-02-29T05:47:34Z)
- Retrieval-based Disentangled Representation Learning with Natural Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish intrinsic dimensions that capture characteristics within the data through their natural language counterparts, thus achieving disentanglement.
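A loose, hypothetical illustration of the vocabulary-space idea (toy vocabulary, hand-set weights; not the authors' VDR model): both encoders emit weights over the same vocabulary, so each dimension of the representation is a readable token.

```python
# Toy bi-encoder scoring in a shared vocabulary space. Every dimension is a
# vocabulary item, so large entries of a representation name interpretable factors.
import numpy as np

vocab = ["cat", "dog", "grass", "indoor", "outdoor"]  # toy shared vocabulary

def encode_text(text: str) -> np.ndarray:
    # Stand-in text encoder: binary bag-of-words over the shared vocabulary.
    words = set(text.lower().split())
    return np.array([1.0 if w in words else 0.0 for w in vocab])

# Stand-in for a learned data encoder's output over the same vocabulary,
# e.g. for an image of a cat on grass outdoors.
image_repr = np.array([0.9, 0.0, 0.7, 0.0, 0.8])

caption = encode_text("a cat outdoor on the grass")
print("match score:", float(image_repr @ caption))  # overlap in vocab space
```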
arXiv Detail & Related papers (2022-12-15T10:20:42Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- LyS_ACoruña at SemEval-2022 Task 10: Repurposing Off-the-Shelf Tools for Sentiment Analysis as Semantic Dependency Parsing [10.355938901584567]
This paper addresses the problem of structured sentiment analysis using a bi-affine semantic dependency parser (sketched below).
For the monolingual setup, we considered: (i) training on a single treebank, and (ii) relaxing the setup by training on treebanks coming from different languages.
For the zero-shot setup and a given target treebank, we relied on: (i) a word-level translation of available treebanks in other languages to get noisy, likely-ungrammatical, but annotated data.
In the post-evaluation phase, we also trained cross-lingual models that simply merged all the English treebanks.
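A minimal numpy sketch of the bi-affine arc-scoring idea behind such parsers (toy dimensions and random weights, not the authors' settings): every (head, dependent) pair is scored with a bilinear term plus a head bias.

```python
# Biaffine arc scorer: scores[i, j] = h_i^T U h_j + b^T h_i, i.e. the score
# of token i as the head of token j. Sizes and weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                               # 5 tokens, hidden size 8
H = rng.normal(size=(n, d))               # token vectors from some encoder
U = rng.normal(size=(d, d))               # bilinear weight matrix
b = rng.normal(size=(d,))                 # head-side bias weights

scores = H @ U @ H.T + (H @ b)[:, None]   # pairwise head-dependent scores
heads = scores.argmax(axis=0)             # greedy head choice per dependent
print(heads)
```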
arXiv Detail & Related papers (2022-04-27T10:21:28Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- The Persian Dependency Treebank Made Universal [3.4410212782758047]
This treebank contains 29,107 sentences.
Our data is more compatible with Universal Dependencies than the Persian Universal Dependency Treebank (Seraji et al., 2016).
Our delexicalized Persian-to-English transfer experiments show that a parsing model trained on our data is 2% more accurate than that of Seraji et al.
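To make the delexicalized setup concrete, here is a small sketch, assuming the third-party `conllu` package, of the masking step such transfer experiments rely on: word forms and lemmas are hidden so the parser sees only POS tags, morphology, and dependencies, which carry over across languages.

```python
# Delexicalization sketch: strip lexical material from a CoNLL-U file so a
# parser trained on it transfers across languages. Illustrative helper only.
from conllu import parse

def delexicalize(conllu_text: str) -> str:
    sentences = parse(conllu_text)
    for sentence in sentences:
        for token in sentence:
            token["form"] = "_"   # hide the surface word
            token["lemma"] = "_"  # hide the lemma
    return "".join(sentence.serialize() for sentence in sentences)
```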
arXiv Detail & Related papers (2020-09-21T22:34:13Z)
- I3rab: A New Arabic Dependency Treebank Based on Arabic Grammatical Theory [0.0]
This paper constructs a new Arabic dependency treebank based on traditional Arabic grammatical theory and the characteristics of the Arabic language.
The proposed Arabic dependency treebank, called I3rab, contrasts with existing Arabic dependency treebanks in two main respects.
arXiv Detail & Related papers (2020-07-11T13:34:44Z)
- Discrete Variational Attention Models for Language Generation [51.88612022940496]
We propose a discrete variational attention model with a categorical distribution over the attention mechanism, owing to the discrete nature of language.
Thanks to this discreteness, training our proposed approach does not suffer from posterior collapse.
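A hedged numpy sketch of the core idea, hard categorical attention sampled with the Gumbel-max trick (illustrative shapes and values; the actual model would use a differentiable relaxation during training):

```python
# Categorical "attention": sample one source position from a categorical
# distribution over attention logits instead of taking a soft average.
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=6)                    # attention logits over 6 positions
values = rng.normal(size=(6, 4))               # value vectors of dimension 4

gumbel_noise = rng.gumbel(size=logits.shape)   # Gumbel(0, 1) noise
choice = int(np.argmax(logits + gumbel_noise)) # one sampled position
context = values[choice]                       # discrete attention readout
print(choice, context)
```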
arXiv Detail & Related papers (2020-04-21T05:49:04Z)