Informal Persian Universal Dependency Treebank
- URL: http://arxiv.org/abs/2201.03679v1
- Date: Mon, 10 Jan 2022 22:33:07 GMT
- Title: Informal Persian Universal Dependency Treebank
- Authors: Roya Kabiri, Simin Karimi, Mihai Surdeanu
- Abstract summary: This paper presents the phonological, morphological, and syntactic distinctions between formal and informal Persian.
We develop the open-source Informal Persian Universal Dependency Treebank, a new treebank annotated within the Universal Dependencies scheme.
- Score: 19.359203472636835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the phonological, morphological, and syntactic
distinctions between formal and informal Persian, showing that these two
variants have fundamental differences that cannot be attributed solely to
pronunciation discrepancies. Given that informal Persian exhibits particular
characteristics, any computational model trained on formal Persian is unlikely
to transfer well to informal Persian, necessitating the creation of dedicated
treebanks for this variety. We thus detail the development of the open-source
Informal Persian Universal Dependency Treebank, a new treebank annotated within
the Universal Dependencies scheme. We then investigate the parsing of informal
Persian by training two dependency parsers on existing formal treebanks and
evaluating them on out-of-domain data, i.e. the development set of our informal
treebank. Our results show that parsers experience a substantial performance
drop when we move across the two domains, as they face more unknown tokens and
structures and fail to generalize well. Furthermore, the dependency relations
whose performance deteriorates the most represent the unique properties of the
informal variant. The ultimate, broader-impact goal of this study is to provide a
stepping stone toward revealing the significance of informal language varieties,
which have been widely overlooked in natural language processing tools across
languages.
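The abstract describes evaluating parsers trained on formal treebanks against out-of-domain informal data. A minimal sketch of how such an evaluation is typically scored, using the standard UAS/LAS attachment metrics over CoNLL-U output (the two-token fragments and all values below are invented illustrations, not the paper's data):

```python
# Hypothetical sketch: computing UAS/LAS, the standard dependency-parsing
# metrics, by comparing gold vs. predicted CoNLL-U annotations.

def read_conllu(text):
    """Return (head, deprel) pairs for each token line in a CoNLL-U string."""
    rows = []
    for line in text.strip().splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip sentence comments and blank separators
        cols = line.split("\t")
        rows.append((cols[6], cols[7]))  # HEAD and DEPREL columns
    return rows

def uas_las(gold, pred):
    """Unlabeled/labeled attachment scores over aligned token lists."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

# Invented two-token example: the predicted parse gets both heads right
# but mislabels the root relation.
gold = ("1\tketab\tketab\tNOUN\t_\t_\t2\tobj\t_\t_\n"
        "2\tkharid\tkharidan\tVERB\t_\t_\t0\troot\t_\t_")
pred = ("1\tketab\tketab\tNOUN\t_\t_\t2\tobj\t_\t_\n"
        "2\tkharid\tkharidan\tVERB\t_\t_\t0\tdep\t_\t_")

u, l = uas_las(read_conllu(gold), read_conllu(pred))
print(u, l)  # 1.0 0.5 -- heads all correct, one label wrong
```

The domain gap the paper reports would show up here as a drop in both scores when the gold file comes from the informal development set rather than a held-out formal section.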
Related papers
- MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD)
We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z)
- Dependency Annotation of Ottoman Turkish with Multilingual BERT [0.0]
This study introduces a pretrained large language model-based annotation methodology for the first dependency treebank in Ottoman Turkish.
The resulting treebank will facilitate automated analysis of Ottoman Turkish documents, unlocking the linguistic richness embedded in this historical heritage.
arXiv Detail & Related papers (2024-02-22T17:58:50Z) - Retrieval-based Disentangled Representation Learning with Natural
Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish the intrinsic dimensions that capture characteristics within data through their natural language counterparts, thus achieving disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z) - Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z) - LyS_ACoru\~na at SemEval-2022 Task 10: Repurposing Off-the-Shelf Tools
for Sentiment Analysis as Semantic Dependency Parsing [10.355938901584567]
This paper addresses the problem of structured sentiment analysis using a bi-affine semantic dependency parser.
For the monolingual setup, we considered: (i) training on a single treebank, and (ii) relaxing the setup by training on treebanks coming from different languages.
For the zero-shot setup and a given target treebank, we relied on: (i) a word-level translation of available treebanks in other languages to obtain noisy, likely ungrammatical, but annotated data.
In the post-evaluation phase, we also trained cross-lingual models that simply merged all the English treebanks.
arXiv Detail & Related papers (2022-04-27T10:21:28Z) - A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - The Persian Dependency Treebank Made Universal [3.4410212782758047]
This treebank contains 29,107 sentences.
Our data is more compatible with Universal Dependencies than the Persian Universal Dependency Treebank (Seraji et al., 2016)
Our delexicalized Persian-to-English transfer experiments show that a parsing model trained on our data is 2% more accurate than that of Seraji et al.
arXiv Detail & Related papers (2020-09-21T22:34:13Z) - I3rab: A New Arabic Dependency Treebank Based on Arabic Grammatical
Theory [0.0]
This paper constructs a new Arabic dependency treebank based on traditional Arabic grammatical theory and the characteristics of the Arabic language.
The proposed Arabic dependency treebank, called I3rab, contrasts with existing Arabic dependency treebanks in two main concepts.
arXiv Detail & Related papers (2020-07-11T13:34:44Z) - Discrete Variational Attention Models for Language Generation [51.88612022940496]
We propose a discrete variational attention model with categorical distribution over the attention mechanism owing to the discrete nature in languages.
Thanks to the property of discreteness, the training of our proposed approach does not suffer from posterior collapse.
arXiv Detail & Related papers (2020-04-21T05:49:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.