Quantifying syntax similarity with a polynomial representation of
dependency trees
- URL: http://arxiv.org/abs/2211.07005v1
- Date: Sun, 13 Nov 2022 19:55:08 GMT
- Title: Quantifying syntax similarity with a polynomial representation of
dependency trees
- Authors: Pengyu Liu, Tinghao Feng, Rui Liu
- Abstract summary: We introduce a graph that distinguishes tree structures to represent dependency grammar.
The encodes accurate and comprehensive information about the dependency structure and dependency relations of words in a sentence.
- Score: 4.1542266070946745
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce a graph polynomial that distinguishes tree structures to
represent dependency grammar and a measure based on the polynomial
representation to quantify syntax similarity. The polynomial encodes accurate
and comprehensive information about the dependency structure and dependency
relations of words in a sentence. We apply the polynomial-based methods to
analyze sentences in the Parallel Universal Dependencies treebanks.
Specifically, we compare the syntax of sentences and their translations in
different languages, and we perform a syntactic typology study of available
languages in the Parallel Universal Dependencies treebanks. We also demonstrate
and discuss the potential of the methods in measuring syntax diversity of
corpora.
Related papers
- Entropy and type-token ratio in gigaword corpora [0.0]
We investigate entropy and text-token ratio, two metrics for lexical diversities, in six massive linguistic datasets in English, Spanish, and Turkish.
We find a functional relation between entropy and text-token ratio that holds across the corpora under consideration.
Our results contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
arXiv Detail & Related papers (2024-11-15T14:40:59Z) - A Compositional Typed Semantics for Universal Dependencies [26.65442947858347]
We introduce UD Type Calculus, a compositional, principled, and language-independent system of semantic types and logical forms for lexical items.
We explain the essential features of UD Type Calculus, which all involve giving dependency relations denotations just like those of words.
We present results on a large existing corpus of sentences and their logical forms, showing that UD-TC can produce meanings comparable with our baseline.
arXiv Detail & Related papers (2024-03-02T11:58:24Z) - A Joint Matrix Factorization Analysis of Multilingual Representations [28.751144371901958]
We present an analysis tool based on joint matrix factorization for comparing latent representations of multilingual and monolingual models.
We study to what extent and how morphosyntactic features are reflected in the representations learned by multilingual pre-trained models.
arXiv Detail & Related papers (2023-10-24T04:43:45Z) - Assessment of Pre-Trained Models Across Languages and Grammars [7.466159270333272]
We aim to recover constituent and dependency structures by casting parsing as sequence labeling.
Our results show that pre-trained word vectors do not favor constituency representations of syntax over dependencies.
occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.
arXiv Detail & Related papers (2023-09-20T09:23:36Z) - Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore to utilise higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z) - Oracle Linguistic Graphs Complement a Pretrained Transformer Language
Model: A Cross-formalism Comparison [13.31232311913236]
We examine the extent to which, in principle, linguistic graph representations can complement and improve neural language modeling.
We find that, overall, semantic constituency structures are most useful to language modeling performance.
arXiv Detail & Related papers (2021-12-15T04:29:02Z) - Linguistic dependencies and statistical dependence [76.89273585568084]
We use pretrained language models to estimate probabilities of words in context.
We find that maximum-CPMI trees correspond to linguistic dependencies more often than trees extracted from non-contextual PMI estimate.
arXiv Detail & Related papers (2021-04-18T02:43:37Z) - Multilingual Irony Detection with Dependency Syntax and Neural Models [61.32653485523036]
It focuses on the contribution from syntactic knowledge, exploiting linguistic resources where syntax is annotated according to the Universal Dependencies scheme.
The results suggest that fine-grained dependency-based syntactic information is informative for the detection of irony.
arXiv Detail & Related papers (2020-11-11T11:22:05Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z) - Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z) - Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.