Cross-Lingual Adaptation Using Universal Dependencies
- URL: http://arxiv.org/abs/2003.10816v2
- Date: Sat, 28 Mar 2020 17:09:11 GMT
- Title: Cross-Lingual Adaptation Using Universal Dependencies
- Authors: Nasrin Taghizadeh and Heshaam Faili
- Abstract summary: We show that models trained using UD parse trees for complex NLP tasks can characterize very different languages.
Based on UD parse trees, we develop several models using tree kernels and show that these models trained on the English dataset can correctly classify data of other languages.
- Score: 1.027974860479791
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe a cross-lingual adaptation method based on syntactic parse trees
obtained from the Universal Dependencies (UD), which are consistent across
languages, to develop classifiers in low-resource languages. The idea of UD
parsing is to capture similarities as well as idiosyncrasies among
typologically different languages. In this paper, we show that models trained
using UD parse trees for complex NLP tasks can characterize very different
languages. We study two tasks of paraphrase identification and semantic
relation extraction as case studies. Based on UD parse trees, we develop
several models using tree kernels and show that these models trained on the
English dataset can correctly classify data of other languages, e.g., French,
Farsi, and Arabic. The proposed approach opens up avenues for exploiting UD
parsing in solving similar cross-lingual tasks, which is very useful for
languages for which no labeled data is available.
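To make the tree-kernel idea in the abstract concrete, here is a minimal sketch of a subtree kernel over dependency trees labeled only with UD relation names. It is an illustration under stated assumptions, not the kernels used in the paper: the tuple-based tree encoding, the function names `subtree_counts` and `subtree_kernel`, and the two toy parses are all hypothetical. Because the features are UD relation labels rather than words, the same function can compare trees from different languages.

```python
from collections import Counter

# Assumed encoding (not from the paper): a node is (deprel, [child nodes]).

def subtree_counts(tree):
    """Count every complete subtree, keyed by a canonical bracketed string."""
    counts = Counter()

    def canon(node):
        label, children = node
        # Sort child strings so that word order does not affect the key.
        key = "(" + label + "".join(sorted(canon(c) for c in children)) + ")"
        counts[key] += 1
        return key

    canon(tree)
    return counts


def subtree_kernel(t1, t2):
    """Dot product of the two subtree-count vectors (a valid kernel)."""
    c1, c2 = subtree_counts(t1), subtree_counts(t2)
    return sum(n * c2[k] for k, n in c1.items())


# Toy parses (hypothetical): "She reads books" / "Elle lit des livres".
en = ("root", [("nsubj", []), ("obj", [])])
fr = ("root", [("nsubj", []), ("obj", [("det", [])])])

print(subtree_kernel(en, fr))  # only the bare nsubj subtree is shared
print(subtree_kernel(en, en))
```

A kernel function of this kind could be plugged into any kernel-based classifier that accepts a precomputed Gram matrix. Matching only complete subtrees is deliberately strict, and kernels that also match partial subtrees would score the two toy sentences as more similar.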
Related papers
- Multilingual Nonce Dependency Treebanks: Understanding how Language Models represent and process syntactic structure [15.564927804136852]
SPUD (Semantically Perturbed Universal Dependencies) is a framework for creating nonce treebanks for the Universal Dependencies (UD) corpora.
We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use cases of SPUD treebanks.
arXiv Detail & Related papers (2023-11-13T17:36:58Z)
- Assessment of Pre-Trained Models Across Languages and Grammars [7.466159270333272]
We aim to recover constituent and dependency structures by casting parsing as sequence labeling.
Our results show that pre-trained word vectors do not favor constituency representations of syntax over dependencies.
The occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.
arXiv Detail & Related papers (2023-09-20T09:23:36Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Multilingual Syntax-aware Language Modeling through Dependency Tree Conversion [12.758523394180695]
We study the effect on neural language model (LM) performance across nine conversion methods and five languages.
On average, the performance of our best model represents a 19% increase in accuracy over the worst choice across all languages.
Our experiments highlight the importance of choosing the right tree formalism, and provide insights into making an informed decision.
arXiv Detail & Related papers (2022-04-19T03:56:28Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes [0.2538209532048867]
A novel Orthogonal Structural Probe (Limisiewicz and Mareček, 2021) allows us to answer this question for specific linguistic features.
We evaluate syntactic (UD) and lexical (WordNet) structural information encoded in mBERT's contextual representations for nine diverse languages.
We successfully apply our findings to zero-shot and few-shot cross-lingual parsing.
arXiv Detail & Related papers (2021-09-10T15:03:11Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words that have long-range dependencies or that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- A Hybrid Approach to Dependency Parsing: Combining Rules and Morphology with Deep Learning [0.0]
We propose two approaches to dependency parsing, especially for languages with a restricted amount of training data.
Our first approach combines a state-of-the-art deep learning-based parser with a rule-based approach, and the second one incorporates morphological information into the network.
The proposed methods are developed for Turkish, but can be adapted to other languages as well.
arXiv Detail & Related papers (2020-02-24T08:34:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.