Genre as Weak Supervision for Cross-lingual Dependency Parsing
- URL: http://arxiv.org/abs/2109.04733v1
- Date: Fri, 10 Sep 2021 08:24:54 GMT
- Title: Genre as Weak Supervision for Cross-lingual Dependency Parsing
- Authors: Max Müller-Eberstein, Rob van der Goot and Barbara Plank
- Abstract summary: Dataset genre labels are frequently available, yet remain largely unexplored in cross-lingual setups.
We project treebank-level genre information to the finer-grained sentence level.
For 12 low-resource language treebanks, six of which are test-only, our genre-specific methods significantly outperform competitive baselines.
- Score: 18.755176247223616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown that monolingual masked language models learn to
represent data-driven notions of language variation which can be used for
domain-targeted training data selection. Dataset genre labels are already
frequently available, yet remain largely unexplored in cross-lingual setups. We
harness this genre metadata as a weak supervision signal for targeted data
selection in zero-shot dependency parsing. Specifically, we project
treebank-level genre information to the finer-grained sentence level, with the
goal of amplifying information implicitly stored in unsupervised contextualized
representations. We demonstrate that genre is recoverable from multilingual
contextual embeddings and that it provides an effective signal for training
data selection in cross-lingual, zero-shot scenarios. For 12 low-resource
language treebanks, six of which are test-only, our genre-specific methods
significantly outperform competitive baselines as well as recent
embedding-based methods for data selection. Moreover, genre-based data
selection provides new state-of-the-art results for three of these target
languages.
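The abstract compresses a three-step pipeline: propagate each treebank's genre label down to its sentences, probe for genre in multilingual sentence embeddings, and keep the source sentences that look like the target genre. The rough sketch below makes those steps concrete; it is not the authors' code, and the model choice (mBERT with mean pooling), the logistic-regression probe, and the toy treebanks are illustrative assumptions.

```python
# Minimal sketch: sentence-level genre projection, a genre probe on
# multilingual embeddings, and genre-matched data selection.
# Assumptions: mBERT + mean pooling, a logistic-regression probe, and
# toy treebanks standing in for UD treebanks with genre metadata.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentences):
    """Mean-pooled mBERT sentence embeddings."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state        # (batch, tokens, 768)
    mask = enc["attention_mask"].unsqueeze(-1)           # (batch, tokens, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy source treebanks; each carries a single treebank-level genre label.
source_treebanks = [
    {"genre": "wiki",   "sentences": ["Paris is the capital of France.",
                                      "The treaty was signed in 1648."]},
    {"genre": "social", "sentences": ["omg this parser is so good!!",
                                      "cant wait for the weekend :)"]},
]

# 1) Project the treebank-level genre label down to every sentence.
sents, genres = [], []
for tb in source_treebanks:
    for sent in tb["sentences"]:
        sents.append(sent)
        genres.append(tb["genre"])

# 2) Probe: can a simple classifier recover genre from the embeddings?
X = embed(sents)
probe = LogisticRegression(max_iter=1000).fit(X, genres)

# 3) Select the k source sentences most confidently matching the target
#    treebank's genre; a parser would then be trained on this subset.
target_genre = "wiki"  # known from the target treebank's metadata
col = list(probe.classes_).index(target_genre)
top_k = np.argsort(probe.predict_proba(X)[:, col])[::-1][:2]
selected = [sents[i] for i in top_k]
print(selected)
```

Only the selection step is shown; in the paper, the selected data feeds a zero-shot dependency parser and the genre-based selection is compared against embedding-based alternatives.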
Related papers
- Universal Cross-Lingual Text Classification [0.3958317527488535]
This research proposes a novel perspective on Universal Cross-Lingual Text Classification.
Our approach involves blending supervised data from different languages during training to create a universal model.
The primary goal is to enhance label and language coverage, aiming for a label set that represents a union of labels from various languages.
arXiv Detail & Related papers (2024-06-16T17:58:29Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Model and Data Transfer for Cross-Lingual Sequence Labelling in Zero-Resource Settings [10.871587311621974]
We experimentally demonstrate that high-capacity multilingual language models applied in a zero-shot setting consistently outperform data-based cross-lingual transfer approaches.
A detailed analysis of our results suggests that this might be due to important differences in language use.
Our results also indicate that data-based cross-lingual transfer approaches remain a competitive option when high-capacity multilingual language models are not available.
arXiv Detail & Related papers (2022-10-23T05:37:35Z)
- AdvPicker: Effectively Leveraging Unlabeled Data via Adversarial Discriminator for Cross-Lingual NER [2.739898536581301]
We design an adversarial learning framework in which an encoder learns entity domain knowledge from labeled source-language data.
We show that the proposed method benefits strongly from this data selection process and outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-04T07:17:18Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
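A minimal stand-in sketch of this invariant/specific decomposition appears after this related-papers list.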
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
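The feature-decomposition idea from the "Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model" entry above can be sketched compactly. That paper estimates mutual information; as a simpler stand-in, this sketch uses an adversarial domain classifier behind a gradient-reversal layer, so it illustrates the invariant/specific split but not the paper's actual estimator. All layer sizes and names are assumptions.

```python
# Sketch: split a pretrained representation into domain-invariant and
# domain-specific parts. Stand-in for the paper's mutual-information
# objective: an adversarial domain classifier with gradient reversal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()
    @staticmethod
    def backward(ctx, grad):
        return -grad  # flip gradients so the encoder fools the classifier

class Decomposer(nn.Module):
    def __init__(self, hidden=768, part=256, n_domains=2):
        super().__init__()
        self.to_inv = nn.Linear(hidden, part)          # domain-invariant part
        self.to_spec = nn.Linear(hidden, part)         # domain-specific part
        self.dom_on_inv = nn.Linear(part, n_domains)   # adversarial head
        self.dom_on_spec = nn.Linear(part, n_domains)  # predictive head

    def forward(self, h, domain):
        z_inv, z_spec = self.to_inv(h), self.to_spec(h)
        ce = nn.functional.cross_entropy
        # z_inv should NOT predict the domain (reversed gradients),
        # while z_spec SHOULD predict it.
        loss = ce(self.dom_on_inv(GradReverse.apply(z_inv)), domain) \
             + ce(self.dom_on_spec(z_spec), domain)
        return z_inv, z_spec, loss

# Toy usage with random stand-ins for cross-lingual encoder outputs:
h = torch.randn(8, 768)
domain = torch.randint(0, 2, (8,))
model = Decomposer()
_, _, loss = model(h, domain)
loss.backward()  # in practice, add the downstream task loss before stepping
```

The reversed gradient pushes the invariant part to carry no domain signal while the specific head keeps domain information separable; the paper's mutual-information objective plays an analogous role.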