Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages
- URL: http://arxiv.org/abs/2304.10158v1
- Date: Thu, 20 Apr 2023 08:32:34 GMT
- Title: Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages
- Authors: Verena Blaschke, Hinrich Schütze, Barbara Plank
- Abstract summary: We finetune pretrained language models (PLMs) on seven languages from three different families.
We analyze their zero-shot performance on closely related, non-standardized varieties.
Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data is the strongest predictor for model performance on target data.
- Score: 18.210880703295253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the challenges with finetuning pretrained language models (PLMs) is
that their tokenizer is optimized for the language(s) it was pretrained on, but
brittle when it comes to previously unseen variations in the data. This can for
instance be observed when finetuning PLMs on one language and evaluating them
on data in a closely related language variety with no standardized orthography.
Despite the high linguistic similarity, tokenization no longer corresponds to
meaningful representations of the target data, leading to low performance in,
e.g., part-of-speech tagging.
In this work, we finetune PLMs on seven languages from three different
families and analyze their zero-shot performance on closely related,
non-standardized varieties. We consider different measures for the divergence
in the tokenization of the source and target data, and the way they can be
adjusted by manipulating the tokenization during the finetuning step. Overall,
we find that the similarity between the percentage of words that get split into
subwords in the source and target data (the split word ratio difference) is the
strongest predictor for model performance on target data.
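To make the "split word ratio difference" concrete, here is a minimal sketch (not the authors' code) of how the statistic described in the abstract can be computed: the percentage of whitespace-separated words that a PLM's subword tokenizer splits into more than one piece, compared between source and target data. The model name and file paths are illustrative assumptions.

```python
# Sketch: split word ratio difference between source and target data.
# Assumes the Hugging Face `transformers` library; the model and file names
# below are placeholders, not the paper's actual setup.
from transformers import AutoTokenizer

def split_word_ratio(words, tokenizer):
    """Fraction of words that the tokenizer splits into two or more subword tokens."""
    if not words:
        return 0.0
    split = sum(1 for w in words if len(tokenizer.tokenize(w)) > 1)
    return split / len(words)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # assumed PLM

# Assumed plain-text files: standardized source data vs. non-standardized target variety.
source_words = open("source.txt", encoding="utf-8").read().split()
target_words = open("target.txt", encoding="utf-8").read().split()

ratio_src = split_word_ratio(source_words, tokenizer)
ratio_tgt = split_word_ratio(target_words, tokenizer)

# The predictor named in the abstract: how far apart the two ratios are.
split_word_ratio_difference = abs(ratio_src - ratio_tgt)
print(f"source {ratio_src:.2%}, target {ratio_tgt:.2%}, difference {split_word_ratio_difference:.2%}")
```

A lower difference means the tokenizer fragments source and target text to a similar degree, which the abstract reports as the strongest predictor of zero-shot performance on the target variety.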
Related papers
- The Empirical Impact of Data Sanitization on Language Models [1.1359551336076306]
This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks.
Our results suggest that for some tasks such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%.
For tasks such as comprehension Q&A, however, performance drops by more than 25% on redacted queries compared to the original.
arXiv Detail & Related papers (2024-11-08T21:22:37Z) - Unlikelihood Tuning on Negative Samples Amazingly Improves Zero-Shot
Translation [79.96416609433724]
Zero-shot translation (ZST) aims to translate between language pairs unseen in the training data.
The common practice for guiding the zero-shot language mapping during inference is to deliberately insert source and target language IDs (see the sketch after this list).
Recent studies have shown that these language IDs sometimes fail to steer the ZST task, causing the off-target problem.
arXiv Detail & Related papers (2023-09-28T17:02:36Z) - Translate & Fill: Improving Zero-Shot Multilingual Semantic Parsing with
Synthetic Data [2.225882303328135]
We propose a novel Translate-and-Fill (TaF) method to produce silver training data for a multilingual semantic parsing task.
Experimental results on three multilingual semantic parsing datasets show that data augmentation with TaF reaches accuracies competitive with similar systems.
arXiv Detail & Related papers (2021-09-09T14:51:11Z) - Fake it Till You Make it: Self-Supervised Semantic Shifts for
Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z) - Cross-Lingual Named Entity Recognition Using Parallel Corpus: A New
Approach Using XLM-RoBERTa Alignment [5.747195707763152]
We build an entity alignment model on top of XLM-RoBERTa to project the entities detected on the English part of the parallel data to the target language sentences.
Unlike translation-based methods, this approach benefits from the natural fluency and nuances of original target-language text.
We evaluate the proposed approach on benchmark data sets for 4 target languages and obtain F1 scores competitive with recent state-of-the-art models.
arXiv Detail & Related papers (2021-01-26T22:19:52Z) - Cross-lingual Spoken Language Understanding with Regularized
Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z) - Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text
Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z) - Parameter Space Factorization for Zero-Shot Learning across Tasks and
Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z) - On the Importance of Word Order Information in Cross-lingual Sequence
Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language may fail to handle target languages.
We investigate whether making models insensitive to the source-language word order can improve adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.