Multi-view Subword Regularization
- URL: http://arxiv.org/abs/2103.08490v1
- Date: Mon, 15 Mar 2021 16:07:42 GMT
- Title: Multi-view Subword Regularization
- Authors: Xinyi Wang, Sebastian Ruder, Graham Neubig
- Abstract summary: Multi-view Subword Regularization (MVR) is a method that enforces consistency between predictions made from inputs tokenized by the standard and probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
- Score: 111.04350390045705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual pretrained representations generally rely on subword
segmentation algorithms to create a shared multilingual vocabulary. However,
standard heuristic algorithms often lead to sub-optimal segmentation,
especially for languages with limited amounts of data. In this paper, we take
two major steps towards alleviating this problem. First, we demonstrate
empirically that applying existing subword regularization methods (Kudo, 2018;
Provilkov et al., 2020) during fine-tuning of pre-trained multilingual
representations improves the effectiveness of cross-lingual transfer. Second,
to take full advantage of different possible input segmentations, we propose
Multi-view Subword Regularization (MVR), a method that enforces the consistency
between predictions of using inputs tokenized by the standard and probabilistic
segmentations. Results on the XTREME multilingual benchmark (Hu et al., 2020)
show that MVR brings consistent improvements of up to 2.5 points over using
standard segmentation algorithms.
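The consistency idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact objective, so everything here is an assumption made for illustration: the function names (`softmax`, `kl`, `multi_view_loss`), the `consistency_weight` parameter, and the choice of a symmetric KL divergence as the consistency term. The two logit vectors stand in for a model's outputs on the same input under the standard segmentation and under a sampled (probabilistic) segmentation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(p, label, eps=1e-12):
    # Negative log-probability assigned to the gold label.
    return -math.log(p[label] + eps)

def multi_view_loss(logits_standard, logits_sampled, label, consistency_weight=1.0):
    # Hypothetical objective: average task loss over both views plus a
    # symmetric KL term that penalizes disagreement between the two views.
    p_std = softmax(logits_standard)
    p_smp = softmax(logits_sampled)
    task = 0.5 * (cross_entropy(p_std, label) + cross_entropy(p_smp, label))
    consistency = 0.5 * (kl(p_std, p_smp) + kl(p_smp, p_std))
    return task + consistency_weight * consistency

if __name__ == "__main__":
    # Same input, two segmentations, slightly different predictions.
    print(multi_view_loss([2.0, 0.0, -1.0], [1.5, 0.5, -1.0], label=0))
```

When both views yield identical predictions the consistency term vanishes and only the task loss remains; the more the two segmentations disagree, the larger the penalty.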
Related papers
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [81.83460411131931]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose multilingual adaptive gradient-based subword tokenization to reduce over-segmentation.
arXiv Detail & Related papers (2024-07-11T18:59:21Z)
- VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens excavated via the thesaurus dictionary from the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
- Single Model Ensemble for Subword Regularized Models in Low-Resource Machine Translation [25.04086897886412]
Subword regularizations use multiple subword segmentations during training to improve the robustness of neural machine translation models.
We propose an inference strategy to address the discrepancy between training on multiple segmentations and inference on a single one.
Experimental results show that the proposed strategy improves the performance of models trained with subword regularization.
arXiv Detail & Related papers (2022-03-25T09:25:47Z)
- PARADISE: Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining [19.785343302320918]
We present PARADISE (PARAllel & Denoising Integration in SEquence-to-sequence models).
It extends the conventional denoising objective used to train these models by (i) replacing words in the noised sequence according to a multilingual dictionary, and (ii) predicting the reference translation according to a parallel corpus.
Our experiments on machine translation and cross-lingual natural language inference show average improvements of 2.0 BLEU points and 6.7 accuracy points, respectively, from integrating parallel data into pretraining.
arXiv Detail & Related papers (2021-08-04T07:32:56Z)
- Consistency Regularization for Cross-Lingual Fine-Tuning [61.08704789561351]
We propose to improve cross-lingual fine-tuning with consistency regularization.
Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations.
Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks.
arXiv Detail & Related papers (2021-06-15T15:35:44Z)
- Filtered Inner Product Projection for Crosslingual Embedding Alignment [28.72288652451881]
Filtered Inner Product Projection (FIPP) is a method for mapping embeddings to a common representation space.
FIPP is applicable even when the source and target embeddings are of differing dimensionalities.
We show that our approach outperforms existing methods on the MUSE dataset for various language pairs.
arXiv Detail & Related papers (2020-06-05T19:53:30Z)
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves crosslingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.