Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation
- URL: http://arxiv.org/abs/2201.08070v1
- Date: Thu, 20 Jan 2022 09:10:08 GMT
- Title: Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation
- Authors: Zhuoyuan Mao, Chenhui Chu, Sadao Kurohashi
- Abstract summary: We propose Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English.
JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is based on phrase-structure masking and reordering tasks.
- Score: 31.225252462128626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource neural machine translation (NMT): Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English. JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is based on phrase-structure masking and reordering tasks. Experiments on the ASPEC Japanese--English and Japanese--Chinese, Wikipedia Japanese--Chinese, and News English--Korean corpora demonstrate that JASS and ENSS outperform MASS and other existing language-agnostic pre-training methods by up to +2.9 BLEU points for the Japanese--English tasks, up to +7.0 BLEU points for the Japanese--Chinese tasks, and up to +1.3 BLEU points for the English--Korean tasks. Empirical analysis of the individual subtasks of JASS and ENSS reveals their complementary nature. Adequacy evaluation using LASER, human evaluation, and case studies reveals that our proposed methods significantly outperform pre-training methods without injected linguistic knowledge, and that they have a larger positive impact on adequacy than on fluency. We release our code here:
https://github.com/Mao-KU/JASS/tree/master/linguistically-driven-pretraining.
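As a rough sketch of the bunsetsu-level objectives (not the authors' released implementation; the function names, the 50% masking ratio, and the toy segmentation are illustrative assumptions), the Python snippet below builds MASS-style masking and reordering training pairs from a sentence that has already been segmented into bunsetsu, e.g., with a bunsetsu-aware parser such as KNP:

import random

MASK = "[MASK]"

def bunsetsu_mass_pair(bunsetsu, mask_ratio=0.5, seed=0):
    # Mask a contiguous span of bunsetsu on the encoder side; the decoder
    # is trained to reconstruct the masked span (a MASS-style objective
    # applied at the bunsetsu level rather than the subword level).
    rng = random.Random(seed)
    n = len(bunsetsu)
    span = max(1, int(n * mask_ratio))
    start = rng.randint(0, n - span)
    source = bunsetsu[:start] + [MASK] * span + bunsetsu[start + span:]
    target = bunsetsu[start:start + span]
    return " ".join(source), " ".join(target)

def bunsetsu_reorder_pair(bunsetsu, seed=0):
    # Shuffle the bunsetsu on the encoder side; the decoder is trained to
    # restore the original order (a reordering objective).
    rng = random.Random(seed)
    shuffled = list(bunsetsu)
    rng.shuffle(shuffled)
    return " ".join(shuffled), " ".join(bunsetsu)

# Toy, pre-segmented sentence ("I read a book at the library yesterday").
sent = ["私は", "昨日", "図書館で", "本を", "読んだ"]
print(bunsetsu_mass_pair(sent))     # masked source -> masked span as target
print(bunsetsu_reorder_pair(sent))  # shuffled source -> original order as target

The ENSS objectives would be analogous under the same assumptions, with English phrase-structure constituents (e.g., obtained from a constituency parser) taking the place of bunsetsu.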
Related papers
- JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation [63.83457341009046]
JMMMU (Japanese MMMU) is the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks based on the Japanese cultural context.
Using the CA subset, we observe a performance drop in many LMMs when evaluated in Japanese, which is purely attributable to language variation.
By combining both subsets, we identify that some LMMs perform well on the CA subset but not on the CS subset, exposing a shallow understanding of the Japanese language that lacks cultural depth.
arXiv Detail & Related papers (2024-10-22T17:59:56Z)
- Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning [47.75550640881761]
We explore cross-lingual generalization in instruction tuning by applying it to non-English tasks.
We design cross-lingual templates to mitigate discrepancies in language and instruction-format of the template between training and inference.
Our experiments reveal consistent improvements through cross-lingual generalization in both English and Korean.
arXiv Detail & Related papers (2024-06-13T04:10:17Z)
- Multilingual Pretraining and Instruction Tuning Improve Cross-Lingual Knowledge Alignment, But Only Shallowly [53.04368883943773]
Two approaches are proposed to address this, i.e., multilingual pretraining and multilingual instruction tuning.
We propose CLiKA to assess the cross-lingual knowledge alignment of LLMs in the Performance, Consistency and Conductivity levels.
Results show that while both multilingual pretraining and instruction tuning are beneficial for cross-lingual knowledge alignment, the training strategy needs to be carefully designed.
arXiv Detail & Related papers (2024-04-06T15:25:06Z)
- Exploration of Language Dependency for Japanese Self-Supervised Speech Representation Models [18.22157315310462]
Self-supervised learning (SSL) has been dramatically successful not only in monolingual but also in cross-lingual settings.
In this paper, we investigate how effective a cross-lingual model is in comparison with a monolingual model.
We examine how much unlabeled data collected in Japanese is needed to achieve performance comparable to a cross-lingual model pre-trained with tens of thousands of hours of English and/or multilingual data.
arXiv Detail & Related papers (2023-05-09T06:28:10Z)
- JCSE: Contrastive Learning of Japanese Sentence Embeddings and Its Applications [4.482886054198201]
JCSE creates training data by generating sentences and synthesizing them with sentences available in a target domain.
It is then used to generate contradictory sentence pairs that are used in contrastive learning for adapting a Japanese language model to a specific task in the target domain.
arXiv Detail & Related papers (2023-01-19T17:41:46Z)
- Compositional Evaluation on Japanese Textual Entailment and Similarity [20.864082353441685]
Natural Language Inference (NLI) and Semantic Textual Similarity (STS) are widely used benchmark tasks for compositional evaluation of pre-trained language models.
Despite growing interest in linguistic universals, most NLI/STS studies have focused almost exclusively on English.
There are no available multilingual NLI/STS datasets in Japanese, which is typologically different from English.
arXiv Detail & Related papers (2022-08-09T15:10:56Z)
- Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for cross-lingual sequence labeling (xSL), named Cross-lingual Language Informative Span Masking (CLISM), to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage the consistency between representations of input parallel sequences.
arXiv Detail & Related papers (2022-04-11T15:55:20Z)
- JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation [27.364702152624034]
JASS is joint BMASS (Bunsetsu MASS) and BRSS (Bunsetsu Reordering Sequence to Sequence) pre-training.
We show for the first time that joint MASS and JASS pre-training gives results that significantly surpass the individual methods.
We will release our code, pre-trained models and bunsetsu annotated data as resources for researchers to use in their own NLP tasks.
arXiv Detail & Related papers (2020-05-07T09:53:25Z)
- Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve word order information for natural language processing tasks, generating fixed position indices for input sequences.
Due to word order divergences across languages, modeling cross-lingual positional relationships might help self-attention networks (SANs) tackle this problem.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)
- Pre-training via Leveraging Assisting Languages and Data Selection for Neural Machine Translation [49.51278300110449]
We propose to exploit monolingual corpora of other languages to compensate for the scarcity of monolingual corpora for the languages of interest.
A case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora.
arXiv Detail & Related papers (2020-01-23T02:47:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.