Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation
- URL: http://arxiv.org/abs/2201.08070v1
- Date: Thu, 20 Jan 2022 09:10:08 GMT
- Title: Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation
- Authors: Zhuoyuan Mao, Chenhui Chu, Sadao Kurohashi
- Abstract summary: We propose Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English.
JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is based on phrase-structure masking and reordering tasks.
- Score: 31.225252462128626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource neural machine translation (NMT): Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English. JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is based on phrase-structure masking and reordering tasks. Experiments on the ASPEC Japanese--English and Japanese--Chinese, Wikipedia Japanese--Chinese, and News English--Korean corpora demonstrate that JASS and ENSS outperform MASS and other existing language-agnostic pre-training methods by up to +2.9 BLEU points for the Japanese--English tasks, up to +7.0 BLEU points for the Japanese--Chinese tasks, and up to +1.3 BLEU points for the English--Korean tasks. Empirical analysis of the individual subtasks of JASS and ENSS reveals their complementary nature. Adequacy evaluation using LASER, human evaluation, and case studies reveals that our proposed methods significantly outperform pre-training methods without injected linguistic knowledge, and that they have a larger positive impact on adequacy than on fluency. We release our code here:
https://github.com/Mao-KU/JASS/tree/master/linguistically-driven-pretraining.
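As a rough sketch of the bunsetsu-level objectives (not the authors' released implementation; the function names, the 50% masking ratio, and the toy segmentation are illustrative assumptions), the Python snippet below builds MASS-style masking and reordering training pairs from a sentence that has already been segmented into bunsetsu, e.g., with a bunsetsu-aware parser such as KNP:

import random

MASK = "[MASK]"

def bunsetsu_mass_pair(bunsetsu, mask_ratio=0.5, seed=0):
    # Mask a contiguous span of bunsetsu on the encoder side; the decoder
    # is trained to reconstruct the masked span (a MASS-style objective
    # applied at the bunsetsu level rather than the subword level).
    rng = random.Random(seed)
    n = len(bunsetsu)
    span = max(1, int(n * mask_ratio))
    start = rng.randint(0, n - span)
    source = bunsetsu[:start] + [MASK] * span + bunsetsu[start + span:]
    target = bunsetsu[start:start + span]
    return " ".join(source), " ".join(target)

def bunsetsu_reorder_pair(bunsetsu, seed=0):
    # Shuffle the bunsetsu on the encoder side; the decoder is trained to
    # restore the original order (a reordering objective).
    rng = random.Random(seed)
    shuffled = list(bunsetsu)
    rng.shuffle(shuffled)
    return " ".join(shuffled), " ".join(bunsetsu)

# Toy, pre-segmented sentence ("I read a book at the library yesterday").
sent = ["私は", "昨日", "図書館で", "本を", "読んだ"]
print(bunsetsu_mass_pair(sent))     # masked source -> masked span as target
print(bunsetsu_reorder_pair(sent))  # shuffled source -> original order as target

The ENSS objectives would be analogous under the same assumptions, with English phrase-structure constituents (e.g., obtained from a constituency parser) taking the place of bunsetsu.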
Related papers
- JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation [63.83457341009046]
JMMMU (Japanese MMMU) is the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks based on the Japanese cultural context.
Using the CA subset, we observe a performance drop in many LMMs when evaluated in Japanese, which is purely attributable to language variation.
By combining both subsets, we identify that some LMMs perform well on the CA subset but not on the CS subset, exposing a shallow understanding of the Japanese language that lacks cultural depth.
arXiv Detail & Related papers (2024-10-22T17:59:56Z)
- Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning [47.75550640881761]
We explore cross-lingual generalization in instruction tuning by applying it to non-English tasks.
We design cross-lingual templates to mitigate discrepancies in language and instruction-format of the template between training and inference.
Our experiments reveal consistent improvements through cross-lingual generalization in both English and Korean.
arXiv Detail & Related papers (2024-06-13T04:10:17Z)
- Multilingual Pretraining and Instruction Tuning Improve Cross-Lingual Knowledge Alignment, But Only Shallowly [53.04368883943773]
Two approaches are proposed to address this, i.e., multilingual pretraining and multilingual instruction tuning.
We propose CLiKA to assess the cross-lingual knowledge alignment of LLMs in the Performance, Consistency and Conductivity levels.
Results show that while both multilingual pretraining and instruction tuning are beneficial for cross-lingual knowledge alignment, the training strategy needs to be carefully designed.
arXiv Detail & Related papers (2024-04-06T15:25:06Z)
- Exploration of Language Dependency for Japanese Self-Supervised Speech Representation Models [18.22157315310462]
Self-supervised learning (SSL) has been dramatically successful not only in monolingual but also in cross-lingual settings.
In this paper, we investigate how effective a cross-lingual model is in comparison with a monolingual model.
We examine how much unlabeled data collected in Japanese is needed to achieve performance comparable to a cross-lingual model pre-trained with tens of thousands of hours of English and/or multilingual data.
arXiv Detail & Related papers (2023-05-09T06:28:10Z)
- JCSE: Contrastive Learning of Japanese Sentence Embeddings and Its Applications [4.482886054198201]
JCSE creates training data by generating sentences and synthesizing them with sentences available in a target domain.
It is then used to generate contradictory sentence pairs that are used in contrastive learning for adapting a Japanese language model to a specific task in the target domain.
arXiv Detail & Related papers (2023-01-19T17:41:46Z)
- Compositional Evaluation on Japanese Textual Entailment and Similarity [20.864082353441685]
Natural Language Inference (NLI) and Semantic Textual Similarity (STS) are widely used benchmark tasks for compositional evaluation of pre-trained language models.
Despite growing interest in linguistic universals, most NLI/STS studies have focused almost exclusively on English.
There are no available multilingual NLI/STS datasets in Japanese, which is typologically different from English.
arXiv Detail & Related papers (2022-08-09T15:10:56Z)
- Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for cross-lingual sequence labeling (xSL), named Cross-lingual Language Informative Span Masking (CLISM), to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage the consistency between representations of input parallel sequences.
arXiv Detail & Related papers (2022-04-11T15:55:20Z)
- JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation [27.364702152624034]
JASS is joint BMASS (Bunsetsu MASS) and BRSS (Bunsetsu Reordering Sequence to Sequence) pre-training.
We show for the first time that joint MASS and JASS pre-training gives results that significantly surpass the individual methods.
We will release our code, pre-trained models and bunsetsu annotated data as resources for researchers to use in their own NLP tasks.
arXiv Detail & Related papers (2020-05-07T09:53:25Z)
- Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve word order information for natural language processing tasks, generating fixed position indices for input sequences.
Due to word order divergences across languages, modeling cross-lingual positional relationships might help self-attention networks (SANs) tackle this problem.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)
- Pre-training via Leveraging Assisting Languages and Data Selection for Neural Machine Translation [49.51278300110449]
We propose to exploit monolingual corpora of other languages to compensate for the scarcity of monolingual corpora for the languages of interest.
A case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora.
arXiv Detail & Related papers (2020-01-23T02:47:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.