JCSE: Contrastive Learning of Japanese Sentence Embeddings and Its Applications
- URL: http://arxiv.org/abs/2301.08193v1
- Date: Thu, 19 Jan 2023 17:41:46 GMT
- Title: JCSE: Contrastive Learning of Japanese Sentence Embeddings and Its Applications
- Authors: Zihao Chen, Hisashi Handa, Kimiaki Shirahama
- Abstract summary: JCSE creates training data by generating sentences and synthesizing them with sentences available in a target domain.
It is then used to generate contradictory sentence pairs that are used in contrastive learning for adapting a Japanese language model to a specific task in the target domain.
- Score: 4.482886054198201
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning is widely used for sentence representation learning.
Despite this prevalence, most studies have focused exclusively on English and
few concern domain adaptation for domain-specific downstream tasks, especially
for low-resource languages like Japanese, which are characterized by
insufficient target domain data and the lack of a proper training strategy. To
overcome this, we propose a novel Japanese sentence representation framework,
JCSE (derived from "Contrastive learning of Sentence Embeddings for
Japanese"), that creates training data by generating sentences and
synthesizing them with sentences available in a target domain. Specifically, a
pre-trained data generator is finetuned to a target domain using our collected
corpus. It is then used to generate contradictory sentence pairs that are used
in contrastive learning for adapting a Japanese language model to a specific
task in the target domain.
Another problem of Japanese sentence representation learning is the
difficulty of evaluating existing embedding methods due to the lack of
benchmark datasets. Thus, we establish a comprehensive Japanese Semantic
Textual Similarity (STS) benchmark on which various embedding models are
evaluated. Based on this benchmark result, multiple embedding methods are
chosen and compared with JCSE on two domain-specific tasks, STS in a clinical
domain and information retrieval in an educational domain. The results show
that JCSE achieves significant performance improvement surpassing direct
transfer and other training strategies. This empirically demonstrates JCSE's
effectiveness and practicability for downstream tasks of a low-resource
language.
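The contrastive step described above, where generated contradictory sentences serve as hard negatives, can be sketched with a supervised SimCSE-style InfoNCE loss. This is a minimal illustration of the general technique, not the authors' implementation; the function name and the NumPy formulation are assumptions for clarity:

```python
import numpy as np

def contrastive_loss(anchors, positives, hard_negatives, temperature=0.05):
    """InfoNCE-style loss with hard negatives. Each anchor sentence is
    pulled toward its positive and pushed away from in-batch positives
    of other anchors and from its (generated) contradictory sentences."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a = normalize(np.asarray(anchors, dtype=float))
    p = normalize(np.asarray(positives, dtype=float))
    n = normalize(np.asarray(hard_negatives, dtype=float))

    # Cosine similarity of each anchor to every positive (in-batch
    # negatives) and to every hard negative, scaled by temperature.
    sim = np.concatenate([a @ p.T, a @ n.T], axis=1) / temperature

    # Cross-entropy where the "correct class" for anchor i is column i
    # (its own positive). Subtract the row max for numerical stability.
    logits = sim - sim.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    batch = np.arange(len(a))
    return -log_probs[batch, batch].mean()
```

A well-trained encoder makes this loss small: paraphrase pairs score high, contradictory pairs score low. In practice the embeddings would come from the adapted Japanese language model and the loss would be minimized by gradient descent over its parameters.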
Related papers
- Cross-Lingual Word Alignment for ASEAN Languages with Contrastive Learning [5.5119571570277826]
Cross-lingual word alignment plays a crucial role in various natural language processing tasks.
A recent study proposes a BiLSTM-based encoder-decoder model that outperforms pre-trained language models in low-resource settings.
We propose incorporating contrastive learning into the BiLSTM-based encoder-decoder framework.
arXiv Detail & Related papers (2024-07-06T11:56:41Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - Sentence Representation Learning with Generative Objective rather than
Contrastive Objective [86.01683892956144]
We propose a novel generative self-supervised learning objective based on phrase reconstruction.
Our generative learning achieves powerful enough performance improvement and outperforms the current state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-10-16T07:47:46Z) - Compositional Evaluation on Japanese Textual Entailment and Similarity [20.864082353441685]
Natural Language Inference (NLI) and Semantic Textual Similarity (STS) are widely used benchmark tasks for compositional evaluation of pre-trained language models.
Despite growing interest in linguistic universals, most NLI/STS studies have focused almost exclusively on English.
No multilingual NLI/STS datasets are available for Japanese, which is typologically different from English.
arXiv Detail & Related papers (2022-08-09T15:10:56Z) - Bridging the Gap between Language Models and Cross-Lingual Sequence
Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for xSL named Cross-lingual Language Informative Span Masking (CLISM) to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage consistency between the representations of parallel input sequences.
arXiv Detail & Related papers (2022-04-11T15:55:20Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - Linguistically-driven Multi-task Pre-training for Low-resource Neural
Machine Translation [31.225252462128626]
We propose Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English.
JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is proposed based on phrase structure masking and reordering tasks.
arXiv Detail & Related papers (2022-01-20T09:10:08Z) - AStitchInLanguageModels: Dataset and Methods for the Exploration of
Idiomaticity in Pre-Trained Language Models [7.386862225828819]
This work presents a novel dataset of naturally occurring sentences containing MWEs manually classified into a fine-grained set of meanings.
We use this dataset in two tasks designed to test i) a language model's ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms.
arXiv Detail & Related papers (2021-09-09T16:53:17Z) - On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z) - Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.