JCSE: Contrastive Learning of Japanese Sentence Embeddings and Its Applications
- URL: http://arxiv.org/abs/2301.08193v1
- Date: Thu, 19 Jan 2023 17:41:46 GMT
- Title: JCSE: Contrastive Learning of Japanese Sentence Embeddings and Its Applications
- Authors: Zihao Chen, Hisashi Handa, Kimiaki Shirahama
- Abstract summary: JCSE creates training data by generating sentences and synthesizing them with sentences available in a target domain.
It is then used to generate contradictory sentence pairs that are used in contrastive learning for adapting a Japanese language model to a specific task in the target domain.
- Score: 4.482886054198201
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning is widely used for sentence representation learning.
Despite this prevalence, most studies have focused exclusively on English and
few concern domain adaptation for domain-specific downstream tasks, especially
for low-resource languages like Japanese, which are characterized by
insufficient target domain data and the lack of a proper training strategy. To
overcome this, we propose a novel Japanese sentence representation framework,
JCSE (derived from "Contrastive learning of Sentence Embeddings for
Japanese"), that creates training data by generating sentences and
synthesizing them with sentences available in a target domain. Specifically, a
pre-trained data generator is finetuned to a target domain using our collected
corpus. It is then used to generate contradictory sentence pairs that are used
in contrastive learning for adapting a Japanese language model to a specific
task in the target domain.
Another problem of Japanese sentence representation learning is the
difficulty of evaluating existing embedding methods due to the lack of
benchmark datasets. Thus, we establish a comprehensive Japanese Semantic
Textual Similarity (STS) benchmark on which various embedding models are
evaluated. Based on this benchmark result, multiple embedding methods are
chosen and compared with JCSE on two domain-specific tasks, STS in a clinical
domain and information retrieval in an educational domain. The results show
that JCSE achieves significant performance improvement surpassing direct
transfer and other training strategies. This empirically demonstrates JCSE's
effectiveness and practicability for downstream tasks of a low-resource
language.
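The contrastive step described above, where generated contradictory sentences serve as hard negatives, can be sketched with a supervised SimCSE-style InfoNCE loss. This is a minimal illustration of the general technique, not the authors' implementation; the function name and the NumPy formulation are assumptions for clarity:

```python
import numpy as np

def contrastive_loss(anchors, positives, hard_negatives, temperature=0.05):
    """InfoNCE-style loss with hard negatives. Each anchor sentence is
    pulled toward its positive and pushed away from in-batch positives
    of other anchors and from its (generated) contradictory sentences."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a = normalize(np.asarray(anchors, dtype=float))
    p = normalize(np.asarray(positives, dtype=float))
    n = normalize(np.asarray(hard_negatives, dtype=float))

    # Cosine similarity of each anchor to every positive (in-batch
    # negatives) and to every hard negative, scaled by temperature.
    sim = np.concatenate([a @ p.T, a @ n.T], axis=1) / temperature

    # Cross-entropy where the "correct class" for anchor i is column i
    # (its own positive). Subtract the row max for numerical stability.
    logits = sim - sim.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    batch = np.arange(len(a))
    return -log_probs[batch, batch].mean()
```

A well-trained encoder makes this loss small: paraphrase pairs score high, contradictory pairs score low. In practice the embeddings would come from the adapted Japanese language model and the loss would be minimized by gradient descent over its parameters.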
Related papers
- Cross-Lingual Word Alignment for ASEAN Languages with Contrastive Learning [5.5119571570277826]
Cross-lingual word alignment plays a crucial role in various natural language processing tasks.
A recent study proposes a BiLSTM-based encoder-decoder model that outperforms pre-trained language models in low-resource settings.
We propose incorporating contrastive learning into the BiLSTM-based encoder-decoder framework.
arXiv Detail & Related papers (2024-07-06T11:56:41Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - Sentence Representation Learning with Generative Objective rather than
Contrastive Objective [86.01683892956144]
We propose a novel generative self-supervised learning objective based on phrase reconstruction.
Our generative learning achieves powerful enough performance improvement and outperforms the current state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-10-16T07:47:46Z) - Compositional Evaluation on Japanese Textual Entailment and Similarity [20.864082353441685]
Natural Language Inference (NLI) and Semantic Textual Similarity (STS) are widely used benchmark tasks for compositional evaluation of pre-trained language models.
Despite growing interest in linguistic universals, most NLI/STS studies have focused almost exclusively on English.
No multilingual NLI/STS datasets are available for Japanese, which is typologically different from English.
arXiv Detail & Related papers (2022-08-09T15:10:56Z) - Bridging the Gap between Language Models and Cross-Lingual Sequence
Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for xSL named Cross-lingual Language Informative Span Masking (CLISM) to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage consistency between the representations of parallel input sequences.
arXiv Detail & Related papers (2022-04-11T15:55:20Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - Linguistically-driven Multi-task Pre-training for Low-resource Neural
Machine Translation [31.225252462128626]
We propose Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English.
JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is proposed based on phrase structure masking and reordering tasks.
arXiv Detail & Related papers (2022-01-20T09:10:08Z) - AStitchInLanguageModels: Dataset and Methods for the Exploration of
Idiomaticity in Pre-Trained Language Models [7.386862225828819]
This work presents a novel dataset of naturally occurring sentences containing MWEs manually classified into a fine-grained set of meanings.
We use this dataset in two tasks designed to test i) a language model's ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms.
arXiv Detail & Related papers (2021-09-09T16:53:17Z) - On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z) - Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.