Related papers: Domain Adaptation for Japanese Sentence Embeddings with Contrastive Learning based on Synthetic Sentence Generation

Domain Adaptation for Japanese Sentence Embeddings with Contrastive Learning based on Synthetic Sentence Generation

URL: http://arxiv.org/abs/2503.09094v1
Date: Wed, 12 Mar 2025 06:15:33 GMT
Title: Domain Adaptation for Japanese Sentence Embeddings with Contrastive Learning based on Synthetic Sentence Generation
Authors: Zihao Chen, Hisashi Handa, Miho Ohsaki, Kimiaki Shirahama,
Abstract summary: This paper introduces SDJC (Self-supervised Domain adaptation for Japanese sentence embeddings with Contrastive learning)<n>A benchmark dataset is constructed by combining datasets machine-translated from English with existing datasets.<n>The experimental results validates the effectiveness of SDJC on two domain-specific downstream tasks as well as the usefulness of the constructed dataset.
Score: 2.6505619784178047
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Several backbone models pre-trained on general domain datasets can encode a sentence into a widely useful embedding. Such sentence embeddings can be further enhanced by domain adaptation that adapts a backbone model to a specific domain. However, domain adaptation for low-resource languages like Japanese is often difficult due to the scarcity of large-scale labeled datasets. To overcome this, this paper introduces SDJC (Self-supervised Domain adaptation for Japanese sentence embeddings with Contrastive learning) that utilizes a data generator to generate sentences, which have the same syntactic structure to a sentence in an unlabeled specific domain corpus but convey different semantic meanings. Generated sentences are then used to boost contrastive learning that adapts a backbone model to accurately discriminate sentences in the specific domain. In addition, the components of SDJC like a backbone model and a method to adapt it need to be carefully selected, but no benchmark dataset is available for Japanese. Thus, a comprehensive Japanese STS (Semantic Textual Similarity) benchmark dataset is constructed by combining datasets machine-translated from English with existing datasets. The experimental results validates the effectiveness of SDJC on two domain-specific downstream tasks as well as the usefulness of the constructed dataset. Datasets, codes and backbone models adapted by SDJC are available on our github repository https://github.com/ccilab-doshisha/SDJC.

Related papers

Logos as a Well-Tempered Pre-train for Sign Language Recognition [75.42794328290088]
This paper presents Logos, a novel Russian Sign Language (RSL) dataset.<n>It is shown that a model, pre-trained on the Logos dataset can be used as a universal encoder for other language SLR tasks.<n>We show that explicitly labeling visually similar signs improves trained model quality as a visual encoder for downstream tasks.
arXiv Detail & Related papers (2025-05-15T16:31:49Z)
BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English [8.823927892310238]
BESSTIE is a benchmark for sentiment and sarcasm classification for three varieties of English: Australian (en-AU), Indian (en-IN), and British (en-UK)<n>We collect datasets for these language varieties using two methods: location-based and topic-based filtering.<n>Native speakers of the language varieties manually annotate the datasets with sentiment and sarcasm labels.
arXiv Detail & Related papers (2024-12-06T02:34:40Z)
LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries. Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing machine translation (NMT) studies mainly focus on developing dataset-specific models. We propose a versatile'' model, i.e., the Unified Model Learning for NMT (UMLNMT) that works with data from different tasks. OurNMT results in substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data. We design a simple but effective ensemble-based framework that combines various transfer learning techniques. We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
JCSE: Contrastive Learning of Japanese Sentence Embeddings and Its Applications [4.482886054198201]
JCSE creates training data by generating sentences and synthesizing them with sentences available in a target domain. It is then used to generate contradictory sentence pairs that are used in contrastive learning for adapting a Japanese language model to a specific task in the target domain.
arXiv Detail & Related papers (2023-01-19T17:41:46Z)
Adversarial Adaptation for French Named Entity Recognition [21.036698406367115]
We propose a Transformer-based NER approach for French, using adversarial adaptation to similar domain or general corpora. Our approach allows learning better features using large-scale unlabeled corpora from the same domain or mixed domains. We also show that adversarial adaptation to large-scale unlabeled corpora can help mitigate the performance dip incurred on using Transformer models pre-trained on smaller corpora.
arXiv Detail & Related papers (2023-01-12T18:58:36Z)
Improving Retrieval Augmented Neural Machine Translation by Controlling Source and Fuzzy-Match Interactions [15.845071122977158]
We build on the idea of Retrieval Augmented Translation (RAT) where top-k in-domain fuzzy matches are found for the source sentence. We propose a novel architecture to control interactions between a source sentence and the top-k fuzzy target-language matches.
arXiv Detail & Related papers (2022-10-10T23:33:15Z)
Domain Adaptation in Multilingual and Multi-Domain Monolingual Settings for Complex Word Identification [0.27998963147546146]
Complex word identification (CWI) is a cornerstone process towards proper text simplification. CWI is highly dependent on context, whereas its difficulty is augmented by the scarcity of available datasets. We propose a novel training technique for the CWI task based on domain adaptation to improve the target character and context representations.
arXiv Detail & Related papers (2022-05-15T13:21:02Z)
Non-Parametric Unsupervised Domain Adaptation for Neural Machine Translation [61.27321597981737]
$k$NN-MT has shown the promising capability of directly incorporating the pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor retrieval. We propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval.
arXiv Detail & Related papers (2021-09-14T11:50:01Z)
Cross-Lingual Named Entity Recognition Using Parallel Corpus: A New Approach Using XLM-RoBERTa Alignment [5.747195707763152]
We build an entity alignment model on top of XLM-RoBERTa to project the entities detected on the English part of the parallel data to the target language sentences. Unlike using translation methods, this approach benefits from natural fluency and nuances in target-language original corpus. We evaluate this proposed approach over 4 target languages on benchmark data sets and got competitive F1 scores compared to most recent SOTA models.
arXiv Detail & Related papers (2021-01-26T22:19:52Z)
Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model. We carry out experiments on 42 translation directions across a diverse setting, including low, medium, rich resource, and as well as transferring to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG) It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains. Our framework outperforms all baseline models as well as verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.