Cross-lingual paraphrase identification
- URL: http://arxiv.org/abs/2406.15066v1
- Date: Fri, 21 Jun 2024 11:37:24 GMT
- Title: Cross-lingual paraphrase identification
- Authors: Inessa Fedorova, Aleksei Musatow
- Abstract summary: We train a bi-encoder model in a contrastive manner to detect hard paraphrases across multiple languages.
Our performance is comparable to state-of-the-art cross-encoders.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The paraphrase identification task involves measuring semantic similarity between two short sentences. It is a tricky task, and multilingual paraphrase identification is even more challenging. In this work, we train a bi-encoder model in a contrastive manner to detect hard paraphrases across multiple languages. This approach allows us to use model-produced embeddings for various tasks, such as semantic search. We evaluate our model on downstream tasks and also assess embedding space quality. Our performance is comparable to state-of-the-art cross-encoders, with only a minimal relative drop of 7-10% on the chosen dataset, while maintaining decent embedding quality.
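The abstract names the training objective but not the loss. A minimal sketch of one standard way to train a bi-encoder contrastively, using in-batch negatives and a symmetric InfoNCE loss (the temperature value and the symmetric formulation are illustrative assumptions, not details from the paper):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss for a bi-encoder: row i of
    emb_a is a paraphrase of row i of emb_b; every other row in the
    batch serves as a negative."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
    # Symmetric: each side of the pair must retrieve its true partner.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

Because the loss only needs the two embedding matrices, the same encoder can later serve semantic search directly, which is the reuse the abstract highlights.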
Related papers
- Examining Multilingual Embedding Models Cross-Lingually Through LLM-Generated Adversarial Examples [38.18495961129682]
This paper introduces a novel cross-lingual search task that does not require a large semantic corpus.
It focuses on the ability of a model to cross-lingually rank the true parallel sentence higher than challenging distractors generated by a large language model.
We present a case study of the introduced CLSD task for the German-French language pair in the news domain.
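As a rough sketch of such a ranking check (the function and cosine scoring below are illustrative assumptions, not the paper's exact protocol):

```python
import numpy as np

def clsd_top1(query_emb, parallel_emb, distractor_embs):
    """Illustrative CLSD-style check: does the true parallel sentence
    outrank every LLM-generated distractor under cosine similarity?"""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    true_score = cos(query_emb, parallel_emb)
    return all(true_score > cos(query_emb, d) for d in distractor_embs)
```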
arXiv Detail & Related papers (2025-02-12T18:54:37Z)
- Improving Multi-lingual Alignment Through Soft Contrastive Learning [9.454626745893798]
We propose a novel method to align multi-lingual embeddings based on the similarity of sentences measured by a pre-trained mono-lingual embedding model.
Given translation sentence pairs, we train a multi-lingual model so that the similarity between cross-lingual embeddings follows the similarity of sentences measured by the mono-lingual teacher model.
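A minimal sketch of one way such soft alignment could be implemented, using the frozen teacher's mono-lingual similarities as soft targets for the student's cross-lingual similarities (the KL objective and temperature are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def soft_alignment_loss(student_src, student_tgt, teacher_src, tau=0.05):
    """Push the student's cross-lingual similarity distribution toward
    the similarity distribution of a frozen mono-lingual teacher
    (soft targets instead of one-hot in-batch labels)."""
    s = F.normalize(student_src, dim=-1) @ F.normalize(student_tgt, dim=-1).T
    with torch.no_grad():  # teacher provides targets only
        t = F.normalize(teacher_src, dim=-1) @ F.normalize(teacher_src, dim=-1).T
    return F.kl_div(F.log_softmax(s / tau, dim=-1),
                    F.softmax(t / tau, dim=-1),
                    reduction="batchmean")
```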
arXiv Detail & Related papers (2024-05-25T09:46:07Z)
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
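A sketch of a contextualized phrase representation and top-1 retrieval, assuming mean-pooling over the phrase span (CCPR's actual pooling and indexing may differ):

```python
import torch
import torch.nn.functional as F

def phrase_embedding(token_embs, span):
    """Represent a phrase in context by mean-pooling contextualized
    token vectors over its span [start, end). Illustrative only."""
    start, end = span
    return token_embs[start:end].mean(dim=0)

def retrieve_top1(query_phrase, index_phrases):
    """Index of the most similar indexed phrase by cosine similarity."""
    q = F.normalize(query_phrase, dim=-1)
    idx = F.normalize(index_phrases, dim=-1)
    return int(torch.argmax(idx @ q))
```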
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Multilingual Representation Distillation with Contrastive Learning [20.715534360712425]
We integrate contrastive learning into multilingual representation distillation and use it for quality estimation of parallel sentences.
We validate our approach with multilingual similarity search and corpus filtering tasks.
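For the corpus-filtering use case, a minimal sketch (the 0.8 threshold is an illustrative assumption, not the paper's setting):

```python
import numpy as np

def filter_corpus(src_embs, tgt_embs, threshold=0.8):
    """Keep sentence pairs whose cross-lingual cosine similarity,
    computed from a distilled multilingual encoder, clears a threshold."""
    sims = np.sum(src_embs * tgt_embs, axis=1) / (
        np.linalg.norm(src_embs, axis=1) * np.linalg.norm(tgt_embs, axis=1))
    return np.where(sims >= threshold)[0]  # indices of retained pairs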
arXiv Detail & Related papers (2022-10-10T22:27:04Z)
- Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases [0.0]
We propose a method for training effective language-specific sentence encoders without manually labeled data.
Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora.
Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks.
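One plausible mining scheme, sketched below under the assumption that target-side sentences aligned to the same source-side sentence are paraphrases (the paper's exact mining criterion may differ):

```python
from collections import defaultdict
from itertools import combinations

def mine_paraphrases(bitext):
    """Mine paraphrase pairs from sentence-aligned bitext: two
    target-language sentences aligned to the same source sentence
    are treated as paraphrases. Illustrative scheme only."""
    by_source = defaultdict(set)
    for src, tgt in bitext:  # bitext: iterable of (src, tgt) pairs
        by_source[src].add(tgt)
    pairs = []
    for tgts in by_source.values():
        pairs.extend(combinations(sorted(tgts), 2))
    return pairs
```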
arXiv Detail & Related papers (2022-07-26T09:08:56Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
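A minimal sketch of the described setup, with a shared LSTM encoder feeding separate translation and reconstruction decoders (sizes and wiring are illustrative assumptions, and teacher forcing is omitted):

```python
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    """Shared LSTM encoder with two decoder heads: one translates the
    input sentence, the other reconstructs it. Illustrative wiring."""
    def __init__(self, vocab_src, vocab_tgt, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_src, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.translate = nn.LSTM(dim, dim, batch_first=True)
        self.reconstruct = nn.LSTM(dim, dim, batch_first=True)
        self.to_tgt = nn.Linear(dim, vocab_tgt)
        self.to_src = nn.Linear(dim, vocab_src)

    def forward(self, src_ids):
        enc, _ = self.encoder(self.embed(src_ids))
        trans_out, _ = self.translate(enc)
        recon_out, _ = self.reconstruct(enc)
        # Both heads are trained jointly; the encoder's states double
        # as the cross-lingual word embeddings.
        return self.to_tgt(trans_out), self.to_src(recon_out)
```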
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
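A sketch of a simple WiC-style decision rule (the cosine comparison and the 0.6 threshold are illustrative placeholders, not the benchmark's official models):

```python
import numpy as np

def same_sense(word_vec_ctx1, word_vec_ctx2, threshold=0.6):
    """Compare the target word's contextualized vectors from the two
    sentences; predict "same meaning" when cosine similarity clears
    a tuned threshold."""
    cos = word_vec_ctx1 @ word_vec_ctx2 / (
        np.linalg.norm(word_vec_ctx1) * np.linalg.norm(word_vec_ctx2))
    return cos >= threshold
```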
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
- Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z)
- A Multi-cascaded Model with Data Augmentation for Enhanced Paraphrase Detection in Short Texts [1.6758573326215689]
We present a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts.
Our model is both wide and deep and provides greater robustness across clean and noisy short texts.
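As a rough sketch of a token-dropout augmentation for noisy short texts (not necessarily the paper's exact augmentation strategy):

```python
import random

def augment(sentence, p_drop=0.1, seed=0):
    """Illustrative noising augmentation: randomly drop a small
    fraction of tokens to create noisy variants for training."""
    rng = random.Random(seed)
    tokens = sentence.split()
    if len(tokens) <= 3:  # leave very short texts untouched
        return sentence
    kept = [t for t in tokens if rng.random() > p_drop]
    return " ".join(kept) if kept else sentence
```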
arXiv Detail & Related papers (2019-12-27T12:10:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.