Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish
- URL: http://arxiv.org/abs/2205.15712v2
- Date: Wed, 1 Jun 2022 07:59:45 GMT
- Title: Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish
- Authors: Michał Możdżonek, Anna Wróblewska, Sergiy Tkachuk, Szymon Łukasik
- Abstract summary: The paper shows that pre-trained, multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem.
We tested multilingual mBERT and XLM-RoBERTa models in English on Web Data Commons.
We prepared a new dataset entirely in Polish, which allows comparing the effectiveness of the pre-trained models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Product matching is the task of matching identical products
across different data sources. It typically employs available product
features which, apart from being multimodal, i.e., composed of various data
types, may be non-homogeneous and incomplete. The paper shows that
pre-trained, multilingual Transformer models, after fine-tuning, are suitable
for solving the product matching problem using textual features in both
English and Polish. We tested the multilingual mBERT and XLM-RoBERTa models
in English on Web Data Commons, a training dataset and gold standard for
large-scale product matching. The results show that these models perform
comparably to the latest solutions tested on this set, and in some cases even
better.
Additionally, we prepared a new dataset entirely in Polish, based on offers
in selected categories collected from several online stores for research
purposes. It is the first open dataset for product matching in Polish and
enables comparison of the effectiveness of pre-trained models. We also report
the baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models
on this Polish dataset.
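
Below is a minimal sketch, not the authors' code, of how such a setup can be
reproduced: product matching framed as binary classification over pairs of
offer texts with a fine-tuned multilingual Transformer. The checkpoint name,
toy offer pairs, and hyperparameters are illustrative assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "bert-base-multilingual-cased" (mBERT) can be swapped in the same way.
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical offer pairs: textual features of two offers; label 1 = same product.
pairs = [
    ("Laptop Dell XPS 13 16GB RAM", "Dell XPS 13 notebook, 16 GB", 1),
    ("Laptop Dell XPS 13 16GB RAM", "Smartfon Samsung Galaxy S21 128GB", 0),
]
texts_a = [a for a, _, _ in pairs]
texts_b = [b for _, b, _ in pairs]
labels = torch.tensor([y for _, _, y in pairs])

# Each pair is encoded as one sequence joined by the model's separator tokens.
batch = tokenizer(texts_a, texts_b, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

# One gradient step for illustration; real fine-tuning iterates over the full
# training set for several epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()

# At inference time, the argmax over the two logits predicts match / non-match.
model.eval()
with torch.no_grad():
    predictions = model(**batch).logits.argmax(dim=-1)
print(predictions.tolist())

Framing matching as pair classification lets the same fine-tuning recipe run
unchanged on the English Web Data Commons pairs and the Polish offer pairs.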
Related papers
- Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training [12.29061850090405]
We build upon previous work by replicating existing results on C4 and extending them with our optimized rephrasing pipeline.
Our pipeline leads to increased performance on standard evaluation benchmarks in both the mono- and multilingual setup.
arXiv Detail & Related papers (2024-10-28T07:30:05Z) - MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis [1.5761916307614148]
We propose the first benchmark of sentence embeddings for French.
We compare 51 carefully selected embedding models on a large scale.
We find that while no single model is best on all tasks, large multilingual models pre-trained on sentence similarity perform exceptionally well.
arXiv Detail & Related papers (2024-05-30T20:34:37Z) - Constructing Multilingual Code Search Dataset Using Neural Machine Translation [48.32329232202801]
We create a multilingual code search dataset in four natural and four programming languages.
Our results show that the model pre-trained on all natural and programming language data performed best in most cases.
arXiv Detail & Related papers (2023-06-27T16:42:36Z) - Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - X-FACT: A New Benchmark Dataset for Multilingual Fact Checking [21.2633064526968]
We introduce X-FACT: the largest publicly available multilingual dataset for factual verification of naturally existing real-world claims.
The dataset contains short statements in 25 languages and is labeled for veracity by expert fact-checkers.
arXiv Detail & Related papers (2021-06-17T05:09:54Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Cross-Lingual Low-Resource Set-to-Description Retrieval for Global E-Commerce [83.72476966339103]
Cross-lingual information retrieval is a new task in cross-border e-commerce.
We propose a novel cross-lingual matching network (CLMN) with the enhancement of context-dependent cross-lingual mapping.
Experimental results indicate that our proposed CLMN yields impressive results on the challenging task.
arXiv Detail & Related papers (2020-05-17T08:10:51Z) - KLEJ: Comprehensive Benchmark for Polish Language Understanding [4.702729080310267]
We introduce a comprehensive multi-task benchmark for Polish language understanding, accompanied by an online leaderboard.
We also release HerBERT, a Transformer-based model trained specifically for the Polish language, which has the best average performance and obtains the best results for three out of nine tasks.
arXiv Detail & Related papers (2020-05-01T21:55:40Z) - Cross-lingual Information Retrieval with BERT [8.052497255948046]
We explore the use of the popular bidirectional language model, BERT, to model and learn the relevance between English queries and foreign-language documents.
A deep relevance matching model based on BERT is introduced and trained by fine-tuning a pre-trained multilingual BERT model with weak supervision.
Experimental results of the retrieval of Lithuanian documents against short English queries show that our model is effective and outperforms the competitive baseline approaches.
arXiv Detail & Related papers (2020-04-24T23:32:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.