RETSim: Resilient and Efficient Text Similarity
- URL: http://arxiv.org/abs/2311.17264v1
- Date: Tue, 28 Nov 2023 22:54:33 GMT
- Title: RETSim: Resilient and Efficient Text Similarity
- Authors: Marina Zhang, Owen Vallis, Aysegul Bumin, Tanay Vakharia, Elie
Bursztein
- Abstract summary: RETSim is a lightweight, multilingual deep learning model trained to produce robust metric embeddings for text retrieval, clustering, and dataset deduplication tasks.
We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings.
We also introduce the W4NT3D benchmark for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings.
- Score: 1.6228944467258688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces RETSim (Resilient and Efficient Text Similarity), a
lightweight, multilingual deep learning model trained to produce robust metric
embeddings for near-duplicate text retrieval, clustering, and dataset
deduplication tasks. We demonstrate that RETSim is significantly more robust
and accurate than MinHash and neural text embeddings, achieving new
state-of-the-art performance on dataset deduplication, adversarial text
retrieval benchmarks, and spam clustering tasks. We also introduce the W4NT3D
benchmark (Wiki-40B 4dversarial Near-T3xt Dataset) for evaluating multilingual,
near-duplicate text retrieval capabilities under adversarial settings. RETSim
and the W4NT3D benchmark are open-sourced under the MIT License at
https://github.com/google/unisim.
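The abstract describes RETSim as producing robust metric embeddings that are then compared with nearest-neighbor search for near-duplicate retrieval, clustering, and deduplication. Below is a minimal, self-contained sketch of that retrieval setup; the embed() function here is a toy character-trigram stand-in, not the RETSim model or the unisim API (see the linked repository for the real implementation).

```python
import numpy as np

def embed(texts, dim=256):
    """Toy stand-in for a metric embedding model such as RETSim: a hashed
    bag-of-character-trigrams vector, L2-normalized. For illustration only."""
    vecs = np.zeros((len(texts), dim), dtype=np.float32)
    for row, text in enumerate(texts):
        padded = f"  {text.lower()}  "
        for i in range(len(padded) - 2):
            vecs[row, hash(padded[i:i + 3]) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)

def near_duplicates(corpus, queries, threshold=0.8):
    """Return (query, nearest corpus text, cosine similarity) for every query
    whose best match clears the threshold -- the basic retrieval/dedup setup."""
    sims = embed(queries) @ embed(corpus).T  # cosine similarity of unit vectors
    best = sims.argmax(axis=1)
    return [(q, corpus[i], float(sims[j, i]))
            for j, (q, i) in enumerate(zip(queries, best))
            if sims[j, i] >= threshold]

corpus = ["the quick brown fox jumps over the lazy dog",
          "lorem ipsum dolor sit amet"]
print(near_duplicates(corpus, ["the qu1ck brown fox jumps over the lazy d0g"],
                      threshold=0.5))
```

With a robust embedding model in place of the toy embed(), raising or lowering the similarity threshold trades recall for precision in deduplication.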
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval [31.79030663958162]
We propose a new text modeling method, T-MASS, to enrich text embeddings with a flexible and resilient semantic range.
Specifically, we introduce a similarity-aware radius module that adapts the scale of the text mass to the given text-video pairs.
T-MASS achieves state-of-the-art performance on five benchmark datasets.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- KETM: A Knowledge-Enhanced Text Matching method [0.0]
We introduce a new model for text matching called the Knowledge Enhanced Text Matching model (KETM).
We use Wiktionary to retrieve definitions of the words in the text as our external knowledge.
We fuse text and knowledge with a gating mechanism that learns the mixing ratio between the two (a generic gated-fusion sketch appears after the related papers list).
arXiv Detail & Related papers (2023-08-11T17:08:14Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Description-Based Text Similarity [59.552704474862004]
We identify the need to search for texts based on abstract descriptions of their content.
We propose an alternative model that significantly improves performance when used in standard nearest neighbor search.
arXiv Detail & Related papers (2023-05-21T17:14:31Z)
- Video-Text Retrieval by Supervised Sparse Multi-Grained Learning [22.17732989393653]
We present a novel multi-grained sparse learning framework, S3MA, to learn a sparse space shared between the video and the text for video-text retrieval.
With the text data at hand, we learn and update the shared sparse space in a supervised manner using the proposed similarity and alignment losses.
Benefiting from the learned shared sparse space and multi-grained similarities, experiments on several video-text retrieval benchmarks demonstrate the superiority of S3MA over existing methods.
arXiv Detail & Related papers (2023-02-19T04:03:22Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embeddings for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
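The KETM entry above mentions fusing text and knowledge representations with a learned gate. As a point of reference only (not the authors' exact architecture), a generic per-dimension gated fusion of two vectors can be sketched as follows; here W and b are random placeholders, whereas in a trained model they would be learned jointly with the rest of the network.

```python
import numpy as np

def gated_fusion(text_vec, knowledge_vec, W, b):
    """Generic gated fusion: a sigmoid gate decides, per dimension, how much
    of the text representation vs. the knowledge representation to keep."""
    gate_input = np.concatenate([text_vec, knowledge_vec])   # shape [2d]
    gate = 1.0 / (1.0 + np.exp(-(W @ gate_input + b)))       # sigmoid, shape [d]
    return gate * text_vec + (1.0 - gate) * knowledge_vec    # shape [d]

d = 4
rng = np.random.default_rng(0)
fused = gated_fusion(rng.normal(size=d), rng.normal(size=d),
                     W=rng.normal(size=(d, 2 * d)), b=np.zeros(d))
print(fused.shape)  # (4,)
```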