Unbiased Sentence Encoder For Large-Scale Multi-lingual Search Engines
- URL: http://arxiv.org/abs/2106.07719v1
- Date: Mon, 1 Mar 2021 07:19:16 GMT
- Title: Unbiased Sentence Encoder For Large-Scale Multi-lingual Search Engines
- Authors: Mahdi Hajiaghayi, Monir Hajiaghayi, Mark Bolin
- Abstract summary: We present a multi-lingual sentence encoder that can be used in search engines as a query and document encoder.
This embedding enables a semantic similarity score between queries and documents that can be an important feature in document ranking and relevancy.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a multi-lingual sentence encoder that can be used
in search engines as a query and document encoder. This embedding enables a
semantic similarity score between queries and documents that can be an
important feature in document ranking and relevancy. To train such a customized
sentence encoder, it is beneficial to leverage users search data in the form of
query-document clicked pairs however, we must avoid relying too much on search
click data as it is biased and does not cover many unseen cases. The search
data is heavily skewed towards short queries and for long queries is small and
often noisy. The goal is to design a universal multi-lingual encoder that works
for all cases and covers both short and long queries. We select a number of
public NLI datasets in different languages and translation data and together
with user search data we train a language model using a multi-task approach. A
challenge is that these datasets are not homogeneous in terms of content, size
and the balance ratio. While the public NLI datasets are usually two-sentence
based with the same portion of positive and negative pairs, the user search
data can contain multi-sentence documents and only positive pairs. We show how
multi-task training enables us to leverage all these datasets and exploit
knowledge sharing across these tasks.
Related papers
- QueryBuilder: Human-in-the-Loop Query Development for Information Retrieval [12.543590253664492]
We present a novel, interactive system called $textitQueryBuilder$.
It allows a novice, English-speaking user to create queries with a small amount of effort.
It rapidly develops cross-lingual information retrieval queries corresponding to the user's information needs.
arXiv Detail & Related papers (2024-09-07T00:46:58Z) - Query-oriented Data Augmentation for Session Search [71.84678750612754]
We propose query-oriented data augmentation to enrich search logs and empower the modeling.
We generate supplemental training pairs by altering the most important part of a search context.
We develop several strategies to alter the current query, resulting in new training data with varying degrees of difficulty.
arXiv Detail & Related papers (2024-07-04T08:08:33Z) - Leveraging Large Language Models for Multimodal Search [0.6249768559720121]
This paper introduces a novel multimodal search model that achieves a new performance milestone on the Fashion200K dataset.
We also propose a novel search interface integrating Large Language Models (LLMs) to facilitate natural language interaction.
arXiv Detail & Related papers (2024-04-24T10:30:42Z) - Improving Topic Relevance Model by Mix-structured Summarization and LLM-based Data Augmentation [16.170841777591345]
In most social search scenarios such as Dianping, modeling search relevance always faces two challenges.
We first take queryd with the query-based summary and the document summary without query as the input of topic relevance model.
Then, we utilize the language understanding and generation abilities of large language model (LLM) to rewrite and generate query from queries and documents in existing training data.
arXiv Detail & Related papers (2024-04-03T10:05:47Z) - Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for
Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z) - On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z) - Text Summarization with Latent Queries [60.468323530248945]
We introduce LaQSum, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms.
Under a deep generative framework, our system jointly optimize a latent query model and a conditional language model, allowing users to plug-and-play queries of any type at test time.
Our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.
arXiv Detail & Related papers (2021-05-31T21:14:58Z) - Acoustic span embeddings for multilingual query-by-example search [20.141444548841047]
In low- or zero-resource settings, QbE search is often addressed with approaches based on dynamic time warping (DTW)
Recent work has found that methods based on acoustic word embeddings (AWEs) can improve both performance and search speed.
We generalize AWE training to spans of words, producing acoustic span embeddings (ASE), and explore the application of AWE to arbitrary-length queries in multiple unseen languages.
arXiv Detail & Related papers (2020-11-24T00:28:22Z) - Cross-Lingual Document Retrieval with Smooth Learning [31.638708227607214]
Cross-lingual document search is an information retrieval task in which the queries' language differs from the documents' language.
We propose a novel end-to-end robust framework that achieves improved performance in cross-lingual search with different documents' languages.
arXiv Detail & Related papers (2020-11-02T03:17:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.