Typo-Robust Representation Learning for Dense Retrieval
- URL: http://arxiv.org/abs/2306.10348v1
- Date: Sat, 17 Jun 2023 13:48:30 GMT
- Title: Typo-Robust Representation Learning for Dense Retrieval
- Authors: Panuthep Tasawong, Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, Sarana Nutanong
- Abstract summary: One of the main challenges of dense retrieval in real-world settings is the handling of queries containing misspelled words.
A popular approach for handling misspelled queries is minimizing the representation discrepancy between misspelled queries and their pristine counterparts.
Unlike existing approaches, which focus only on the alignment between misspelled and pristine queries, our method also improves the contrast between each misspelled query and its surrounding queries.
- Score: 6.148710657178892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dense retrieval is a basic building block of information retrieval
applications. One of the main challenges of dense retrieval in real-world
settings is the handling of queries containing misspelled words. A popular
approach for handling misspelled queries is minimizing the representation
discrepancy between misspelled queries and their pristine counterparts. Unlike
existing approaches, which focus only on the alignment between misspelled and
pristine queries, our method also improves the contrast between each misspelled
query and its surrounding queries. To assess the effectiveness of our proposed
method, we compare it against the existing competitors using two benchmark
datasets and two base encoders. Our method outperforms the competitors in all
cases with misspelled queries. Our code and models are available at
https://github.com/panuthept/DST-DenseRetrieval.
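The alignment-plus-contrast objective described in the abstract can be illustrated with a short sketch. This is a minimal illustration under assumed choices (arbitrary encoder outputs, an assumed temperature `tau`, and in-batch negatives standing in for the "surrounding queries"), not the authors' implementation; see the linked repository for the actual training code.

```python
import torch
import torch.nn.functional as F

def alignment_contrast_loss(pristine_emb, misspelled_emb, tau=0.05):
    """Pull each misspelled query toward its pristine counterpart
    (alignment) while pushing it away from the other queries in the
    batch (contrast). Both inputs have shape [batch, dim]."""
    pristine = F.normalize(pristine_emb, dim=-1)
    misspelled = F.normalize(misspelled_emb, dim=-1)
    # Cosine similarity of every misspelled query to every pristine query.
    sim = misspelled @ pristine.T / tau  # [batch, batch]
    # Diagonal entries match each misspelled query to its own pristine
    # version; off-diagonal entries act as "surrounding" in-batch negatives.
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)

# Toy usage with random embeddings standing in for encoder outputs.
pristine = torch.randn(8, 768)
misspelled = pristine + 0.1 * torch.randn(8, 768)
print(alignment_contrast_loss(pristine, misspelled))
```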
Related papers
- Aligning Query Representation with Rewritten Query and Relevance Judgments in Conversational Search [32.35446999027349]
We leverage both rewritten queries and relevance judgments in the conversational search data to train a better query representation model.
The proposed model, Query Representation Alignment Conversational Retriever (QRACDR), is tested on eight datasets.
arXiv Detail & Related papers (2024-07-29T17:14:36Z)
- SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query.
Existing methods such as similarity search and cross-encoder models exhibit significant limitations.
We introduce SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle contradictory nuances between sentences.
arXiv Detail & Related papers (2024-06-15T21:57:03Z)
- LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency [65.01402723259098]
We propose a novel query-rewrite method named LLM-R2, which adopts a large language model (LLM) to propose possible rewrite rules for a database rewrite system.
Experimental results show that our method significantly improves query execution efficiency and outperforms the baseline methods.
arXiv Detail & Related papers (2024-04-19T13:17:07Z)
- List-aware Reranking-Truncation Joint Model for Search and Retrieval-augmented Generation [80.12531449946655]
We propose a Reranking-Truncation joint model (GenRT) that can perform the two tasks concurrently.
GenRT integrates reranking and truncation via a generative paradigm based on an encoder-decoder architecture.
Our method achieves SOTA performance on both reranking and truncation tasks for web search and retrieval-augmented LLMs.
arXiv Detail & Related papers (2024-02-05T06:52:53Z)
- Keyword Embeddings for Query Suggestion [3.7900158137749322]
This paper proposes two novel models, trained on scientific literature, for the keyword suggestion task.
Our techniques adapt the architecture of Word2Vec and FastText to generate keyword embeddings by leveraging documents' keyword co-occurrence (see the sketch below).
We evaluate our proposals against state-of-the-art word and sentence embedding models, showing considerable improvements over the baselines.
arXiv Detail & Related papers (2023-01-19T11:13:04Z)
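The Word2Vec adaptation above can be approximated with off-the-shelf tooling by treating each document's keyword list as one "sentence", so keywords that co-occur on a document share a context window. A minimal sketch with gensim; the toy corpus, the hyperparameters, and the document-sized window are assumptions, not the paper's exact models.

```python
from gensim.models import Word2Vec

# Each "sentence" is the keyword list of one document, so the context
# window captures keyword co-occurrence within a document.
doc_keywords = [
    ["dense retrieval", "query suggestion", "embeddings"],
    ["keyword search", "query suggestion", "information retrieval"],
    ["dense retrieval", "typos", "robustness"],
]

model = Word2Vec(
    sentences=doc_keywords,
    vector_size=100,   # embedding dimensionality
    window=10,         # large enough to span a whole keyword set
    min_count=1,       # keep rare keywords in this tiny corpus
    sg=1,              # skip-gram variant of Word2Vec
)
print(model.wv.most_similar("query suggestion", topn=2))
```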
- CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion [68.19934563919192]
We propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query.
Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
arXiv Detail & Related papers (2022-12-18T15:57:46Z)
- Decoding a Neural Retriever's Latent Space for Query Suggestion [28.410064376447718]
We show that it is possible to decode a meaningful query from its latent representation and, when moving in the right direction in latent space, to decode a query that retrieves the relevant paragraph.
We employ the query decoder to generate a large synthetic dataset of query reformulations for MSMarco.
On this data, we train a pseudo-relevance feedback (PRF) T5 model for the application of query suggestion.
arXiv Detail & Related papers (2022-10-21T16:19:31Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLMs) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability (see the sketch below).
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
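UnifieR's dual-representing idea can be caricatured by scoring each document with both a dense similarity and a lexical-overlap term and interpolating the two. A generic hybrid-retrieval sketch, not the UnifieR model: the vectors, the term-overlap scorer, and the weight `alpha` are all assumptions.

```python
import math
from collections import Counter

def dense_score(q_vec, d_vec):
    """Cosine similarity between query and document vectors."""
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in d_vec))
    return dot / norm if norm else 0.0

def lexical_score(query, doc):
    """Simple term overlap standing in for a learned lexicon weighting."""
    q_terms, d_terms = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(min(q_terms[t], d_terms[t]) for t in q_terms)

def hybrid_score(q_vec, d_vec, query, doc, alpha=0.5):
    # Interpolate the two paradigms; alpha is an assumed tuning knob.
    return alpha * dense_score(q_vec, d_vec) + (1 - alpha) * lexical_score(query, doc)

print(hybrid_score([0.1, 0.9], [0.2, 0.8], "dense retrieval", "unified dense retrieval"))
```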
- CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos [26.053028706793587]
We show that small character-level perturbations in queries (as caused by typos) highly impact the effectiveness of dense retrievers.
In BERT, tokenization is performed using the WordPiece tokenizer.
We then turn our attention to devising dense retriever methods that are robust to such typo queries (a toy typo generator is sketched below).
arXiv Detail & Related papers (2022-04-01T23:02:50Z)
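Typo-robustness methods like the one above train against synthetically misspelled queries. A toy character-level perturbation generator follows; the operation set (delete, swap, duplicate) and the perturbation rate are assumptions, not the exact generator used in the paper.

```python
import random

def misspell(query, rate=0.15, seed=None):
    """Randomly delete, swap, or duplicate characters to mimic typos."""
    rng = random.Random(seed)
    chars = list(query)
    i = 0
    while i < len(chars):
        if chars[i].isalpha() and rng.random() < rate:
            op = rng.choice(["delete", "swap", "duplicate"])
            if op == "delete":
                del chars[i]
                continue  # stay at the same index after a deletion
            if op == "swap" and i + 1 < len(chars):
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                i += 1  # skip the character swapped into position i
            elif op == "duplicate":
                chars.insert(i, chars[i])
                i += 1  # skip the duplicated copy
        i += 1
    return "".join(chars)

print(misspell("typo robust dense retrieval", seed=3))
```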
- End-to-End Open Vocabulary Keyword Search [13.90172596423425]
We propose a model directly optimized for keyword search.
The proposed model outperforms similar end-to-end models on a task where the ratio of positive and negative trials is artificially balanced.
Using our system to rescore the outputs of an LVCSR-based keyword search system leads to significant improvements.
arXiv Detail & Related papers (2021-08-23T18:34:53Z)
- Query Resolution for Conversational Search with Limited Supervision [63.131221660019776]
We propose QuReTeC (Query Resolution by Term Classification), a neural query resolution model based on bidirectional transformers.
We show that QuReTeC outperforms state-of-the-art models and, furthermore, that our distant supervision method can be used to substantially reduce the amount of human-curated data required to train QuReTeC (a toy resolution sketch follows below).
arXiv Detail & Related papers (2020-05-24T11:37:22Z)
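QuReTeC's central step, deciding which terms from the conversation history to add to the current turn, can be wired up as follows. This is a hedged sketch: `term_relevance` is a hypothetical placeholder heuristic standing in for the trained bidirectional-transformer classifier.

```python
def term_relevance(term, current_query):
    """Hypothetical stand-in for QuReTeC's trained term classifier.
    A crude heuristic so the example runs end to end."""
    return 0.9 if term not in current_query.split() and len(term) > 4 else 0.1

def resolve_query(history_turns, current_query, threshold=0.5):
    """Expand the current query with history terms the classifier keeps."""
    history_terms = {t for turn in history_turns for t in turn.lower().split()}
    kept = [t for t in sorted(history_terms)
            if term_relevance(t, current_query) >= threshold]
    return current_query + " " + " ".join(kept)

history = ["what is dense retrieval", "how does it handle typos"]
print(resolve_query(history, "and misspelled queries"))
```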
This list is automatically generated from the titles and abstracts of the papers on this site.