Typo-Robust Representation Learning for Dense Retrieval
- URL: http://arxiv.org/abs/2306.10348v1
- Date: Sat, 17 Jun 2023 13:48:30 GMT
- Title: Typo-Robust Representation Learning for Dense Retrieval
- Authors: Panuthep Tasawong, Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, Sarana Nutanong
- Abstract summary: One of the main challenges of dense retrieval in real-world settings is the handling of queries containing misspelled words.
A popular approach for handling misspelled queries is minimizing the representation discrepancy between misspelled queries and their pristine counterparts.
Unlike the existing approaches, which only focus on the alignment between misspelled and pristine queries, our method also improves the contrast between each misspelled query and its surrounding queries.
- Score: 6.148710657178892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dense retrieval is a basic building block of information retrieval
applications. One of the main challenges of dense retrieval in real-world
settings is the handling of queries containing misspelled words. A popular
approach for handling misspelled queries is minimizing the representation
discrepancy between misspelled queries and their pristine counterparts. Unlike the
existing approaches, which only focus on the alignment between misspelled and
pristine queries, our method also improves the contrast between each misspelled
query and its surrounding queries. To assess the effectiveness of our proposed
method, we compare it against the existing competitors using two benchmark
datasets and two base encoders. Our method outperforms the competitors in all
cases with misspelled queries. Our code and models are available at
https://github.com/panuthept/DST-DenseRetrieval.
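The two ingredients named in the abstract, alignment between a misspelled query and its pristine form and contrast against the surrounding queries, can be illustrated with a small sketch. This is not the authors' implementation (their repository holds that). The function names, the cosine-based alignment term, the InfoNCE-style contrast term, and the `weight` and `temperature` parameters are all assumptions chosen for illustration, and the embeddings are plain Python lists standing in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(clean_vec, typo_vec):
    """Assumed alignment term: 0 when both vectors point the same way."""
    return 1.0 - cosine(clean_vec, typo_vec)


def contrast_loss(typo_vecs, clean_vecs, index, temperature=0.1):
    """Assumed contrast term (InfoNCE style).

    Cross-entropy of picking clean_vecs[index] for typo_vecs[index]
    among all pristine queries in the batch.
    """
    logits = [cosine(typo_vecs[index], c) / temperature for c in clean_vecs]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[index]


def combined_loss(clean_vecs, typo_vecs, weight=0.5, temperature=0.1):
    """Mean alignment loss plus a weighted mean contrast loss over a batch."""
    n = len(clean_vecs)
    align = sum(alignment_loss(c, t) for c, t in zip(clean_vecs, typo_vecs)) / n
    contrast = sum(
        contrast_loss(typo_vecs, clean_vecs, i, temperature) for i in range(n)
    ) / n
    return align + weight * contrast


if __name__ == "__main__":
    clean = [[1.0, 0.0], [0.0, 1.0]]      # toy embeddings of two pristine queries
    good_typos = [[0.9, 0.1], [0.1, 0.9]]  # misspellings that stay near their source
    bad_typos = [[0.1, 0.9], [0.9, 0.1]]   # misspellings that drift to the other query
    print(combined_loss(clean, good_typos))
    print(combined_loss(clean, bad_typos))
```

In this toy setup the loss is lower when each misspelled query stays close to its own pristine query and far from the other queries in the batch, which is the behaviour the abstract describes.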
Related papers
- Aligning Query Representation with Rewritten Query and Relevance Judgments in Conversational Search [32.35446999027349]
We leverage both rewritten queries and relevance judgments in the conversational search data to train a better query representation model.
The proposed model, Query Representation Alignment Conversational Retriever (QRACDR), is tested on eight datasets.
arXiv Detail & Related papers (2024-07-29T17:14:36Z)
- LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency [65.01402723259098]
We propose a novel method of query rewrite named LLM-R2, adopting a large language model (LLM) to propose possible rewrite rules for a database rewrite system.
Experimental results have shown that our method can significantly improve the query execution efficiency and outperform the baseline methods.
arXiv Detail & Related papers (2024-04-19T13:17:07Z)
- LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries [53.843367588870585]
Top-k KNN spatial keyword queries (TkQs) return a list of objects based on a ranking function that considers both spatial and textual relevance.
There are two key challenges in building an effective and efficient index, i.e., the absence of high-quality labels and the unbalanced results.
We develop a novel pseudolabel generation technique to address the two challenges.
arXiv Detail & Related papers (2024-03-12T05:32:33Z)
- List-aware Reranking-Truncation Joint Model for Search and Retrieval-augmented Generation [80.12531449946655]
We propose a Reranking-Truncation joint model (GenRT) that can perform the two tasks concurrently.
GenRT integrates reranking and truncation via a generative paradigm based on an encoder-decoder architecture.
Our method achieves SOTA performance on both reranking and truncation tasks for web search and retrieval-augmented LLMs.
arXiv Detail & Related papers (2024-02-05T06:52:53Z)
- Keyword Embeddings for Query Suggestion [3.7900158137749322]
This paper proposes two novel models for the keyword suggestion task trained on scientific literature.
Our techniques adapt the architecture of Word2Vec and FastText to generate keyword embeddings by leveraging documents' keyword co-occurrence.
We evaluate our proposals against the state-of-the-art word and sentence embedding models showing considerable improvements over the baselines for the tasks.
arXiv Detail & Related papers (2023-01-19T11:13:04Z)
- CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion [68.19934563919192]
We propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query.
Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
arXiv Detail & Related papers (2022-12-18T15:57:46Z)
- Decoding a Neural Retriever's Latent Space for Query Suggestion [28.410064376447718]
We show that it is possible to decode a meaningful query from its latent representation and, when moving in the right direction in latent space, to decode a query that retrieves the relevant paragraph.
We employ the query decoder to generate a large synthetic dataset of query reformulations for MSMarco.
On this data, we train a pseudo-relevance feedback (PRF) T5 model for the application of query suggestion.
arXiv Detail & Related papers (2022-10-21T16:19:31Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos [26.053028706793587]
We show that a small character-level perturbation in queries (as caused by typos) highly impacts the effectiveness of dense retrievers.
In BERT, tokenization is performed using BERT's WordPiece tokenizer.
We then turn our attention to devising dense retriever methods that are robust to such typo queries.
arXiv Detail & Related papers (2022-04-01T23:02:50Z)
- End-to-End Open Vocabulary Keyword Search [13.90172596423425]
We propose a model directly optimized for keyword search.
The proposed model outperforms similar end-to-end models on a task where the ratio of positive and negative trials is artificially balanced.
Using our system to rescore the outputs of an LVCSR-based keyword search system leads to significant improvements.
arXiv Detail & Related papers (2021-08-23T18:34:53Z)
- Query Resolution for Conversational Search with Limited Supervision [63.131221660019776]
We propose QuReTeC (Query Resolution by Term Classification), a neural query resolution model based on bidirectional transformers.
We show that QuReTeC outperforms state-of-the-art models, and furthermore, that our distant supervision method can be used to substantially reduce the amount of human-curated data required to train QuReTeC.
arXiv Detail & Related papers (2020-05-24T11:37:22Z)