CharacterBERT and Self-Teaching for Improving the Robustness of Dense
Retrievers on Queries with Typos
- URL: http://arxiv.org/abs/2204.00716v1
- Date: Fri, 1 Apr 2022 23:02:50 GMT
- Title: CharacterBERT and Self-Teaching for Improving the Robustness of Dense
Retrievers on Queries with Typos
- Authors: Shengyao Zhuang and Guido Zuccon
- Abstract summary: We show that a small character-level perturbation in queries (as caused by typos) highly impacts the effectiveness of dense retrievers.
In BERT, tokenization is performed using the WordPiece tokenizer.
We then turn our attention to devising dense retriever methods that are robust to such typo queries.
- Score: 26.053028706793587
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Previous work has shown that dense retrievers are not robust to out-of-domain
and outlier queries, i.e. their effectiveness on these queries is much poorer
than expected. In this paper, we consider a specific instance of such
queries: queries that contain typos. We show that a small character level
perturbation in queries (as caused by typos) highly impacts the effectiveness
of dense retrievers. We then demonstrate that the root cause of this resides in
the input tokenization strategy employed by BERT. In BERT, tokenization is
performed using the WordPiece tokenizer, and we show that a token with a
typo will significantly change the token distributions obtained after
tokenization. This distribution change translates to changes in the input
embeddings passed to the BERT-based query encoder of dense retrievers. We then
turn our attention to devising dense retriever methods that are robust to such
typo queries, while still being as performant as previous methods on queries
without typos. For this, we use CharacterBERT as the backbone encoder and an
efficient yet effective training method, called Self-Teaching (ST), that
distills knowledge from queries without typos into the queries with typos.
Experimental results show that CharacterBERT in combination with ST achieves
significantly higher effectiveness on queries with typos compared to previous
methods. Along with these results and the open-sourced implementation of the
methods, we also provide a new passage retrieval dataset consisting of
real-world queries with typos and associated relevance assessments on the MS
MARCO corpus, thus supporting the research community in the investigation of
effective and robust dense retrievers.
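The tokenization effect described in the abstract can be illustrated with a minimal greedy longest-match WordPiece-style tokenizer. The tiny vocabulary and example words below are illustrative assumptions for this sketch, not BERT's actual ~30,000-piece vocabulary:

```python
# Minimal greedy longest-match WordPiece-style tokenizer (toy vocabulary).
# Illustrates how a single-character typo fragments one token into many
# subword pieces, shifting the input distribution seen by the encoder.

TOY_VOCAB = {
    "retrieval", "ret", "##rie", "##val", "##re", "##ie",
    "##t", "##r", "##i", "##e", "##v", "##a", "##l",
}

def wordpiece(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first subword segmentation, as in BERT."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # continuation-piece prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:                   # character not coverable
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece("retrieval"))   # the clean word is a single vocab entry
print(wordpiece("retreival"))   # one swapped character -> many fragments
```

With a full vocabulary the effect is the same in kind: the clean word maps to one or a few pieces the encoder has seen many times, while the typo variant shatters into rarer fragments, changing the input embeddings passed to the query encoder.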
Related papers
- SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query.
Existing methods such as similarity search and cross-encoder models exhibit significant limitations.
We introduce SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences.
arXiv Detail & Related papers (2024-06-15T21:57:03Z)
- Revisiting Sparse Retrieval for Few-shot Entity Linking [33.15662306409253]
We propose an ELECTRA-based keyword extractor to denoise the mention context and construct a better query expression.
For training the extractor, we propose a distant supervision method to automatically generate training data based on overlapping tokens between mention contexts and entity descriptions.
Experimental results on the ZESHEL dataset demonstrate that the proposed method outperforms state-of-the-art models by a significant margin across all test domains.
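The overlap-based distant supervision described above can be sketched as follows; the whitespace tokenization and the binary labeling rule are simplifying assumptions, and the paper's actual procedure may differ in detail:

```python
def distant_labels(mention_context, entity_description):
    """Weakly label context tokens as keywords (1) when they also occur
    in the entity description, else background (0) -- a token-overlap
    form of distant supervision for training a keyword extractor."""
    desc_tokens = set(entity_description.lower().split())
    return [(tok, 1 if tok.lower() in desc_tokens else 0)
            for tok in mention_context.split()]

labels = distant_labels(
    "the jaguar car unveiled a new model",
    "Jaguar is a British car manufacturer",
)
print(labels)
```

Tokens shared with the entity description become positive keyword labels, giving the extractor training data without any manual annotation.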
arXiv Detail & Related papers (2023-10-19T03:51:10Z)
- Typo-Robust Representation Learning for Dense Retrieval [6.148710657178892]
One of the main challenges of dense retrieval in real-world settings is the handling of queries containing misspelled words.
A popular approach for handling misspelled queries is minimizing the representation discrepancy between misspelled queries and their pristine counterparts.
Unlike the existing approaches, which only focus on the alignment between misspelled and pristine queries, our method also improves the contrast between each misspelled query and its surrounding queries.
arXiv Detail & Related papers (2023-06-17T13:48:30Z)
- Query Rewriting for Retrieval-Augmented Large Language Models [139.242907155883]
Large Language Models (LLMs) act as powerful, black-box readers in the retrieve-then-read pipeline.
This work introduces a new framework, Rewrite-Retrieve-Read, which replaces the previous retrieve-then-read pipeline for retrieval-augmented LLMs.
arXiv Detail & Related papers (2023-05-23T17:27:50Z)
- Noise-Robust Dense Retrieval via Contrastive Alignment Post Training [89.29256833403167]
Contrastive Alignment POst Training (CAPOT) is a highly efficient finetuning method that improves model robustness without requiring index regeneration.
CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered root.
We evaluate CAPOT on noisy variants of MS MARCO, Natural Questions, and TriviaQA passage retrieval, finding that CAPOT has a similar impact to data augmentation with none of its overhead.
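The alignment idea attributed to CAPOT above can be sketched with an InfoNCE-style objective over fixed toy embeddings: the query encoder learns to pull a noisy query's embedding toward its unaltered root while the document encoder (and hence the index) stays frozen. The vectors, temperature, and exact loss form below are illustrative assumptions, not CAPOT's actual implementation:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_loss(noisy_q, clean_q, other_queries, tau=0.05):
    """InfoNCE-style loss: the noisy query embedding should be closest
    to its unaltered root, with other queries acting as negatives."""
    pos = math.exp(cosine(noisy_q, clean_q) / tau)
    neg = sum(math.exp(cosine(noisy_q, q) / tau) for q in other_queries)
    return -math.log(pos / (pos + neg))

clean = [0.9, 0.1, 0.0]
noisy = [0.7, 0.3, 0.1]          # embedding drifted by a typo
others = [[0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
print(alignment_loss(noisy, clean, others))
```

Minimizing this loss drives the drifted embedding back toward its root; since the document side is never touched, the existing index remains valid and no regeneration is needed.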
arXiv Detail & Related papers (2023-04-06T22:16:53Z)
- CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion [68.19934563919192]
We propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query.
Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
arXiv Detail & Related papers (2022-12-18T15:57:46Z)
- Error-Robust Retrieval for Chinese Spelling Check [43.56073620728942]
Chinese Spelling Check (CSC) aims to detect and correct error tokens in Chinese contexts.
Previous methods may not fully leverage the existing datasets.
We introduce our plug-and-play retrieval method with error-robust information for Chinese Spelling Check.
arXiv Detail & Related papers (2022-11-15T01:55:34Z)
- Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback [29.719150565643965]
This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval.
ANCE-PRF uses a BERT encoder that consumes the query and the top retrieved documents from a dense retrieval model, ANCE, and it learns to produce better query embeddings directly from relevance labels.
Analysis shows that the PRF encoder effectively captures the relevant and complementary information from PRF documents, while ignoring the noise with its learned attention mechanism.
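The PRF input construction described above can be sketched as below; the `[CLS]`/`[SEP]` concatenation convention and the choice of k are assumptions for illustration, and the real encoder consumes token IDs rather than raw strings:

```python
def build_prf_input(query, retrieved_passages, k=3):
    """Concatenate the query with its top-k first-round passages,
    BERT-style, so the PRF encoder can attend across both and produce
    a refined query embedding."""
    feedback = retrieved_passages[:k]
    return "[CLS] " + query + " " + " ".join("[SEP] " + p for p in feedback)

hits = ["dense retrievers map text to vectors",
        "typos degrade retrieval effectiveness",
        "wordpiece splits rare words",
        "an off-topic passage"]
print(build_prf_input("how do typos affect retrieval?", hits))
```

Because the encoder is trained directly from relevance labels, its attention can learn to weight the useful feedback passages and down-weight noisy ones, matching the analysis quoted above.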
arXiv Detail & Related papers (2021-08-30T18:10:26Z)
- BERTese: Learning to Speak to BERT [50.76152500085082]
We propose a method for automatically rewriting queries into "BERTese", a paraphrase query that is directly optimized towards better knowledge extraction.
We empirically show our approach outperforms competing baselines, obviating the need for complex pipelines.
arXiv Detail & Related papers (2021-03-09T10:17:22Z)
- DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding [90.85913515409275]
Recent studies on open-domain question answering have achieved prominent performance improvement using pre-trained language models such as BERT.
We propose DC-BERT, a contextual encoding framework that has dual BERT models: an online BERT which encodes the question only once, and an offline BERT which pre-encodes all the documents and caches their encodings.
On SQuAD Open and Natural Questions Open datasets, DC-BERT achieves 10x speedup on document retrieval, while retaining most (about 98%) of the QA performance.
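The decoupling described above can be sketched with a cache: documents are encoded once offline, and only the question is encoded at query time. The hash-based stand-in encoders and dot-product scorer are toy assumptions; the actual system uses dual BERT models and a learned scoring head:

```python
class DCBertSketch:
    """Toy sketch of DC-BERT's decoupled encoding: an 'offline' document
    encoder runs once per document and its outputs are cached, while an
    'online' question encoder runs once per query."""

    def __init__(self, encode_question, encode_document):
        self.encode_question = encode_question
        self.encode_document = encode_document
        self._doc_cache = {}

    def precompute(self, docs):
        # offline pass: encode and cache every document ahead of time
        for doc_id, text in docs.items():
            self._doc_cache[doc_id] = self.encode_document(text)

    def score(self, question, doc_id):
        # online pass: only the question is encoded at query time;
        # the document encoding is a cache lookup, hence the speedup
        q = self.encode_question(question)
        d = self._doc_cache[doc_id]
        return sum(a * b for a, b in zip(q, d))

# Toy stand-in encoder; the real system uses two BERT models.
def toy_encode(text):
    h = hash(text) % 1000
    return [h / 1000.0, 1.0 - h / 1000.0]

model = DCBertSketch(toy_encode, toy_encode)
model.precompute({"d1": "cached document one", "d2": "cached document two"})
print(model.score("an open-domain question", "d1"))
```

The design trades storage (one cached encoding per document) for query-time compute, which is where the reported 10x retrieval speedup comes from.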
arXiv Detail & Related papers (2020-02-28T08:18:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.