BERM: Training the Balanced and Extractable Representation for Matching to Improve Generalization Ability of Dense Retrieval
- URL: http://arxiv.org/abs/2305.11052v1
- Date: Thu, 18 May 2023 15:43:09 GMT
- Title: BERM: Training the Balanced and Extractable Representation for Matching to Improve Generalization Ability of Dense Retrieval
- Authors: Shicheng Xu, Liang Pang, Huawei Shen, Xueqi Cheng
- Abstract summary: We propose BERM, a novel method that improves the generalization of dense retrieval by capturing the matching signal.
Dense retrieval has shown promise in the first-stage retrieval process when trained on in-domain labeled datasets.
- Score: 54.66399120084227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dense retrieval has shown promise in the first-stage retrieval process when trained on in-domain labeled datasets. However, previous studies have found that dense retrieval generalizes poorly to unseen domains because it only weakly models the domain-invariant and interpretable feature (i.e., the matching signal between two texts, which is the essence of information retrieval). In this paper, we propose BERM, a novel method that improves the generalization of dense retrieval by capturing the matching signal. The matching signal has two properties: fully fine-grained expression and query-oriented saliency. Accordingly, BERM segments each passage into multiple units and imposes two unit-level requirements on the representation as training constraints: semantic unit balance and essential matching unit extractability. The unit-level view and balanced semantics make the representation express the text in a fine-grained manner. Essential matching unit extractability makes the passage representation sensitive to the given query, so that the pure matching information can be extracted from a passage with complex context. Experiments on BEIR show that our method can be effectively combined with different dense retrieval training methods (vanilla, hard negative mining, and knowledge distillation) to improve their generalization ability without any additional inference overhead or target-domain data.
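To make the two unit-level constraints concrete, here is a minimal, hypothetical PyTorch sketch of what such training losses could look like. The loss forms, helper names, and weighting are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of BERM's two unit-level constraints (assumed loss
# forms for illustration, not the authors' released code).
import torch
import torch.nn.functional as F

def semantic_unit_balance_loss(unit_embs, passage_emb):
    """Encourage the passage representation to draw evenly on all units.

    unit_embs:   (num_units, dim) embeddings of the passage's units
    passage_emb: (dim,) global passage embedding
    """
    weights = F.softmax(unit_embs @ passage_emb, dim=0)      # (num_units,)
    uniform = torch.full_like(weights, 1.0 / weights.numel())
    # Penalize deviation from a balanced (uniform) attention over units.
    return F.kl_div(weights.log(), uniform, reduction="sum")

def matching_unit_extractability_loss(unit_embs, query_emb, matching_unit_idx):
    """Make the passage representation single out, for a given query, the
    essential matching unit among all units of the passage."""
    scores = unit_embs @ query_emb                           # (num_units,)
    target = torch.tensor([matching_unit_idx])
    # The unit that actually matches the query should score highest.
    return F.cross_entropy(scores.unsqueeze(0), target)

# During training these would be added to the usual contrastive retrieval
# loss, e.g.: loss = retrieval_loss + a * balance + b * extract  (a, b assumed)
```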
Related papers
- Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a three-stage strategy: distributing language features, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence.
We introduce a novel retrieval unit, the proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level indexing on retrieval tasks (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
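As a rough illustration of the proposition-granularity idea above, the hypothetical sketch below splits documents into fine-grained units and indexes each unit separately. The naive sentence splitter and the off-the-shelf encoder are stand-ins for the paper's learned proposition extractor.

```python
# Hypothetical sketch of indexing at a finer-than-passage granularity.
# The naive sentence split stands in for the paper's proposition extractor.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any dense encoder

def index_by_unit(docs):
    """Embed each fine-grained unit separately instead of whole passages."""
    units = []
    for doc_id, doc in enumerate(docs):
        for unit in doc.split(". "):  # naive stand-in for proposition extraction
            if unit.strip():
                units.append((doc_id, unit.strip()))
    embs = encoder.encode([u for _, u in units], normalize_embeddings=True)
    return embs, units

def search(query, embs, units, k=5):
    q = encoder.encode(query, normalize_embeddings=True)
    top = np.argsort(embs @ q)[::-1][:k]
    return [units[i] for i in top]  # (source doc id, matched unit)
```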
- Revisiting Sparse Retrieval for Few-shot Entity Linking [33.15662306409253]
We propose an ELECTRA-based keyword extractor to denoise the mention context and construct a better query expression.
For training the extractor, we propose a distant supervision method that automatically generates training data from the overlapping tokens between mention contexts and entity descriptions (sketched after this entry).
Experimental results on the ZESHEL dataset demonstrate that the proposed method outperforms state-of-the-art models by a significant margin across all test domains.
arXiv Detail & Related papers (2023-10-19T03:51:10Z)
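The distant-supervision idea above, labeling keywords by token overlap between a mention context and the entity description, is simple enough to sketch. The tokenizer, stopword list, and binary labels below are illustrative assumptions.

```python
# Hypothetical sketch of distant supervision by token overlap: tokens shared
# between the mention context and the entity description are treated as
# silver keyword labels for training the extractor.
import re

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # assumed list

def tokenize(text):
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def silver_keyword_labels(mention_context, entity_description):
    """Label each context token 1 if it also appears in the entity description."""
    desc_tokens = set(tokenize(entity_description))
    return [(tok, int(tok in desc_tokens)) for tok in tokenize(mention_context)]

labels = silver_keyword_labels(
    "He played for the Chicago Bulls in 1996.",
    "The Chicago Bulls are an American professional basketball team.",
)
# -> [('he', 0), ('played', 0), ('for', 0), ('chicago', 1), ('bulls', 1), ...]
```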
- Topic-DPR: Topic-based Prompts for Dense Passage Retrieval [6.265789210037749]
We present Topic-DPR, a dense passage retrieval model that uses topic-based prompts.
We introduce a novel positive and negative sampling strategy, leveraging semi-structured data to boost dense retrieval efficiency.
arXiv Detail & Related papers (2023-10-10T13:45:24Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLMs) can be coarsely categorized into dense-vector and lexicon-based paradigms.
We propose UnifieR, a new learning framework that unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability (a rough sketch follows this entry).
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
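A dual-representing retriever can be pictured as one PLM backbone with two heads: a dense vector (e.g., from the [CLS] position) and a lexicon-style weight per vocabulary term. The hypothetical sketch below renders that idea; the head design is an assumption, not UnifieR's published architecture.

```python
# Hypothetical sketch of a dual-representing retriever: one PLM backbone,
# one dense head and one lexicon (per-vocabulary-term weight) head.
# Head design is an assumption, not UnifieR's released architecture.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DualRetriever(nn.Module):
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        hidden = self.backbone.config.hidden_size
        self.lexicon_head = nn.Linear(hidden, self.backbone.config.vocab_size)

    def forward(self, **inputs):
        hidden = self.backbone(**inputs).last_hidden_state   # (B, L, H)
        dense = hidden[:, 0]                                 # [CLS] dense vector
        # Max-pooled, non-negative term weights, as in lexicon-based retrievers.
        lexicon = torch.relu(self.lexicon_head(hidden)).max(dim=1).values
        return dense, lexicon                                # (B, H), (B, V)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DualRetriever()
dense, lexicon = model(**tok("what is dense retrieval?", return_tensors="pt"))
```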
- Dynamic Semantic Matching and Aggregation Network for Few-shot Intent Detection [69.2370349274216]
Few-shot Intent Detection is challenging due to the scarcity of available annotated utterances.
Semantic components are distilled from utterances via multi-head self-attention.
Our method provides a comprehensive matching measure to enhance representations of both labeled and unlabeled instances.
arXiv Detail & Related papers (2020-10-06T05:16:38Z)
- Unsupervised Summarization by Jointly Extracting Sentences and Keywords [12.387378783627762]
RepRank is an unsupervised graph-based ranking model for extractive multi-document summarization.
We show that salient sentences and keywords can be extracted in a joint, mutually reinforcing process using our learned representations (a rough sketch follows this entry).
Experiments on multiple benchmark datasets show that RepRank achieves the best or comparable performance in ROUGE.
arXiv Detail & Related papers (2020-09-16T05:58:00Z)
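The joint, mutually reinforcing extraction above can be pictured as a co-ranking loop: sentences are salient if they contain salient words, and words are salient if they occur in salient sentences. The sketch below is a generic rendering of that idea under assumed nonnegative affinities, not RepRank's exact formulation.

```python
# Hypothetical sketch of joint sentence/keyword ranking by mutual
# reinforcement. W must hold nonnegative affinities for the normalization
# below to behave; the iteration count is an assumption.
import numpy as np

def co_rank(W, iters=50):
    """W[i, j]: affinity between sentence i and word j (e.g., embedding similarity)."""
    n_sents, n_words = W.shape
    s = np.full(n_sents, 1.0 / n_sents)   # sentence saliency
    w = np.full(n_words, 1.0 / n_words)   # keyword saliency
    for _ in range(iters):
        s = W @ w
        s /= s.sum()                      # sentences reinforced by their words
        w = W.T @ s
        w /= w.sum()                      # words reinforced by their sentences
    return s, w

# Top-ranked sentences form the summary; top-ranked words are the keywords.
```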
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift in how neural extractive summarization systems are built.
We formulate the extractive summarization task as a semantic text matching problem (a rough sketch follows this entry).
We drive the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
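Framing extractive summarization as text matching means scoring whole candidate summaries against the source document in a shared semantic space, rather than scoring sentences independently. Below is a minimal hypothetical sketch with an off-the-shelf encoder; the brute-force candidate enumeration and model choice are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch of summarization-as-matching: embed the document and
# whole candidate summaries in one space and pick the best-matching candidate.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def best_summary(document, sentences, size=2):
    """Score each candidate sentence subset as one text against the document."""
    candidates = [" ".join(c) for c in combinations(sentences, size)]
    doc_emb = encoder.encode(document, convert_to_tensor=True)
    cand_embs = encoder.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(doc_emb, cand_embs)[0]   # (num_candidates,)
    return candidates[int(scores.argmax())]
```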
This list is automatically generated from the titles and abstracts of the papers on this site.