Conventional Contrastive Learning Often Falls Short: Improving Dense Retrieval with Cross-Encoder Listwise Distillation and Synthetic Data
- URL: http://arxiv.org/abs/2505.19274v1
- Date: Sun, 25 May 2025 19:06:19 GMT
- Title: Conventional Contrastive Learning Often Falls Short: Improving Dense Retrieval with Cross-Encoder Listwise Distillation and Synthetic Data
- Authors: Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Jimmy Lin
- Abstract summary: We investigate improving the retrieval effectiveness of embedding models through the lens of corpus-specific fine-tuning. We find that fine-tuning using the conventional InfoNCE contrastive loss often reduces effectiveness in state-of-the-art models. We use our approach to train an embedding model that achieves state-of-the-art effectiveness among BERT embedding models.
- Score: 43.81779293196647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate improving the retrieval effectiveness of embedding models through the lens of corpus-specific fine-tuning. Prior work has shown that fine-tuning with queries generated using a dataset's retrieval corpus can boost retrieval effectiveness for the dataset. However, we find that, surprisingly, fine-tuning using the conventional InfoNCE contrastive loss often reduces effectiveness in state-of-the-art models. To overcome this, we revisit cross-encoder listwise distillation and demonstrate that, unlike using contrastive learning alone, listwise distillation can help more consistently improve retrieval effectiveness across multiple datasets. Additionally, we show that synthesizing more training data using diverse query types (such as claims, keywords, and questions) yields greater effectiveness than using any single query type alone, regardless of the query type used in evaluation. Our findings further indicate that synthetic queries offer comparable utility to human-written queries for training. We use our approach to train an embedding model that achieves state-of-the-art effectiveness among BERT embedding models. We release our model and both query generation and training code to facilitate further research.
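To make the contrast between the two objectives concrete, here is a minimal PyTorch sketch of the conventional InfoNCE contrastive loss and a KL-based listwise distillation loss against cross-encoder teacher scores. The function names, temperature value, and toy tensors are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def infonce_loss(query_emb, doc_embs, temperature=0.05):
    """Conventional InfoNCE for one query.
    query_emb: (d,) embedding; doc_embs: (n, d) with the positive at row 0."""
    logits = doc_embs @ query_emb / temperature          # (n,) similarity scores
    target = torch.tensor([0])                           # positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)

def listwise_distillation_loss(query_emb, doc_embs, teacher_scores, temperature=0.05):
    """Listwise distillation: pull the student's score distribution over the
    candidate list toward the cross-encoder teacher's distribution via KL."""
    student_logp = F.log_softmax(doc_embs @ query_emb / temperature, dim=-1)
    teacher_p = F.softmax(teacher_scores, dim=-1)        # teacher scores -> soft labels
    return F.kl_div(student_logp, teacher_p, reduction="sum")

# Toy usage: one query scored against 8 candidate passages.
d, n = 768, 8
q = F.normalize(torch.randn(d), dim=0)
docs = F.normalize(torch.randn(n, d), dim=1)
teacher = torch.randn(n)                                 # hypothetical cross-encoder scores
print(infonce_loss(q, docs), listwise_distillation_loss(q, docs, teacher))
```

Note the difference in supervision: InfoNCE sees only a hard positive/negative split, while the listwise loss exploits the teacher's graded relevance over the whole candidate list, which is what the abstract credits for the more consistent gains.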
Related papers
- Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation [43.81779293196647]
We show that standard fine-tuning methods can unexpectedly degrade effectiveness rather than improve it, even in domain-specific scenarios.
We explore a training strategy that uses listwise distillation from a teacher cross-encoder, leveraging rich relevance signals to fine-tune the retriever.
Our results also reveal that synthetic queries can rival human-written queries in training utility (a query-synthesis sketch follows this entry).
arXiv Detail & Related papers (2025-02-27T03:07:49Z)
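As referenced in the entry above, here is a hedged sketch of synthesizing training queries of diverse types (questions, claims, keywords) from a target corpus. The prompt wording and the `generate` interface are assumptions for illustration, not the authors' exact prompts or code.

```python
# Hypothetical per-type prompt templates; the paper's exact prompts may differ.
QUERY_PROMPTS = {
    "question": "Write a natural question answered by this passage.\n\nPassage: {p}\nQuestion:",
    "claim":    "Write a factual claim supported by this passage.\n\nPassage: {p}\nClaim:",
    "keywords": "Write a short keyword query for which this passage is relevant.\n\nPassage: {p}\nKeywords:",
}

def synthesize_training_pairs(corpus, generate, per_type=1):
    """Yield (query, positive passage) pairs covering all query types.
    `generate` is any text-in/text-out LLM callable (assumed interface)."""
    for passage in corpus:
        for qtype, template in QUERY_PROMPTS.items():
            for _ in range(per_type):
                query = generate(template.format(p=passage)).strip()
                yield {"query": query, "positive": passage, "type": qtype}

# Stub generator for a dry run; swap in a real LLM client.
corpus = ["Dense retrievers encode queries and passages into one vector space."]
pairs = list(synthesize_training_pairs(corpus, generate=lambda prompt: "placeholder"))
```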
- READ: Reinforcement-based Adversarial Learning for Text Classification with Limited Labeled Data [7.152603583363887]
Pre-trained transformer models such as BERT have shown massive gains across many text classification tasks.
This paper proposes a method that encapsulates reinforcement learning-based text generation and semi-supervised adversarial learning approaches.
Our method, READ (Reinforcement-based Adversarial learning), utilizes an unlabeled dataset to generate diverse synthetic text through reinforcement learning.
arXiv Detail & Related papers (2025-01-14T11:39:55Z)
- PairDistill: Pairwise Relevance Distillation for Dense Retrieval [35.067998820937284]
This paper introduces Pairwise Relevance Distillation (PairDistill) to leverage pairwise reranking.
It offers fine-grained distinctions between similarly relevant documents to enrich the training of dense retrieval models.
Our experiments demonstrate that PairDistill outperforms existing methods, achieving new state-of-the-art results across multiple benchmarks (a pairwise-loss sketch follows this entry).
arXiv Detail & Related papers (2024-10-02T09:51:42Z)
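A minimal sketch of the pairwise idea behind the entry above: fit the student's score margin for a document pair to the reranker's pairwise preference. The RankNet-style binary cross-entropy below is one common way to distill pairwise comparisons and is an assumption about the general recipe, not PairDistill's exact loss.

```python
import torch
import torch.nn.functional as F

def pairwise_distillation_loss(student_si, student_sj, teacher_pref):
    """RankNet-style pairwise distillation: fit the sigmoid of the student's
    score margin (doc i vs. doc j) to the reranker's preference probability."""
    margin = student_si - student_sj
    return F.binary_cross_entropy_with_logits(margin, teacher_pref)

# Toy usage: the pairwise reranker says doc i beats doc j with p = 0.9.
loss = pairwise_distillation_loss(torch.tensor(2.1), torch.tensor(1.4), torch.tensor(0.9))
```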
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval (a retrieval-augmentation sketch follows this entry).
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
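Illustrative-only sketch of the general pattern in the entry above: retrieve task-relevant snippets with a search engine and prepend them to the input of a frozen pre-trained model. Both `model` and `search` are placeholder callables, not the paper's actual pipeline or APIs.

```python
def retrieval_augmented_predict(model, search, task_input, k=5):
    """Prepend search-engine snippets to the input of a frozen pre-trained model."""
    snippets = search(task_input, top_k=k)       # e.g., a web-search client (assumed API)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = f"Context:\n{context}\n\nInput: {task_input}\nAnswer:"
    return model(prompt)
```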
- CORE: A Retrieve-then-Edit Framework for Counterfactual Data Generation [91.16551253297588]
COunterfactual Generation via Retrieval and Editing (CORE) is a retrieval-augmented generation framework for creating diverse counterfactual perturbations for training.
CORE first performs a dense retrieval over a task-related unlabeled text corpus using a learned bi-encoder.
CORE then incorporates the retrieved texts into prompts to a large language model with few-shot learning capabilities for counterfactual editing (a retrieve-then-edit sketch follows this entry).
arXiv Detail & Related papers (2022-10-10T17:45:38Z)
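Hedged sketch of the retrieve-then-edit pattern described above: dense-retrieve related unlabeled texts, then prompt an LLM with them as context for a minimal counterfactual edit. `retrieve` and `generate` are assumed interfaces, not CORE's released code.

```python
def retrieve_then_edit(text, target_label, retrieve, generate, k=4):
    """CORE-style pipeline: retrieve related unlabeled texts with a learned
    bi-encoder, then prompt an LLM to produce a minimally edited counterfactual."""
    excerpts = retrieve(text, top_k=k)           # dense bi-encoder retrieval (assumed API)
    demos = "\n".join(f"- {e}" for e in excerpts)
    prompt = (
        f"Related excerpts:\n{demos}\n\n"
        f"Rewrite the text below so its label becomes '{target_label}', "
        f"changing as little as possible.\nText: {text}\nRewrite:"
    )
    return generate(prompt)
```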
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces (an iterative-imputation sketch follows this entry).
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
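A compact sketch of generalized iterative (column-wise) imputation as described above: repeatedly fit a per-column model on the currently filled values and re-predict the missing entries. HyperImpute additionally searches over model classes per column; a single random-forest regressor stands in for that automatic selection here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def iterative_impute(X, n_iters=5):
    """Column-wise iterative imputation. X: float array with np.nan gaps."""
    X = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):                  # warm start with column means
        X[missing[:, j], j] = col_means[j]
    for _ in range(n_iters):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            others = np.delete(np.arange(X.shape[1]), j)
            # Stand-in for HyperImpute's per-column automatic model selection.
            model = RandomForestRegressor(n_estimators=50, random_state=0)
            model.fit(X[~rows][:, others], X[~rows, j])   # fit on observed rows
            X[rows, j] = model.predict(X[rows][:, others])
    return X
```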
- Learning to Generate Synthetic Training Data using Gradient Matching and Implicit Differentiation [77.34726150561087]
This article explores various data distillation techniques that can reduce the amount of data required to successfully train deep networks.
Inspired by recent ideas, we suggest new data distillation techniques based on generative teaching networks, gradient matching, and the Implicit Function Theorem (a gradient-matching sketch follows this entry).
arXiv Detail & Related papers (2022-03-16T11:45:32Z)
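Minimal sketch of the gradient-matching objective mentioned above: optimize a small learnable synthetic batch so the network's loss gradient on it matches the gradient on real data. The cosine-distance form follows the common formulation; layer weighting, inner training loops, and the implicit-differentiation variant are omitted.

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(model, real_x, real_y, syn_x, syn_y):
    """Cosine distance between loss gradients on a real batch and on the
    synthetic batch; minimizing it w.r.t. syn_x distills the real data.
    syn_x must be a learnable tensor, e.g. torch.randn(..., requires_grad=True)."""
    g_real = torch.autograd.grad(F.cross_entropy(model(real_x), real_y),
                                 model.parameters())
    g_syn = torch.autograd.grad(F.cross_entropy(model(syn_x), syn_y),
                                model.parameters(), create_graph=True)
    return sum(1 - F.cosine_similarity(gr.detach().flatten(), gs.flatten(), dim=0)
               for gr, gs in zip(g_real, g_syn))  # grads flow to syn_x via create_graph
```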
- Bootstrapping Relation Extractors using Syntactic Search by Examples [47.11932446745022]
We propose a process for bootstrapping training datasets which can be performed quickly by non-NLP-experts.
We take advantage of search engines over syntactic graphs, which expose a friendly by-example syntax.
We show that the resulting models are competitive with models trained on manually annotated data and on data obtained from distant supervision.
arXiv Detail & Related papers (2021-02-09T18:17:59Z)