From Distillation to Hard Negative Sampling: Making Sparse Neural IR
Models More Effective
- URL: http://arxiv.org/abs/2205.04733v2
- Date: Thu, 12 May 2022 14:58:24 GMT
- Title: From Distillation to Hard Negative Sampling: Making Sparse Neural IR
Models More Effective
- Authors: Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant
- Abstract summary: We build on SPLADE -- a sparse expansion-based retriever -- and show to what extent it can benefit from the same training improvements as dense models.
We study the link between effectiveness and efficiency in in-domain and zero-shot settings.
- Score: 15.542082655342476
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Neural retrievers based on dense representations combined with Approximate
Nearest Neighbors search have recently received a lot of attention, owing their
success to distillation and/or better sampling of examples for training --
while still relying on the same backbone architecture. In the meantime, sparse
representation learning fueled by traditional inverted indexing techniques has
seen a growing interest, inheriting from desirable IR priors such as explicit
lexical matching. While some architectural variants have been proposed, less
effort has been put into the training of such models. In this work, we build on
SPLADE -- a sparse expansion-based retriever -- and show to what extent it can
benefit from the same training improvements as dense models, by studying the
effect of distillation, hard-negative mining, and Pre-trained Language Model
initialization. We furthermore study the link between effectiveness and
efficiency in in-domain and zero-shot settings, leading to state-of-the-art
results in both scenarios for sufficiently expressive models.
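A minimal sketch of the training setup described above -- distillation from a cross-encoder teacher combined with mined hard negatives, plus SPLADE's sparsity regularization -- is given below. It follows the commonly used MarginMSE formulation; the tensor names, the regularization weight, and the pooling that produces the sparse vectors are illustrative assumptions, not the authors' exact code.

```python
# Hedged sketch: MarginMSE distillation with mined hard negatives for a
# SPLADE-style sparse retriever. All tensors are placeholders.
import torch
import torch.nn.functional as F

def flops_regularizer(reps: torch.Tensor) -> torch.Tensor:
    """FLOPS-style sparsity penalty: squared mean activation per vocabulary term."""
    return (reps.abs().mean(dim=0) ** 2).sum()

def margin_mse_step(q_rep, pos_rep, neg_rep, teacher_pos, teacher_neg, lam=1e-3):
    """q_rep, pos_rep, neg_rep: (batch, vocab) sparse expansion vectors from the
    student; teacher_pos, teacher_neg: (batch,) cross-encoder scores for the
    positive and the mined hard negative of each query."""
    student_pos = (q_rep * pos_rep).sum(dim=-1)   # dot-product relevance scores
    student_neg = (q_rep * neg_rep).sum(dim=-1)
    # Distillation: match the teacher's positive-vs-negative score margin.
    loss_kd = F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)
    # Sparsity regularization keeps the expansion vectors efficiently indexable.
    loss_reg = sum(flops_regularizer(r) for r in (q_rep, pos_rep, neg_rep))
    return loss_kd + lam * loss_reg
```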
Related papers
- Adv-KD: Adversarial Knowledge Distillation for Faster Diffusion Sampling [2.91204440475204]
Diffusion Probabilistic Models (DPMs) have emerged as a powerful class of deep generative models.
They rely on sequential denoising steps during sample generation.
We propose a novel method that integrates denoising phases directly into the model's architecture.
arXiv Detail & Related papers (2024-05-31T08:19:44Z)
- Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler [16.13996677489119]
Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully.
Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models.
We show that a proposed system substantially benefits from using multiple databases for training, and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions.
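The Heun-based sampler named in the title is, at its core, a second-order predictor-corrector ODE step. The sketch below is a generic Heun (improved Euler) update, assuming a black-box `drift` function for the probability-flow ODE; it is not the paper's exact reverse process.

```python
# Generic Heun (improved Euler) step, the kind of second-order update used by
# Heun-based diffusion samplers. `drift` is a placeholder callable.
import numpy as np

def heun_step(x: np.ndarray, t: float, t_next: float, drift) -> np.ndarray:
    dt = t_next - t
    d1 = drift(x, t)                    # predictor slope at the current point
    x_euler = x + dt * d1               # plain Euler prediction
    d2 = drift(x_euler, t_next)         # corrector slope at the predicted point
    return x + dt * 0.5 * (d1 + d2)     # average the two slopes
```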
arXiv Detail & Related papers (2023-12-05T11:40:38Z)
- Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders [63.28408887247742]
We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives.
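A minimal sketch of the in-batch-negatives objective from that recipe, assuming the dense encoder has already produced query and passage embeddings (LoRA adapters and the MSMARCO data pipeline are omitted; names are placeholders):

```python
# In-batch negatives: every other passage in the batch acts as a negative for a
# given query. Embeddings are placeholders for a dense encoder's output.
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.05):
    """q_emb, p_emb: (batch, dim); row i of p_emb is the positive passage for query i."""
    scores = q_emb @ p_emb.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q_emb.size(0), device=q_emb.device)   # diagonal = positives
    return F.cross_entropy(scores, labels)
```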
arXiv Detail & Related papers (2023-11-16T10:42:58Z)
- Bayesian sparsification for deep neural networks with Bayesian model reduction [0.6144680854063939]
We advocate for the use of Bayesian model reduction (BMR) as a more efficient alternative for pruning model weights.
BMR allows a post-hoc elimination of redundant model weights based on the posterior estimates under a straightforward (non-hierarchical) generative model.
We illustrate the potential of BMR across various deep learning architectures, from classical networks like LeNet to modern frameworks such as Vision Transformers and MLP-Mixers.
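The exact BMR free-energy criterion is not reproduced here; as a simplified, hedged stand-in, the sketch below prunes weights post hoc from a mean-field Gaussian posterior using a signal-to-noise threshold (a common heuristic, not the paper's Bayesian model reduction computation):

```python
# Simplified post-hoc pruning from a mean-field Gaussian posterior. `mu` and
# `sigma` are placeholder per-weight posterior means and standard deviations;
# the signal-to-noise rule stands in for BMR's free-energy test.
import torch

def snr_prune_mask(mu: torch.Tensor, sigma: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    """Keep a weight only if its posterior mean exceeds `threshold` standard deviations."""
    snr = mu.abs() / sigma.clamp_min(1e-12)
    return snr > threshold                 # True = keep, False = prune

def prune_weights(weight: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    return weight * mask                   # zero out the pruned entries post hoc
```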
arXiv Detail & Related papers (2023-09-21T14:10:47Z)
- ExposureDiffusion: Learning to Expose for Low-light Image Enhancement [87.08496758469835]
This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model.
Our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models.
The proposed framework works with real-paired datasets, SOTA noise models, and different backbone networks.
arXiv Detail & Related papers (2023-07-15T04:48:35Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT, which overcomes these limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Distributionally Robust Recurrent Decoders with Random Network Distillation [93.10261573696788]
We propose a method based on OOD detection with Random Network Distillation to allow an autoregressive language model to disregard OOD context during inference.
We apply our method to a GRU architecture, demonstrating improvements on multiple language modeling (LM) datasets.
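A compact, hedged sketch of the Random Network Distillation ingredient: a frozen, randomly initialized target network and a trained predictor, whose prediction error serves as the OOD score. Layer sizes are illustrative and the GRU context encoder is omitted.

```python
# Random Network Distillation for OOD scoring: the predictor learns to mimic a
# frozen random target on in-distribution data, so a large prediction error on
# a new input signals out-of-distribution context. Dimensions are illustrative.
import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, in_dim: int = 256, out_dim: int = 128):
        super().__init__()
        make = lambda: nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))
        self.target, self.predictor = make(), make()
        for p in self.target.parameters():      # the target stays random and frozen
            p.requires_grad_(False)

    def ood_score(self, x: torch.Tensor) -> torch.Tensor:
        """Per-example squared prediction error; higher means more out-of-distribution."""
        return ((self.predictor(x) - self.target(x)) ** 2).mean(dim=-1)

    def training_loss(self, x: torch.Tensor) -> torch.Tensor:
        """Minimized on in-distribution data only."""
        return self.ood_score(x).mean()
```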
arXiv Detail & Related papers (2021-10-25T19:26:29Z)
- SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval [11.38022203865326]
The SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches.
We modify the pooling mechanism, benchmark a model solely based on document expansion, and introduce models trained with distillation.
Overall, SPLADE is considerably improved, with more than 9% gains in NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
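The max-pooling variant of SPLADE's log-saturated term weighting can be sketched as follows; the `bert-base-uncased` checkpoint is only a stand-in for the fine-tuned masked language model, and the Hugging Face transformers library is assumed to be available.

```python
# SPLADE-style sparse representation: log-saturated MLM logits max-pooled over
# token positions. The checkpoint is a generic MLM stand-in, not a trained SPLADE model.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def splade_rep(text: str) -> torch.Tensor:
    batch = tokenizer(text, return_tensors="pt")
    logits = model(**batch).logits                 # (1, seq_len, vocab_size)
    weights = torch.log1p(torch.relu(logits))      # log-saturation dampens large term weights
    mask = batch["attention_mask"].unsqueeze(-1)   # ignore padding positions
    return (weights * mask).max(dim=1).values      # max pooling over tokens -> (1, vocab_size)
```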
arXiv Detail & Related papers (2021-09-21T10:43:42Z)
- Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
- Nested-Wasserstein Self-Imitation Learning for Sequence Generation [158.19606942252284]
We propose the concept of nested-Wasserstein distance for distributional semantic matching.
A novel nested-Wasserstein self-imitation learning framework is developed, encouraging the model to exploit historical high-rewarded sequences.
arXiv Detail & Related papers (2020-01-20T02:19:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.