Related papers: Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval

Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval

URL: http://arxiv.org/abs/2506.00363v1
Date: Sat, 31 May 2025 03:06:09 GMT
Title: Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval
Authors: Yubai Wei, Jiale Han, Yi Yang,
Abstract summary: BMEmbed is a novel method for adapting general-purpose text embedding models to private datasets.<n>We construct supervisory signals from the ranking of keyword-based retrieval results to facilitate model adaptation.<n>We evaluate BMEmbed across a range of domains, datasets, and models, showing consistent improvements in retrieval performance.
Score: 19.57735892785756
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text embedding models play a cornerstone role in AI applications, such as retrieval-augmented generation (RAG). While general-purpose text embedding models demonstrate strong performance on generic retrieval benchmarks, their effectiveness diminishes when applied to private datasets (e.g., company-specific proprietary data), which often contain specialized terminology and lingo. In this work, we introduce BMEmbed, a novel method for adapting general-purpose text embedding models to private datasets. By leveraging the well-established keyword-based retrieval technique (BM25), we construct supervisory signals from the ranking of keyword-based retrieval results to facilitate model adaptation. We evaluate BMEmbed across a range of domains, datasets, and models, showing consistent improvements in retrieval performance. Moreover, we provide empirical insights into how BM25-based signals contribute to improving embeddings by fostering alignment and uniformity, highlighting the value of this approach in adapting models to domain-specific data. We release the source code available at https://github.com/BaileyWei/BMEmbed for the research community.

Related papers

Similarity-Based Domain Adaptation with LLMs [13.692329347889212]
Unsupervised domain adaptation leverages abundant labeled data from various source domains to generalize onto unlabeled target data.<n>This paper introduces a simple framework that utilizes the impressive generalization capabilities of Large Language Models (LLMs) for target data annotation.<n>Our framework achieves impressive performance, specifically, 2.44% accuracy improvement when compared to the SOTA method.
arXiv Detail & Related papers (2025-03-07T09:51:07Z)
Consistency Regularization for Generalizable Source-free Domain Adaptation [62.654883736925456]
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset. Existing SFDA methods ONLY assess their adapted models on the target training set, neglecting the data from unseen but identically distributed testing sets. We propose a consistency regularization framework to develop a more generalizable SFDA method.
arXiv Detail & Related papers (2023-08-03T07:45:53Z)
Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models. We introduce textitCLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models. Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
Exploring Category Structure with Contextual Language Models and Lexical Semantic Networks [0.0]
We test a wider array of methods for probing CLMs for predicting typicality scores. Our experiments, using BERT, show the importance of using the right type of CLM probes. Results highlight the importance of polysemy in this task.
arXiv Detail & Related papers (2023-02-14T09:57:23Z)
OPTION: OPTImization Algorithm Benchmarking ONtology [4.060078409841919]
OPTION (OPTImization algorithm benchmarking ONtology) is a semantically rich, machine-readable data model for benchmarking algorithms. Our ontology provides the vocabulary needed for semantic annotation of the core entities involved in the benchmarking process. It also provides means for automated data integration, improved interoperability, powerful querying capabilities and reasoning.
arXiv Detail & Related papers (2021-04-24T06:11:30Z)
Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance. Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand. We evaluate our method in both in-domain and out-of-domain settings of text-to-Query parsing on the standard benchmarks of GeoQuery and Spider.
arXiv Detail & Related papers (2021-04-12T21:24:02Z)
SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation. Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings. We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data. We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
BREEDS: Benchmarks for Subpopulation Shift [98.90314444545204]
We develop a methodology for assessing the robustness of models to subpopulation shift. We leverage the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions. Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity.
arXiv Detail & Related papers (2020-08-11T17:04:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.