GISTEmbed: Guided In-sample Selection of Training Negatives for Text
Embedding Fine-tuning
- URL: http://arxiv.org/abs/2402.16829v1
- Date: Mon, 26 Feb 2024 18:55:15 GMT
- Title: GISTEmbed: Guided In-sample Selection of Training Negatives for Text
Embedding Fine-tuning
- Authors: Aivin V. Solatorio
- Abstract summary: GISTEmbed is a novel strategy that enhances in-batch negative selection during contrastive training through a guide model.
Benchmarked against the Massive Text Embedding Benchmark (MTEB), GISTEmbed showcases consistent performance improvements across various model sizes.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Embedding models are integral to AI applications like semantic search,
personalized recommendations, and retrieval augmented generation for LLMs,
necessitating high-quality training data. However, the limited scalability of
manual data curation prompts the need for automated methods to ensure data
integrity. Traditional unsupervised triplet mining automates training data
generation, crucial for embedding model training, yet inadvertently injects
biases and noise, thereby degrading model performance. Addressing this, we
introduce GISTEmbed, a novel strategy that enhances in-batch negative selection
during contrastive training through a guide model. This approach departs from
reliance on random sampling and the assumption that every in-batch negative is
equally useful, significantly reducing noise from data quality issues and improving model
fine-tuning. Benchmarked against the Massive Text Embedding Benchmark (MTEB),
GISTEmbed showcases consistent performance improvements across various model
sizes and achieves state-of-the-art results in select categories. This
framework enables significant enhancements for smaller models by leveraging the
capabilities of powerful yet resource-intensive large models. GISTEmbed can
potentially revolutionize the creation of highly efficient, smaller models,
democratizing access to advanced AI technologies. Making these technologies
more accessible and cost-effective, especially for applications constrained by
resources, significantly expands the impact and accessibility of
state-of-the-art AI solutions across diverse sectors.
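
To make the selection rule concrete, the following is a minimal PyTorch sketch of guide-filtered in-batch negatives during contrastive fine-tuning. The function name, the masking rule (dropping any batch negative that the guide model scores at least as high as the true positive), and the temperature value are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch: contrastive loss with guide-model-filtered in-batch negatives.
import torch
import torch.nn.functional as F

def guided_contrastive_loss(q_emb, p_emb, guide_q, guide_p, temperature=0.05):
    """q_emb, p_emb: (B, D) query/positive embeddings from the model being tuned.
    guide_q, guide_p: (B, D) embeddings of the same texts from a frozen guide model."""
    q_emb, p_emb = F.normalize(q_emb, dim=-1), F.normalize(p_emb, dim=-1)
    guide_q, guide_p = F.normalize(guide_q, dim=-1), F.normalize(guide_p, dim=-1)

    sim = q_emb @ p_emb.T / temperature            # (B, B) student similarities
    guide_sim = guide_q @ guide_p.T                # (B, B) guide similarities
    pos_guide = guide_sim.diagonal().unsqueeze(1)  # guide score of each true pair

    # Instead of treating every in-batch item as an equally useful negative,
    # mask "negatives" the guide scores at least as high as the true positive:
    # they are likely false negatives injected by noisy mined data.
    false_neg = guide_sim >= pos_guide
    false_neg.fill_diagonal_(False)                # never mask the positive itself
    sim = sim.masked_fill(false_neg, float("-inf"))

    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```

The design point this illustrates is that the guide model only vetoes suspect negatives; gradients still flow through the smaller model being fine-tuned, which is how a compact model can borrow the judgment of a larger, resource-intensive one without inheriting its inference cost.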
Related papers
- YaART: Yet Another ART Rendering Technology [119.09155882164573]
This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences.
We analyze how these choices affect both the efficiency of the training process and the quality of the generated images.
We demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets.
arXiv Detail & Related papers (2024-04-08T16:51:19Z) - The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z) - Data Quality Aware Approaches for Addressing Model Drift of Semantic
Segmentation Models [1.6385815610837167]
This study investigates two prominent quality-aware strategies to combat model drift.
The first leverages image quality assessment metrics to select high-quality training data, improving model robustness.
The second uses learned feature vectors from existing models to guide the selection of future data, aligning it with the model's prior knowledge.
arXiv Detail & Related papers (2024-02-11T18:01:52Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - Deep autoregressive density nets vs neural ensembles for model-based
offline reinforcement learning [2.9158689853305693]
We consider a model-based reinforcement learning algorithm that infers the system dynamics from the available data and performs policy optimization on imaginary model rollouts.
This approach is vulnerable to the policy exploiting model errors, which can lead to catastrophic failures on the real system.
We show that better performance can be obtained with a single well-calibrated autoregressive model on the D4RL benchmark.
arXiv Detail & Related papers (2024-02-05T10:18:15Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language
Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Fantastic Gains and Where to Find Them: On the Existence and Prospect of
General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z) - Enabling Resource-efficient AIoT System with Cross-level Optimization: A
survey [20.360136850102833]
This survey aims to provide a broader optimization space for more flexible resource-performance tradeoffs.
By consolidating problems and techniques scattered over diverse levels, we aim to help readers understand their connections and stimulate further discussions.
arXiv Detail & Related papers (2023-09-27T08:04:24Z) - INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of
Language Models [40.54353850357839]
We show how we can employ submodular optimization to select highly representative subsets of the training corpora (a hedged sketch of this style of greedy selection follows this list).
We show that the resulting models achieve up to $\sim 99\%$ of the performance of the fully-trained models.
arXiv Detail & Related papers (2023-05-11T09:24:41Z)