Jasper and Stella: distillation of SOTA embedding models
- URL: http://arxiv.org/abs/2412.19048v2
- Date: Thu, 23 Jan 2025 16:01:22 GMT
- Title: Jasper and Stella: distillation of SOTA embedding models
- Authors: Dun Zhang, Jiacheng Li, Ziyang Zeng, Fulong Wang
- Abstract summary: We propose a novel multi-stage distillation framework that enables a smaller student embedding model to distill multiple teacher embedding models. We utilize Matryoshka Representation Learning (MRL) to reduce the vector dimensionality of the student embedding model effectively. Our student model, named Jasper, has 2 billion parameters, is built upon the Stella embedding model, and obtained the No. 3 position on the Massive Text Embedding Benchmark leaderboard.
- Score: 8.708650717134008
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: A crucial component in many deep learning applications, such as Frequently Asked Questions (FAQ) and Retrieval-Augmented Generation (RAG), is dense retrieval. In this process, embedding models transform raw text into numerical vectors. However, the embedding models that currently excel on text embedding benchmarks, like the Massive Text Embedding Benchmark (MTEB), often have numerous parameters and high vector dimensionality. This poses challenges for their application in real-world scenarios. To address this issue, we propose a novel multi-stage distillation framework that enables a smaller student embedding model to distill multiple larger teacher embedding models through three carefully designed losses. Meanwhile, we utilize Matryoshka Representation Learning (MRL) to reduce the vector dimensionality of the student embedding model effectively. Our student model named Jasper with 2 billion parameters, built upon the Stella embedding model, obtained the No.3 position on the MTEB leaderboard (as of December 24, 2024), achieving an average 71.54 score across 56 datasets. We have released the model and data on the Hugging Face Hub (https://huggingface.co/infgrad/jasper_en_vision_language_v1) (https://huggingface.co/datasets/infgrad/jasper_text_distill_dataset), and the training codes are available in this project repository (https://github.com/NLPJCL/RAG-Retrieval).
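To make the training objective concrete, below is a minimal PyTorch sketch of the kind of loss the abstract describes: cosine-similarity distillation of the student embedding against the concatenation of several teacher embeddings, applied to every Matryoshka prefix of the student vector. The dimensions, the per-prefix projection heads, and the single cosine term are illustrative assumptions, not the paper's exact three-loss recipe.
```python
# Hedged sketch: cosine distillation of a student embedding against multiple
# teachers, applied to Matryoshka (MRL) prefixes. All sizes and the use of a
# single cosine loss with per-prefix projection heads are assumptions for
# illustration, not the paper's exact recipe.
import torch
import torch.nn.functional as F

TEACHER_DIMS = [4096, 1536]    # hypothetical teacher embedding sizes
MRL_DIMS = [256, 512, 1024]    # nested student dimensions trained jointly
STUDENT_DIM = MRL_DIMS[-1]

# One projection head per Matryoshka prefix, mapping it into the
# concatenated teacher space.
heads = torch.nn.ModuleList(
    torch.nn.Linear(d, sum(TEACHER_DIMS)) for d in MRL_DIMS
)

def distill_loss(student_vecs, teacher_vecs_list):
    """Average cosine-distillation loss over all Matryoshka prefixes."""
    target = torch.cat(teacher_vecs_list, dim=-1)   # (batch, sum(TEACHER_DIMS))
    loss = 0.0
    for d, head in zip(MRL_DIMS, heads):
        pred = head(student_vecs[:, :d])            # truncated prefix -> teacher space
        loss = loss + (1.0 - F.cosine_similarity(pred, target, dim=-1).mean())
    return loss / len(MRL_DIMS)

# Random stand-ins for the student and teacher encoder outputs:
student = torch.randn(8, STUDENT_DIM)
teachers = [torch.randn(8, d) for d in TEACHER_DIMS]
distill_loss(student, teachers).backward()
```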
Related papers
- Training Sparse Mixture Of Experts Text Embedding Models [0.0]
Transformer-based text embedding models have improved their performance on benchmarks like MIRACL and BEIR by increasing their parameter counts.
This scaling approach introduces significant deployment challenges, including increased inference latency and memory usage.
We introduce Nomic Embed v2, the first general purpose MoE text embedding model.
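For readers unfamiliar with the term, the sketch below shows a generic top-k mixture-of-experts feed-forward layer; it only illustrates what "MoE" means architecturally and is not Nomic Embed v2's actual design. All sizes are made up.
```python
# Generic top-k mixture-of-experts FFN layer (illustration only, not the
# Nomic Embed v2 architecture). Each token is routed to its k highest-scoring
# experts and their outputs are mixed by the routing weights.
import torch
import torch.nn.functional as F

class TopKMoEFFN(torch.nn.Module):
    def __init__(self, dim=768, num_experts=8, k=2):
        super().__init__()
        self.router = torch.nn.Linear(dim, num_experts)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(dim, 4 * dim),
                torch.nn.GELU(),
                torch.nn.Linear(4 * dim, dim),
            )
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x):                              # x: (tokens, dim)
        gate = F.softmax(self.router(x), dim=-1)       # routing probabilities
        topw, topi = gate.topk(self.k, dim=-1)         # keep k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                hit = topi[:, slot] == e               # tokens routed to expert e
                if hit.any():
                    w = topw[hit, slot].unsqueeze(-1)  # (n_hit, 1) mixing weights
                    out[hit] = out[hit] + w * expert(x[hit])
        return out

layer = TopKMoEFFN()
print(layer(torch.randn(4, 768)).shape)                # torch.Size([4, 768])
```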
arXiv Detail & Related papers (2025-02-11T21:36:31Z) - Can bidirectional encoder become the ultimate winner for downstream applications of foundation models? [1.8120356834558644]
Foundational models have the characteristics of pre-training, transfer learning, and self-supervised learning. BERT broke through the limitation of using only unidirectional language modeling in pre-training by adopting a masked language model. This article analyzes unidirectional and bidirectional models based on GPT and BERT and compares their differences according to each model's purpose.
arXiv Detail & Related papers (2024-11-27T03:31:14Z) - Optimizing Parking Space Classification: Distilling Ensembles into Lightweight Classifiers [0.0]
We propose a robust ensemble of classifiers to serve as Teacher models in image-based parking space classification.
These Teacher models are distilled into lightweight and specialized Student models that can be deployed directly on edge devices.
Our results show that the Student models, with 26 times fewer parameters than the Teacher models, achieved an average accuracy of 96.6% on the target test datasets.
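A standard way to distill such a teacher ensemble into one student, shown here as a generic sketch rather than the authors' exact setup, blends hard-label cross-entropy with temperature-scaled soft targets averaged over the teachers:
```python
# Standard ensemble knowledge-distillation loss (Hinton-style soft targets).
# Temperature, weighting, and the averaging of teacher logits are assumptions
# for illustration, not this paper's exact configuration.
import torch
import torch.nn.functional as F

def ensemble_kd_loss(student_logits, teacher_logits_list, labels,
                     temperature=4.0, alpha=0.5):
    # Average the softened teacher distributions over the ensemble.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: binary parking-space classification (occupied / empty).
student_logits = torch.randn(16, 2, requires_grad=True)
teachers = [torch.randn(16, 2) for _ in range(3)]
labels = torch.randint(0, 2, (16,))
ensemble_kd_loss(student_logits, teachers, labels).backward()
```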
arXiv Detail & Related papers (2024-10-07T20:29:42Z) - VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We study the potential for building universal embeddings capable of handling a wide range of downstream tasks.
We build a series of VLM2Vec models on SoTA VLMs such as Phi-3.5-V and LLaVA-1.6 and evaluate them on MMEB's evaluation split.
Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models.
arXiv Detail & Related papers (2024-10-07T16:14:05Z) - Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models [5.2094499417507105]
This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models.
At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of their size on the MTEB Retrieval leaderboard.
arXiv Detail & Related papers (2024-05-08T19:05:18Z) - Foundational GPT Model for MEG [3.524869467682149]
We propose two classes of deep learning foundational models that can be trained using forecasting of unlabelled brain signals.
First, we consider a modified WaveNet; second, we consider a modified Transformer-based (GPT-2) model.
We compare the performance of these deep learning models with standard linear autoregressive (AR) modelling on MEG data.
arXiv Detail & Related papers (2024-04-14T13:48:24Z) - Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
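One simple way to realize such a dense-to-sparse mapping is sketched below; the single learned projection onto the vocabulary and the top-k pruning used for expansion control are assumptions for illustration, not this paper's (or LexLIP/STAIR's) actual method.
```python
# Hedged sketch: project a frozen dense embedding onto vocabulary-sized sparse
# lexical weights, keeping only the top-k terms per vector.
import torch
import torch.nn.functional as F

class DenseToSparse(torch.nn.Module):
    def __init__(self, dense_dim=768, vocab_size=30522, keep=64):
        super().__init__()
        self.to_vocab = torch.nn.Linear(dense_dim, vocab_size)
        self.keep = keep                              # terms kept per vector

    def forward(self, dense_vec):                     # (batch, dense_dim)
        scores = F.relu(self.to_vocab(dense_vec))     # non-negative term weights
        topk = scores.topk(self.keep, dim=-1)
        sparse = torch.zeros_like(scores)
        sparse.scatter_(-1, topk.indices, topk.values)
        return sparse                                 # mostly zeros -> inverted-index friendly

mapper = DenseToSparse()
frozen_dense = torch.randn(2, 768)                    # stand-in for a frozen dense model's output
print((mapper(frozen_dense) > 0).sum(dim=-1))         # at most `keep` active terms each
```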
arXiv Detail & Related papers (2024-02-27T14:21:56Z) - Asymmetric Masked Distillation for Pre-Training Small Foundation Models [52.56257450614992]
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding.
This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks.
We propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding.
arXiv Detail & Related papers (2023-11-06T14:44:34Z) - MatFormer: Nested Transformer for Elastic Inference [91.45687988953435]
MatFormer is a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model. We show that an 850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters.
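A simplified reading of the nested idea is sketched below: smaller sub-models reuse a prefix of the full FFN's hidden units, so one set of shared weights yields models of several sizes. The sizes and slicing scheme are illustrative, not MatFormer's exact formulation.
```python
# Simplified nested FFN: sub-models use only the first `width` hidden units of
# shared weights. Illustration only, not MatFormer's exact formulation.
import torch
import torch.nn.functional as F

class NestedFFN(torch.nn.Module):
    def __init__(self, dim=1024, max_hidden=4096):
        super().__init__()
        self.w_in = torch.nn.Linear(dim, max_hidden)
        self.w_out = torch.nn.Linear(max_hidden, dim)

    def forward(self, x, width):
        # Slice both projections so only the first `width` hidden units are used.
        h = F.gelu(F.linear(x, self.w_in.weight[:width], self.w_in.bias[:width]))
        return F.linear(h, self.w_out.weight[:, :width], self.w_out.bias)

ffn = NestedFFN()
x = torch.randn(2, 1024)
large = ffn(x, width=4096)        # full-capacity sub-model
small = ffn(x, width=1024)        # extracted smaller sub-model, same shared weights
print(large.shape, small.shape)   # both torch.Size([2, 1024])
```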
arXiv Detail & Related papers (2023-10-11T17:57:14Z) - Who's Harry Potter? Approximate Unlearning in LLMs [4.821438899378393]
Large language models (LLMs) are trained on massive internet corpora that often contain copyrighted content.
This poses legal and ethical challenges for the developers and users of these models, as well as the original authors and publishers.
We propose a novel technique for unlearning a subset of the training data from an LLM, without having to retrain it from scratch.
arXiv Detail & Related papers (2023-10-03T17:48:14Z) - Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and latency for a wide range of text generation applications.
On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal degradation in generation quality.
Our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture.
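A heavily simplified version of the big/little decoding loop is sketched below: the small model emits a token whenever it is confident, and the large model is consulted otherwise. The confidence threshold stands in for BiLD's fallback and rollback policies, and `small_lm` / `big_lm` are hypothetical callables that return next-token logits.
```python
# Heavily simplified big/little decoding loop (batch size 1). The confidence
# threshold is a stand-in for BiLD's fallback/rollback policies; `small_lm`
# and `big_lm` are hypothetical callables mapping token ids to logits of
# shape (1, seq_len, vocab).
import torch

def big_little_generate(small_lm, big_lm, prompt_ids, max_new_tokens=64,
                        confidence_threshold=0.6):
    ids = prompt_ids                                        # (1, seq_len)
    for _ in range(max_new_tokens):
        small_probs = torch.softmax(small_lm(ids)[:, -1], dim=-1)
        confidence, small_token = small_probs.max(dim=-1)   # each (1,)
        if confidence.item() >= confidence_threshold:
            next_token = small_token                        # cheap path
        else:
            big_probs = torch.softmax(big_lm(ids)[:, -1], dim=-1)
            next_token = big_probs.argmax(dim=-1)           # fall back to the big model
        ids = torch.cat([ids, next_token[:, None]], dim=-1)
    return ids
```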
arXiv Detail & Related papers (2023-02-15T18:55:29Z) - EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
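One plausible way to transfer that relative geometry, assumed here for illustration rather than taken from the paper, is to match the student's query-document score matrix to the teacher's with a row-wise KL divergence:
```python
# Hedged sketch: align the student's query-document similarity matrix with the
# teacher's via a row-wise KL divergence. One plausible geometry-transfer loss,
# not necessarily the paper's exact objective.
import torch
import torch.nn.functional as F

def geometry_distill_loss(student_q, student_d, teacher_q, teacher_d, tau=0.05):
    s_scores = F.normalize(student_q, dim=-1) @ F.normalize(student_d, dim=-1).T
    t_scores = F.normalize(teacher_q, dim=-1) @ F.normalize(teacher_d, dim=-1).T
    return F.kl_div(
        F.log_softmax(s_scores / tau, dim=-1),
        F.softmax(t_scores / tau, dim=-1),
        reduction="batchmean",
    )

# Toy usage: 4 queries and 16 documents, student dim 128, teacher dim 1024.
loss = geometry_distill_loss(
    torch.randn(4, 128, requires_grad=True), torch.randn(16, 128),
    torch.randn(4, 1024), torch.randn(16, 1024),
)
loss.backward()
```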
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - RGB-D Saliency Detection via Cascaded Mutual Information Minimization [122.8879596830581]
Existing RGB-D saliency detection models do not explicitly encourage RGB and depth to achieve effective multi-modal learning.
We introduce a novel multi-stage cascaded learning framework via mutual information minimization to "explicitly" model the multi-modal information between RGB image and depth data.
arXiv Detail & Related papers (2021-09-15T12:31:27Z) - It's the Best Only When It Fits You Most: Finding Related Models for Serving Based on Dynamic Locality Sensitive Hashing [1.581913948762905]
Preparation of training data is often a bottleneck in the lifecycle of deploying a deep learning model for production or research.
This paper proposes an end-to-end process of searching related models for serving based on the similarity of the target dataset and the training datasets of the available models.
arXiv Detail & Related papers (2020-10-13T22:52:13Z) - KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose a knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness.
Under zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
arXiv Detail & Related papers (2020-10-05T19:59:05Z) - SOLAR: Sparse Orthogonal Learned and Random Embeddings [45.920844071257754]
We argue that high-dimensional and ultra-sparse embedding is a significantly superior alternative to dense low-dimensional embedding for both query efficiency and accuracy.
We train 500K dimensional SOLAR embeddings for the tasks of searching through 1.6M books and multi-label classification on the three largest public datasets.
We achieve superior precision and recall compared to the respective state-of-the-art baselines for each of the tasks with up to 10 times faster speed.
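To see why ultra-sparse, high-dimensional vectors can be query-efficient, the generic sketch below scores candidates through an inverted index and touches only the dimensions a query activates; it is not SOLAR's implementation.
```python
# Generic illustration of retrieval with ultra-sparse vectors: an inverted
# index lets scoring touch only the dimensions the query activates.
from collections import defaultdict

def build_inverted_index(sparse_docs):
    """sparse_docs: list of {dimension_index: weight} mappings."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(sparse_docs):
        for dim, weight in vec.items():
            index[dim].append((doc_id, weight))
    return index

def search(query_vec, index):
    """Score documents by the dot product over shared non-zero dimensions."""
    scores = defaultdict(float)
    for dim, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(dim, ()):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: -item[1])

docs = [{3: 0.9, 17: 0.2}, {3: 0.4, 512000: 0.8}, {99: 1.0}]
index = build_inverted_index(docs)
print(search({3: 1.0, 512000: 0.5}, index))   # docs 0 and 1 are scored; doc 2 is never touched
```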
arXiv Detail & Related papers (2020-08-30T17:35:35Z) - Students Need More Attention: BERT-based Attention Model for Small Data with Application to Automatic Patient Message Triage [65.7062363323781]
We propose a novel framework based on BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining).
(i) We introduce Label Embeddings for Self-Attention in each layer of BERT, which we call LESA-BERT, and (ii) by distilling LESA-BERT into smaller variants, we aim to reduce overfitting and model size when working on small datasets.
As an application, our framework is utilized to build a model for patient portal message triage that classifies the urgency of a message into three categories: non-urgent, medium and urgent.
arXiv Detail & Related papers (2020-06-22T03:39:00Z)