Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models
- URL: http://arxiv.org/abs/2405.05374v1
- Date: Wed, 8 May 2024 19:05:18 GMT
- Title: Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models
- Authors: Luke Merrick, Danmei Xu, Gaurav Nuti, Daniel Campos
- Abstract summary: This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models.
At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of their size on the MTEB Retrieval leaderboard.
- Score: 5.2094499417507105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters, with weights open-sourced under an Apache-2.0 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of their size on the MTEB Retrieval leaderboard, with the largest model, arctic-embed-l, outperforming closed-source embedding models such as Cohere's embed-v3 and OpenAI's text-embedding-3-large. In addition to the details of our training recipe, we provide several informative ablation studies, which we believe explain our models' performance.
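For orientation (not taken from the abstract itself): retrieval embedding models of this kind are typically trained with an in-batch-negative contrastive objective. The snippet below is a minimal sketch of such an InfoNCE-style loss, not the authors' exact implementation; the temperature value is purely illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.02):
    """In-batch-negative contrastive (InfoNCE) loss sketch.

    query_emb, doc_emb: (batch, dim) embeddings of paired queries/documents.
    Each query treats its paired document as the positive and every other
    document in the batch as a negative. The temperature is illustrative.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```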
Related papers
- Dewey Long Context Embedding Model: A Technical Report [0.0]
dewey_en_beta is a novel text embedding model that achieves excellent performance on the MTEB (Eng, v2) and LongEmbed benchmarks.
This report presents the training methodology and evaluation results of the open-source dewey_en_beta embedding model.
arXiv Detail & Related papers (2025-03-26T09:55:00Z) - Granite Embedding Models [26.86244952892162]
We introduce the Granite Embedding models, a family of encoder-based embedding models designed for retrieval tasks.
This report provides the technical details of training these highly effective 12 layer embedding models, along with their efficient 6 layer distilled counterparts.
We publicly release all our Granite Embedding models under the Apache 2.0 license, allowing both research and commercial use at https://huggingface.co/collections/ibm-granite.
arXiv Detail & Related papers (2025-02-27T15:45:16Z) - Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation [92.17176311351469]
We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework.
Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale.
Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs.
arXiv Detail & Related papers (2025-02-04T18:18:50Z) - Jasper and Stella: distillation of SOTA embedding models [8.708650717134008]
We propose a novel multi-stage distillation framework that enables a smaller student embedding model to distill multiple teacher embedding models.
We utilize Matryoshka Representation Learning (MRL) to reduce the vector dimensionality of the student embedding model effectively.
Our student model, Jasper, with 2 billion parameters, built upon the Stella embedding model, obtained the No. 3 position on the Massive Text Embedding Benchmark leaderboard.
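As context for the MRL step mentioned above, the sketch below illustrates in generic form (not the authors' implementation) how a Matryoshka-style objective supervises nested prefixes of the same embedding so that truncated vectors remain usable; the prefix dimensions and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def matryoshka_contrastive_loss(q, d, dims=(64, 256, 1024), temperature=0.02):
    """MRL-style sketch: apply the same contrastive loss to nested prefixes of
    the embedding so truncated vectors stay useful. `dims` and `temperature`
    are illustrative, not any paper's settings."""
    total = 0.0
    for k in dims:
        qk = F.normalize(q[:, :k], dim=-1)
        dk = F.normalize(d[:, :k], dim=-1)
        logits = qk @ dk.T / temperature
        labels = torch.arange(q.size(0), device=q.device)
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)
```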
arXiv Detail & Related papers (2024-12-26T04:05:28Z) - TÜLU 3: Pushing Frontiers in Open Language Model Post-Training [94.14908801708049]
We introduce T"ULU 3, a family of fully-open state-of-the-art post-trained models.
T"ULU 3 builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku.
arXiv Detail & Related papers (2024-11-22T18:44:04Z) - VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We build universal embedding models capable of handling a wide range of downstream tasks.
Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB.
arXiv Detail & Related papers (2024-10-07T16:14:05Z) - Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models [146.18107944503436]
Molmo is a new family of VLMs that are state-of-the-art in their class of openness.
Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators.
We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future.
arXiv Detail & Related papers (2024-09-25T17:59:51Z) - xGen-MM (BLIP-3): A Family of Open Large Multimodal Models [157.44696790158784]
This report introduces xGen-MM, a framework for developing Large Multimodal Models (LMMs).
The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs.
Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks.
arXiv Detail & Related papers (2024-08-16T17:57:01Z) - Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
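For readers unfamiliar with the distillation half of this recipe, the snippet below is a generic logit-distillation sketch (KL divergence between temperature-softened teacher and student distributions); it is not the paper's exact loss, and the temperature is an illustrative choice.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic KL-divergence logit distillation; temperature is illustrative."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t**2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```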
arXiv Detail & Related papers (2024-07-19T21:47:57Z) - WAVE: Weight Template for Adaptive Initialization of Variable-sized Models [37.97945436202779]
WAVE achieves state-of-the-art performance when initializing models with various depths and widths.
WAVE simultaneously achieves the most efficient knowledge transfer across a series of datasets.
arXiv Detail & Related papers (2024-06-25T12:43:33Z) - A Three-Phases SFT Hybrid Model Integrated Strong Prior Module and Data Overlap Estimation in the Eduation Context [0.0]
We propose an end-to-end, prior-based, three-phase supervised fine-tuned model.
Our model realizes the structural disassembly and incremental guided output of educational knowledge.
Our model also achieves state-of-the-art code abilities compared to open-source models.
arXiv Detail & Related papers (2024-03-13T05:38:39Z) - Multilingual E5 Text Embeddings: A Technical Report [63.503320030117145]
Three embedding models of different sizes are provided, offering a balance between inference efficiency and embedding quality.
We introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes.
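As a usage note based on the public E5 model cards rather than this summary, E5-style models are typically fed role prefixes such as "query: " and "passage: "; the sketch below assumes the intfloat/multilingual-e5-large checkpoint and the sentence-transformers API.

```python
# Hedged usage sketch (assumptions: model name and prefix convention come from
# the public Hugging Face model card, not from this report).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")
queries = ["query: how do text embedding models work?"]
passages = ["passage: Text embedding models map text to dense vectors."]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = q_emb @ p_emb.T  # cosine similarity, since embeddings are normalized
```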
arXiv Detail & Related papers (2024-02-08T13:47:50Z) - Who's Harry Potter? Approximate Unlearning in LLMs [4.821438899378393]
Large language models (LLMs) are trained on massive internet corpora that often contain copyrighted content.
This poses legal and ethical challenges for the developers and users of these models, as well as the original authors and publishers.
We propose a novel technique for unlearning a subset of the training data from an LLM, without having to retrain it from scratch.
arXiv Detail & Related papers (2023-10-03T17:48:14Z) - Abstractive Text Summarization based on Language Model Conditioning and
Locality Modeling [4.525267347429154]
We train a Transformer-based neural model on the BERT language model.
In addition, we propose a new method of BERT-windowing, which allows chunk-wise processing of texts longer than the BERT window size.
The results of our models are compared to a baseline and the state-of-the-art models on the CNN/Daily Mail dataset.
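The chunk-wise processing behind BERT-windowing can be sketched as an overlapping sliding window over the token sequence; the window size and stride below are illustrative, not the paper's settings.

```python
def window_token_ids(token_ids, window_size=512, stride=256):
    """Split a long token sequence into overlapping windows so each chunk fits
    the encoder's maximum input length. Sizes here are illustrative."""
    if len(token_ids) <= window_size:
        return [token_ids]
    windows = []
    start = 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break  # the last window already reaches the end of the sequence
        start += stride
    return windows
```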
arXiv Detail & Related papers (2020-03-29T14:00:17Z) - Model Reuse with Reduced Kernel Mean Embedding Specification [70.044322798187]
We present a two-phase framework for finding helpful models for a current application.
In the upload phase, when a model is uploaded into the pool, we construct a reduced kernel mean embedding (RKME) as a specification for the model.
Then, in the deployment phase, the relatedness of the current task to the pre-trained models is measured based on the RKME specification.
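As background for the RKME specification (a rough sketch, not the paper's reduced construction), the empirical kernel mean embedding of a dataset and the RKHS distance between two datasets can be computed as follows; the RBF kernel and its bandwidth are illustrative choices.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kme_distance_sq(X, Y, gamma=1.0):
    """Squared RKHS distance between the empirical kernel mean embeddings of
    two samples X and Y (i.e., squared MMD). Kernel and gamma are illustrative."""
    return (rbf_kernel(X, X, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean())
```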
arXiv Detail & Related papers (2020-01-20T15:15:07Z)