Text Embeddings by Weakly-Supervised Contrastive Pre-training
- URL: http://arxiv.org/abs/2212.03533v2
- Date: Thu, 22 Feb 2024 06:21:51 GMT
- Title: Text Embeddings by Weakly-Supervised Contrastive Pre-training
- Authors: Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei
- Abstract summary: E5 is a family of state-of-the-art text embeddings that transfer well to a wide range of tasks.
E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts.
- Score: 98.31785569325402
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents E5, a family of state-of-the-art text embeddings that
transfer well to a wide range of tasks. The model is trained in a contrastive
manner with weak supervision signals from our curated large-scale text pair
dataset (called CCPairs). E5 can be readily used as a general-purpose embedding
model for any tasks requiring a single-vector representation of texts such as
retrieval, clustering, and classification, achieving strong performance in both
zero-shot and fine-tuned settings. We conduct extensive evaluations on 56
datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the
first model that outperforms the strong BM25 baseline on the BEIR retrieval
benchmark without using any labeled data. When fine-tuned, E5 obtains the best
results on the MTEB benchmark, beating existing embedding models with 40x more
parameters.
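As a concrete illustration of the training setup the abstract describes, below is a minimal sketch of an InfoNCE-style contrastive loss with in-batch negatives over embedded text pairs. The temperature value and embedding dimension are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal InfoNCE-style contrastive loss with in-batch negatives (a sketch,
# not the paper's training code). Each query is paired with the passage at
# the same batch index; all other passages in the batch serve as negatives.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    """query_emb, passage_emb: (batch, dim) encoder outputs for paired texts."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # Cosine similarity of every query to every passage; the diagonal holds the positives.
    logits = (q @ p.T) / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

# Random tensors stand in for encoder outputs in this self-contained example.
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```

And a sketch of the single-vector usage for zero-shot retrieval via the Hugging Face transformers API. The checkpoint name and the "query:" / "passage:" input prefixes follow the publicly released E5 model cards and should be treated as assumptions here rather than details stated in the abstract.

```python
# Sketch: embed a query and candidate passages with one vector each, then
# rank passages by cosine similarity (assumes the "intfloat/e5-base-v2" checkpoint).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def average_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then mean-pool token states into one vector per text.
    masked = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

model_name = "intfloat/e5-base-v2"  # assumed released checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

texts = [
    "query: what is weakly-supervised contrastive pre-training",
    "passage: E5 is trained contrastively on large-scale text pairs.",
    "passage: BM25 is a classical sparse lexical retrieval baseline.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)
embeddings = F.normalize(
    average_pool(outputs.last_hidden_state, batch["attention_mask"]), dim=-1
)
# Cosine similarity of the query (row 0) to each candidate passage.
print(embeddings[:1] @ embeddings[1:].T)
```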
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models [38.41524186248607]
We introduce the NV-Embed model with a variety of architectural designs and training procedures.
Our model has achieved a record-high score of 69.32, ranking No. 1 on the Massive Text Embedding Benchmark (MTEB).
We open-source the model at: https://huggingface.co/nvidia/NV-Embed-v1.
arXiv Detail & Related papers (2024-05-27T17:59:45Z)
- Neural Summarization of Electronic Health Records [8.784162652042957]
We studied the viability of the automatic generation of various sections of a discharge summary using four state-of-the-art neural network summarization models.
Fine-tuning language models that had previously been instruction fine-tuned yielded better performance when summarizing entire reports.
arXiv Detail & Related papers (2023-05-24T15:05:53Z)
- T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
- Scaling Instruction-Finetuned Language Models [126.4789306516927]
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance.
We find that instruction finetuning dramatically improves performance on a variety of model classes.
arXiv Detail & Related papers (2022-10-20T16:58:32Z)
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.