Related papers: Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark

Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark

URL: http://arxiv.org/abs/2406.01607v2
Date: Wed, 19 Jun 2024 06:52:13 GMT
Title: Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark
Authors: Hongliu Cao,
Abstract summary: We provide an overview of the advances in universal text embedding models with a focus on the top performing text embeddings on Massive Text Embedding Benchmark (MTEB) Through detailed comparison and analysis, we highlight the key contributions and limitations in this area, and propose potentially inspiring future research directions.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text embedding methods have become increasingly popular in both industrial and academic fields due to their critical role in a variety of natural language processing tasks. The significance of universal text embeddings has been further highlighted with the rise of Large Language Models (LLMs) applications such as Retrieval-Augmented Systems (RAGs). While previous models have attempted to be general-purpose, they often struggle to generalize across tasks and domains. However, recent advancements in training data quantity, quality and diversity; synthetic data generation from LLMs as well as using LLMs as backbones encourage great improvements in pursuing universal text embeddings. In this paper, we provide an overview of the recent advances in universal text embedding models with a focus on the top performing text embeddings on Massive Text Embedding Benchmark (MTEB). Through detailed comparison and analysis, we highlight the key contributions and limitations in this area, and propose potentially inspiring future research directions.

Related papers

On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey [39.840208834931076]
General-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations.<n>We provide a comprehensive overview of GPTE in the era of pretrained language models (PLMs)<n>We describe advanced roles enabled by PLMs, such as multilingual support, multimodal integration, code understanding, and scenario-specific adaptation.
arXiv Detail & Related papers (2025-07-28T12:52:24Z)
Generalizing vision-language models to novel domains: A comprehensive survey [55.97518817219619]
Vision-language pretraining has emerged as a transformative technique that integrates the strengths of both visual and textual modalities.<n>This survey aims to comprehensively summarize the generalization settings, methodologies, benchmarking and results in VLM literatures.
arXiv Detail & Related papers (2025-06-23T10:56:37Z)
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective [40.29094043868067]
We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval.<n>Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
arXiv Detail & Related papers (2025-05-21T02:59:14Z)
General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs)<n>We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains.<n>We train a series of models and evaluate them on a wide range of datasets covering wide domains like physics, chemistry, finance, electronics etc.
arXiv Detail & Related papers (2025-05-20T17:41:33Z)
Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z)
Movie2Story: A framework for understanding videos and telling stories in the form of novel text [0.0]
We propose a novel benchmark to evaluate text generation capabilities in scenarios enriched with auxiliary information. Our work introduces an innovative automatic dataset generation method to ensure the availability of accurate auxiliary information. Our experiments reveal that current Multi-modal Large Language Models (MLLMs) perform suboptimally under the proposed evaluation metrics.
arXiv Detail & Related papers (2024-12-19T15:44:04Z)
When Text Embedding Meets Large Language Model: A Comprehensive Survey [17.263184207651072]
This survey focuses on the interplay between large language models (LLMs) and text embeddings. It offers a novel and systematic overview of contributions from various research and application domains. Building on this analysis, we outline prospective directions for the evolution of text embedding.
arXiv Detail & Related papers (2024-12-12T10:50:26Z)
Evaluating LLM Prompts for Data Augmentation in Multi-label Classification of Ecological Texts [1.565361244756411]
Large language models (LLMs) play a crucial role in natural language processing (NLP) tasks. This study applied prompt-based data augmentation to detect mentions of green practices in Russian social media.
arXiv Detail & Related papers (2024-11-22T12:37:41Z)
Capturing research literature attitude towards Sustainable Development Goals: an LLM-based topic modeling approach [0.7806050661713976]
The Sustainable Development Goals were formulated by the United Nations in 2015 to address these global challenges by 2030. Natural language processing techniques can help uncover discussions on SDGs within research literature. We propose a completely automated pipeline to fetch content from the Scopus database and prepare datasets dedicated to five groups of SDGs.
arXiv Detail & Related papers (2024-11-05T09:37:23Z)
CUDRT: Benchmarking the Detection Models of Human vs. Large Language Models Generated Texts [9.682499180341273]
Large language models (LLMs) have greatly enhanced text generation across industries. Their human-like outputs make distinguishing between human and AI authorship challenging. Current benchmarks mainly rely on static datasets, limiting their effectiveness in assessing model-based detectors.
arXiv Detail & Related papers (2024-06-13T12:43:40Z)
Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as textscLlama-2 and textscMistral. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation. We present an extensive overview by categorizing these works in terms of various IE subtasks and techniques. We empirically analyze the most advanced methods and discover the emerging trend of IE tasks with LLMs.
arXiv Detail & Related papers (2023-12-29T14:25:22Z)
Measuring Distributional Shifts in Text: The Advantage of Language Model-Based Embeddings [11.393822909537796]
An essential part of monitoring machine learning models in production is measuring input and output data drift. Recent advancements in large language models (LLMs) indicate their effectiveness in capturing semantic relationships. We propose a clustering-based algorithm for measuring distributional shifts in text data by exploiting such embeddings.
arXiv Detail & Related papers (2023-12-04T20:46:48Z)
Adapting Large Language Models to Domains via Reading Comprehension [86.24451681746676]
We explore how continued pre-training on domain-specific corpora influences large language models. We show that training on the raw corpora endows the model with domain knowledge, but drastically hurts its ability for question answering. We propose a simple method for transforming raw corpora into reading comprehension texts.
arXiv Detail & Related papers (2023-09-18T07:17:52Z)
A Survey of Pretrained Language Models Based Text Generation [97.64625999380425]
Text Generation aims to produce plausible and readable text in human language from input data. Deep learning has greatly advanced this field by neural generation models, especially the paradigm of pretrained language models (PLMs) Grounding text generation on PLMs is seen as a promising direction in both academia and industry.
arXiv Detail & Related papers (2022-01-14T01:44:58Z)
Pretrained Language Models for Text Generation: A Survey [46.03096493973206]
We present an overview of the major advances achieved in the topic of pretrained language models (PLMs) for text generation. We discuss how to adapt existing PLMs to model different input data and satisfy special properties in the generated text.
arXiv Detail & Related papers (2021-05-21T12:27:44Z)
SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation. Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
Progressive Generation of Long Text with Pretrained Language Models [83.62523163717448]
Large-scale language models (LMs) pretrained on massive corpora of text, such as GPT-2, are powerful open-domain text generators. It is still challenging for such models to generate coherent long passages of text, especially when the models are fine-tuned to the target domain on a small corpus. We propose a simple but effective method of generating text in a progressive manner, inspired by generating images from low to high resolution.
arXiv Detail & Related papers (2020-06-28T21:23:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.