On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey
- URL: http://arxiv.org/abs/2507.20783v1
- Date: Mon, 28 Jul 2025 12:52:24 GMT
- Title: On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey
- Authors: Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, Min Zhang
- Abstract summary: General-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. We provide a comprehensive overview of GPTE in the era of pretrained language models (PLMs). We describe advanced roles enabled by PLMs, such as multilingual support, multimodal integration, code understanding, and scenario-specific adaptation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, such as retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. Then, we describe advanced roles enabled by PLMs, such as multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.
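To make the recipe in the abstract concrete, below is a minimal sketch of the generic GPTE pipeline it describes: a PLM encoder producing token states, mean pooling into a dense text embedding, and an InfoNCE-style contrastive loss over paired texts with in-batch negatives. This is an illustration under assumed tooling (PyTorch and Hugging Face transformers); the model name `bert-base-uncased`, the mean-pooling choice, and the temperature `tau` are placeholder assumptions, not prescriptions from the survey.

```python
# Sketch of the generic GPTE recipe: PLM encoder -> dense embedding via
# mean pooling -> InfoNCE contrastive loss with in-batch negatives.
# Model name and hyperparameters below are illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder PLM; any encoder works
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool last hidden states over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    summed = (hidden * mask).sum(dim=1)                   # ignore padding positions
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return F.normalize(summed / counts, dim=-1)           # unit-length embeddings

def info_nce(queries: torch.Tensor, positives: torch.Tensor, tau: float = 0.05):
    """Row i of `positives` is the positive for row i of `queries`;
    every other row in the batch serves as a negative."""
    logits = queries @ positives.T / tau                  # cosine sims / temperature
    labels = torch.arange(queries.size(0))
    return F.cross_entropy(logits, labels)

# One illustrative training step on a toy pair batch.
pairs = [("what is a text embedding?", "a dense vector representing text"),
         ("capital of france", "paris is the capital of france")]
q = embed([p[0] for p in pairs])
d = embed([p[1] for p in pairs])
loss = info_nce(q, d)
loss.backward()  # gradients flow into the PLM encoder
```

In-batch negatives are a common way to scale contrastive training without explicitly mining negatives; production GPTE systems typically build on this basic objective with larger batches and curated hard negatives.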
Related papers
- Generalizing vision-language models to novel domains: A comprehensive survey [55.97518817219619]
Vision-language pretraining has emerged as a transformative technique that integrates the strengths of both visual and textual modalities. This survey aims to comprehensively summarize the generalization settings, methodologies, benchmarking, and results in the VLM literature.
arXiv Detail & Related papers (2025-06-23T10:56:37Z)
- Large Language Models in Argument Mining: A Survey [15.041650203089057]
Argument Mining (AM) focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning. This survey systematically synthesizes recent advancements in LLM-driven AM.
arXiv Detail & Related papers (2025-06-19T15:12:58Z)
- An Empirical Study of Federated Prompt Learning for Vision Language Model [50.73746120012352]
This paper systematically investigates behavioral differences between language prompt learning and vision prompt learning. We conduct experiments to evaluate the impact of various federated learning (FL) and prompt configurations, such as client scale, aggregation strategies, and prompt length. We explore strategies for enhancing prompt learning in complex scenarios where label skew and domain shift coexist.
arXiv Detail & Related papers (2025-05-29T03:09:15Z)
- How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective [64.00022624183781]
Large language models (LLMs) can assess relevance and support information retrieval (IR) tasks. We investigate how different LLM modules contribute to relevance judgment through the lens of mechanistic interpretability.
arXiv Detail & Related papers (2025-04-10T16:14:55Z)
- POI-Enhancer: An LLM-based Semantic Enhancement Framework for POI Representation Learning [34.93661259065691]
Recent studies have shown that enriching point-of-interest (POI) representations with multimodal information can significantly enhance their task performance. Large language models (LLMs) trained on extensive text data have been found to possess rich textual knowledge. We propose POI-Enhancer, a portable framework that leverages LLMs to improve POI representations produced by classic POI learning models.
arXiv Detail & Related papers (2025-02-14T09:34:24Z)
- When Text Embedding Meets Large Language Model: A Comprehensive Survey [17.263184207651072]
This survey focuses on the interplay between large language models (LLMs) and text embeddings. It offers a novel and systematic overview of contributions from various research and application domains. Building on this analysis, we outline prospective directions for the evolution of text embedding.
arXiv Detail & Related papers (2024-12-12T10:50:26Z)
- Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark [0.0]
We provide an overview of advances in universal text embedding models, focusing on the top-performing text embeddings on the Massive Text Embedding Benchmark (MTEB).
Through detailed comparison and analysis, we highlight the key contributions and limitations in this area, and propose potentially inspiring future research directions.
arXiv Detail & Related papers (2024-05-27T09:52:54Z)
- Exploring Large Language Model for Graph Data Understanding in Online Job Recommendations [63.19448893196642]
We present a novel framework that harnesses the rich contextual information and semantic representations provided by large language models to analyze behavior graphs.
By leveraging this capability, our framework enables personalized and accurate job recommendations for individual users.
arXiv Detail & Related papers (2023-07-10T11:29:41Z)
- A Survey of Pretrained Language Models Based Text Generation [97.64625999380425]
Text Generation aims to produce plausible and readable text in human language from input data.
Deep learning has greatly advanced this field through neural generation models, especially the paradigm of pretrained language models (PLMs).
Grounding text generation on PLMs is seen as a promising direction in both academia and industry.
arXiv Detail & Related papers (2022-01-14T01:44:58Z)
- Pretrained Language Models for Text Generation: A Survey [46.03096493973206]
We present an overview of the major advances achieved in the topic of pretrained language models (PLMs) for text generation.
We discuss how to adapt existing PLMs to model different input data and satisfy special properties in the generated text.
arXiv Detail & Related papers (2021-05-21T12:27:44Z)