Enhancing Lexicon-Based Text Embeddings with Large Language Models
- URL: http://arxiv.org/abs/2501.09749v1
- Date: Thu, 16 Jan 2025 18:57:20 GMT
- Title: Enhancing Lexicon-Based Text Embeddings with Large Language Models
- Authors: Yibin Lei, Tao Shen, Yu Cao, Andrew Yates
- Abstract summary: Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. LENS consolidates the vocabulary space through token embedding clustering, and investigates bidirectional attention and various pooling strategies. LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB).
- Score: 19.91595650613768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first Lexicon-based EmbeddiNgS (LENS) leveraging LLMs that achieve competitive performance on these tasks. To address the inherent tokenization redundancy and unidirectional attention limitations of traditional causal LLMs, LENS consolidates the vocabulary space through token embedding clustering and investigates bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexicon matching by assigning each dimension to a specific token cluster, where semantically similar tokens are grouped together, and unlocks the full potential of LLMs through bidirectional attention. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact feature representations that match the sizes of dense counterparts. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e. BEIR).
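To make the cluster-then-pool idea concrete, here is a minimal sketch assuming KMeans clustering of the LLM's output token embeddings and max-pooling of per-position vocabulary logits within each cluster; the function names, cluster count, and log1p compression are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of lexicon-based embeddings
# whose dimensions correspond to clusters of semantically similar tokens.
import numpy as np
from sklearn.cluster import KMeans

def build_cluster_map(token_embeddings: np.ndarray, n_clusters: int = 4096) -> np.ndarray:
    """Cluster the LLM's output token embeddings so each vocabulary id maps to one cluster."""
    km = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0)
    return km.fit_predict(token_embeddings)           # shape: (vocab_size,)

def lexicon_embedding(token_logits: np.ndarray, cluster_ids: np.ndarray, n_clusters: int) -> np.ndarray:
    """Pool per-position vocabulary logits (seq_len, vocab_size) into one value per cluster."""
    pooled = token_logits.max(axis=0)                 # max-pool over sequence positions
    emb = np.full(n_clusters, -np.inf)
    np.maximum.at(emb, cluster_ids, pooled)           # keep the best logit within each cluster
    return np.log1p(np.maximum(emb, 0.0))             # ReLU + log-saturation, as in lexicon models
```

Two texts embedded this way can then be scored with a sparse dot product, exactly as in classical lexicon matching.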
Related papers
- ClusterFusion: Hybrid Clustering with Embedding Guidance and LLM Adaptation [52.794544682493814]
Large language models (LLMs) provide strong contextual reasoning, yet prior work mainly uses them as auxiliary modules to refine embeddings or adjust cluster boundaries. We propose ClusterFusion, a hybrid framework that treats the LLM as the clustering core, guided by lightweight embedding methods. Experiments on three public benchmarks and two new domain-specific datasets demonstrate that ClusterFusion achieves state-of-the-art performance on standard tasks.
arXiv Detail & Related papers (2025-12-04T00:49:43Z) - ESMC: MLLM-Based Embedding Selection for Explainable Multiple Clustering [79.69917150582633]
Multi-modal large language models (MLLMs) can be leveraged to achieve user-driven clustering. Our method first discovers that MLLMs' hidden states of text tokens are strongly related to the corresponding features. We also employ a lightweight clustering head augmented with pseudo-label learning, significantly enhancing clustering accuracy.
arXiv Detail & Related papers (2025-11-30T04:36:51Z) - LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering [52.41664454251679]
Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering. Existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task.
arXiv Detail & Related papers (2025-11-19T13:22:08Z) - FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. We present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation of subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z) - Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation [6.004292247258359]
We show how to densely align images with synthetic descriptions generated by generative vision-language models. Our approach outperforms prior work on standard zero-shot open-vocabulary segmentation benchmarks/datasets.
arXiv Detail & Related papers (2025-09-15T12:26:47Z) - Training LLMs to be Better Text Embedders through Bidirectional Reconstruction [37.53732954585151]
We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query).
arXiv Detail & Related papers (2025-09-03T04:55:26Z) - Exploring Reasoning-Infused Text Embedding with Large Language Models for Zero-Shot Dense Retrieval [24.53573526375476]
Reasoning-Infused Text Embedding builds upon existing language model embedding techniques by generating intermediate reasoning texts before computing embeddings. Results on BRIGHT, a reasoning-intensive retrieval benchmark, demonstrate that RITE significantly enhances zero-shot retrieval performance across diverse domains.
arXiv Detail & Related papers (2025-08-29T23:22:34Z) - Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models [3.8688081072587326]
Causal2Vec is a general-purpose embedding model tailored to enhance the performance of decoder-only large language models. We first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token. To mitigate the recency bias of last-token pooling, the last hidden states of the Contextual and EOS tokens are used as the final text embedding.
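A minimal sketch of this pooling step, assuming the Contextual token is prepended at position 0 and that the two hidden states are simply concatenated (the exact combination is not specified in this summary):

```python
import torch

def contextual_eos_pool(hidden_states: torch.Tensor, contextual_pos: int = 0) -> torch.Tensor:
    """hidden_states: (seq_len, hidden_dim) last-layer states of the decoder-only LLM.
    The Contextual token produced by the lightweight encoder is assumed to sit at
    contextual_pos; the EOS token is the last position. Combining both counters the
    recency bias of plain last-token pooling."""
    contextual = hidden_states[contextual_pos]
    eos = hidden_states[-1]
    return torch.cat([contextual, eos], dim=-1)  # assumed combination; averaging is another option
```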
arXiv Detail & Related papers (2025-07-31T10:01:11Z) - Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning [6.549601823162279]
Large Language Models (LLMs) have become a cornerstone in Natural Language Processing (NLP). We explore several adaptation strategies for pre-trained, decoder-only LLMs.
arXiv Detail & Related papers (2025-07-30T14:49:30Z) - LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance [56.474856189865946]
Large multi-modal models (LMMs) struggle with inaccurate segmentation and hallucinated comprehension. We propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation. LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
arXiv Detail & Related papers (2025-07-08T07:46:26Z) - Cost-Effective Text Clustering with Large Language Models [15.179854529085544]
This paper proposes TECL, a cost-effective framework that taps into the feedback from large language models for accurate text clustering.
Under the hood, TECL adopts our EdgeLLM or TriangleLLM to construct must-link/cannot-link constraints for text pairs.
Our experiments on multiple benchmark datasets show that TECL consistently and considerably outperforms existing solutions in unsupervised text clustering.
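The summary above only names the constraint-building components; a hypothetical sketch of turning pairwise LLM judgments into must-link/cannot-link constraints under a query budget (ask_llm, the prompt, and the budget are placeholders, not TECL's actual EdgeLLM/TriangleLLM procedure):

```python
# Hypothetical sketch: derive pairwise clustering constraints from LLM yes/no judgments.
# `ask_llm` is a placeholder callable returning True/False; it is not TECL's actual prompt design.
from itertools import combinations

def build_constraints(texts, ask_llm, budget=1000):
    must_link, cannot_link = [], []
    for n_queries, (i, j) in enumerate(combinations(range(len(texts)), 2)):
        if n_queries >= budget:          # LLM calls dominate the cost, so cap them
            break
        same_topic = ask_llm(
            f"Do the following two texts belong to the same cluster?\n"
            f"A: {texts[i]}\nB: {texts[j]}\nAnswer yes or no."
        )
        (must_link if same_topic else cannot_link).append((i, j))
    return must_link, cannot_link
```

The resulting constraints can then be handed to any constrained clustering routine.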
arXiv Detail & Related papers (2025-04-22T06:57:49Z) - SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection [16.89965584177711]
Recent open-vocabulary human-object interaction (OV-HOI) detection methods rely on large language models (LLMs) for generating auxiliary descriptions and leverage knowledge distilled from CLIP to detect unseen interaction categories.
Despite their effectiveness, these methods face two challenges: (1) feature granularity deficiency, due to reliance on last layer visual features for text alignment, leading to the neglect of crucial object-level details from intermediate layers; (2) semantic similarity confusion, resulting from CLIP's inherent biases toward certain classes, while LLM-generated descriptions based solely on labels fail to adequately capture inter-class similarities.
arXiv Detail & Related papers (2025-03-01T09:26:05Z) - Model Generalization on Text Attribute Graphs: Principles with Large Language Models [14.657522068231138]
Large language models (LLMs) have been introduced to graph learning, aiming to extend their zero-shot generalization success to tasks where labeled graph data is scarce.
We develop a framework for inference over text-attributed graphs (TAGs) based on task-adaptive embeddings and a generalizable graph information aggregation mechanism.
Evaluations on 11 real-world TAG benchmarks demonstrate that LLM-BP significantly outperforms existing approaches.
arXiv Detail & Related papers (2025-02-17T14:31:00Z) - Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment [69.67015515485349]
We propose AutoRegEmbed, a contrastive learning method built on embedding conditional probability distributions.
We show that our method significantly outperforms traditional contrastive learning approaches.
arXiv Detail & Related papers (2025-02-17T03:36:25Z) - AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding [63.09928907734156]
AlignVLM is a vision-text alignment method that maps visual features to a weighted average of text embeddings.
Our experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods.
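A minimal sketch of the "weighted average of text embeddings" idea: project each visual feature to a distribution over the LLM vocabulary and take the corresponding convex combination of the frozen token-embedding table (the single linear projection and the shapes are assumptions of this sketch, not the paper's exact connector):

```python
import torch
import torch.nn as nn

class AlignStyleConnector(nn.Module):
    """Sketch: map visual features to convex combinations of the LLM's token embeddings."""
    def __init__(self, vision_dim: int, text_embeddings: torch.Tensor):
        super().__init__()
        vocab_size, _ = text_embeddings.shape
        self.to_vocab = nn.Linear(vision_dim, vocab_size)
        self.register_buffer("text_embeddings", text_embeddings)  # frozen LLM embedding table

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        weights = self.to_vocab(visual_features).softmax(dim=-1)  # per-patch distribution over vocab
        return weights @ self.text_embeddings                     # weighted average of text embeddings
```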
arXiv Detail & Related papers (2025-02-03T13:34:51Z) - Context-Alignment: Activating and Enhancing LLM Capabilities in Time Series [3.453940014682793]
We propose Context-Alignment to align time series (TS) data with a linguistic component in language environments familiar to Large Language Models (LLMs). Such context-level alignment comprises structural alignment and logical alignment, which is achieved by Dual-Scale Context-Alignment GNNs (DSCA-GNNs). Extensive experiments show the effectiveness of DECA and the importance of Context-Alignment across tasks, particularly in few-shot and zero-shot forecasting.
arXiv Detail & Related papers (2025-01-07T12:40:35Z) - Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
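A rough sketch of such a per-layer exit decision, assuming a small MLP gate over the pooled text-token states (this gate is an illustrative stand-in, not DyVTE's actual hyper-network):

```python
import torch
import torch.nn as nn

class ExitGate(nn.Module):
    """Illustrative per-layer gate (not DyVTE's exact hyper-network): reads the pooled
    text-token states and decides whether visual tokens can be dropped from here on."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 1),
        )

    def forward(self, text_states: torch.Tensor) -> bool:
        score = self.mlp(text_states.mean(dim=0))  # pool over text tokens
        return bool(torch.sigmoid(score) > 0.5)    # True -> remove all visual tokens now
```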
arXiv Detail & Related papers (2024-11-29T11:24:23Z) - Towards Scalable Semantic Representation for Recommendation [65.06144407288127]
Mixture-of-Codes is proposed to construct semantic IDs based on large language models (LLMs).
Our method achieves superior discriminability and dimension robustness scalability, leading to the best scale-up performance in recommendations.
arXiv Detail & Related papers (2024-10-12T15:10:56Z) - Text Clustering as Classification with LLMs [9.128151647718251]
We propose a novel framework that reframes text clustering as a classification task by harnessing the in-context learning capabilities of Large Language Models. By leveraging the advanced natural language understanding and generalization capabilities of LLMs, the proposed approach enables effective clustering with minimal human intervention. Experimental results on diverse datasets demonstrate that our framework achieves comparable or superior performance to state-of-the-art embedding-based clustering techniques.
arXiv Detail & Related papers (2024-09-30T16:57:34Z) - Scaling Up Summarization: Leveraging Large Language Models for Long Text Extractive Summarization [0.27624021966289597]
This paper introduces EYEGLAXS, a framework that leverages Large Language Models (LLMs) for extractive summarization.
EYEGLAXS focuses on extractive summarization to ensure factual and grammatical integrity.
The system sets new performance benchmarks on well-known datasets like PubMed and ArXiv.
arXiv Detail & Related papers (2024-08-28T13:52:19Z) - ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning [72.90823351726374]
We introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs.
We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks.
To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures.
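The key architectural change the summary mentions is switching a decoder-only LLM from causal to bidirectional attention for embedding; a minimal sketch of the two additive attention-mask variants (mask construction only, not ULLME's full plug-and-play code):

```python
import torch

def attention_bias(seq_len: int, bidirectional: bool) -> torch.Tensor:
    """Additive attention bias: 0 where attention is allowed, -inf where blocked.
    Decoder-only LLMs use the causal variant; embedding-oriented setups like the one
    described above can switch to the bidirectional variant so every token sees full context."""
    if bidirectional:
        return torch.zeros(seq_len, seq_len)
    blocked = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(blocked, diagonal=1)  # block the upper triangle: each token attends only to the past
```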
arXiv Detail & Related papers (2024-08-06T18:53:54Z) - Bridging the Gap between Different Vocabularies for LLM Ensemble [10.669552498083709]
Vocabulary discrepancies among various large language models (LLMs) have constrained previous studies.
We propose a novel method to Ensemble LLMs via Vocabulary Alignment (EVA).
EVA bridges the lexical gap among various LLMs, enabling meticulous ensemble at each generation step.
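One plausible building block for such alignment, shown here as a hedged sketch rather than EVA's actual procedure, is mapping every token of one model to its nearest neighbour in another model's embedding space:

```python
import torch
import torch.nn.functional as F

def nearest_token_map(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """For every token of model A, return the index of its most similar token in model B.
    Assumes both embedding tables have been projected into a shared space of equal width."""
    a = F.normalize(emb_a, dim=-1)                 # (vocab_a, d)
    b = F.normalize(emb_b, dim=-1)                 # (vocab_b, d)
    return (a @ b.T).argmax(dim=-1)                # (vocab_a,) cosine-nearest B token per A token
```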
arXiv Detail & Related papers (2024-04-15T06:28:20Z) - Text Clustering with Large Language Model Embeddings [0.0]
The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms. Recent advancements in large language models (LLMs) have the potential to enhance this task. Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
arXiv Detail & Related papers (2024-03-22T11:08:48Z) - RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z) - Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning [77.7070536959126]
In-context learning (ICL) emerges as a promising capability of large language models (LLMs)
In this paper, we investigate the working mechanism of ICL through an information flow lens.
We introduce an anchor re-weighting method to improve ICL performance, a demonstration compression technique to expedite inference, and an analysis framework for diagnosing ICL errors in GPT2-XL.
arXiv Detail & Related papers (2023-05-23T15:26:20Z)