Related papers: Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?

Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?

URL: http://arxiv.org/abs/2409.02727v2
Date: Thu, 5 Sep 2024 07:17:59 GMT
Title: Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?
Authors: Yixuan Tang, Yi Yang,
Abstract summary: We propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This paper sheds light on effective training strategies for LLM-based embedding models.
Score: 18.990655668481075
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.

Related papers

Do BERT-Like Bidirectional Models Still Perform Better on Text Classification in the Era of LLMs? [13.077853383476974]
This study challenges the prevailing "LLM-centric" trend by systematically comparing three category methods.<n>Our findings reveal that BERT-like models often outperform LLMs.<n>We propose TaMAS, a fine-grained task selection strategy, advocating for a nuanced, task-driven approach over a one-size-fits-all reliance on LLMs.
arXiv Detail & Related papers (2025-05-23T05:46:42Z)
Efficient Model Selection for Time Series Forecasting via LLMs [52.31535714387368]
We propose to leverage Large Language Models (LLMs) as a lightweight alternative for model selection. Our method eliminates the need for explicit performance matrices by utilizing the inherent knowledge and reasoning capabilities of LLMs.
arXiv Detail & Related papers (2025-04-02T20:33:27Z)
LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning [39.54891426369773]
Trade-offs between model size, architecture, and performance remain underexplored. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures.
arXiv Detail & Related papers (2025-03-19T18:10:12Z)
Large Language Models as Attribution Regularizers for Efficient Model Training [0.0]
Large Language Models (LLMs) have demonstrated remarkable performance across diverse domains. We introduce a novel yet straightforward method for incorporating LLM-generated global task feature attributions into the training process of smaller networks. Our approach yields superior performance in few-shot learning scenarios.
arXiv Detail & Related papers (2025-02-27T16:55:18Z)
A Comprehensive Analysis on LLM-based Node Classification Algorithms [21.120619437937382]
We develop a comprehensive and testbed for node classification using Large Language Models (LLMs) It includes ten datasets, eight LLM-based algorithms, and three learning paradigms, and is designed for easy extension with new methods and datasets. We conduct extensive experiments, training and evaluating over 2,200 models, to determine the key settings that affect performance. Our findings uncover eight insights, e.g., LLM-based methods can significantly outperform traditional methods in a semi-supervised setting, while the advantage is marginal in a supervised setting.
arXiv Detail & Related papers (2025-02-02T15:56:05Z)
The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities [51.594836904623534]
We investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. We show that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve.
arXiv Detail & Related papers (2025-01-15T10:57:55Z)
LLMs are Also Effective Embedding Models: An In-depth Overview [40.53941563464671]
Large language models (LLMs) have revolutionized natural language processing by achieving state-of-the-art performance across various tasks. Recently, their effectiveness as embedding models has gained attention, marking a paradigm shift from traditional encoder-only models like ELMo and BERT to decoder-only, large-scale LLMs like GPT, LLaMA, and Mistral.
arXiv Detail & Related papers (2024-12-17T06:48:24Z)
Comparative Analysis of Pooling Mechanisms in LLMs: A Sentiment Analysis Perspective [2.2334256816037987]
Transformer-based models like BERT and GPT rely on pooling layers to aggregate token-level embeddings into sentence-level representations. Common pooling mechanisms such as Mean, Max, and Weighted Sum play a pivotal role in this aggregation process. This paper investigates the effects of these pooling mechanisms on two prominent LLM families -- BERT and GPT, in the context of sentence-level sentiment analysis.
arXiv Detail & Related papers (2024-11-22T00:59:25Z)
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data. We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild [84.57103623507082]
This paper introduces Model-GLUE, a holistic Large Language Models scaling guideline. Our work starts with a benchmarking of existing LLM scaling techniques, especially selective merging, and variants of mixture. Our methodology involves the clustering of mergeable models and optimal merging strategy selection, and the integration of clusters through a model mixture.
arXiv Detail & Related papers (2024-10-07T15:55:55Z)
SoupLM: Model Integration in Large Language and Multi-Modal Models [51.12227693121004]
Training large language models (LLMs) requires significant computing resources. Existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks.
arXiv Detail & Related papers (2024-07-11T05:38:15Z)
LTSM-Bundle: A Toolbox and Benchmark on Large Language Models for Time Series Forecasting [69.33802286580786]
We introduce LTSM-Bundle, a comprehensive toolbox, and benchmark for training LTSMs. It modularized and benchmarked LTSMs from multiple dimensions, encompassing prompting strategies, tokenization approaches, base model selection, data quantity, and dataset diversity. Empirical results demonstrate that this combination achieves superior zero-shot and few-shot performances compared to state-of-the-art LTSMs and traditional TSF methods.
arXiv Detail & Related papers (2024-06-20T07:09:19Z)
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models [38.41524186248607]
We introduce the NV-Embed model with a variety of architectural designs and training procedures. Our model has achieved a record-high score of 69.32, ranking No. 1 on the Massive Text Embedding Benchmark (MTEB) We open-source the model at: https://face.co/EIR/NV-Embed-v1.
arXiv Detail & Related papers (2024-05-27T17:59:45Z)
Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration [39.35476224845088]
Large language models (LLMs) exhibit complementary strengths in various tasks, motivating the research of LLM ensembling. We propose a training-free ensemble framework DeePEn, fusing the informative probability distributions yielded by different LLMs at each decoding step.
arXiv Detail & Related papers (2024-04-19T08:52:22Z)
Knowledge Fusion of Large Language Models [73.28202188100646]
This paper introduces the notion of knowledge fusion for large language models (LLMs) We externalize their collective knowledge and unique strengths, thereby elevating the capabilities of the target model beyond those of any individual source LLM. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation.
arXiv Detail & Related papers (2024-01-19T05:02:46Z)
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
Scaling Sentence Embeddings with Large Language Models [43.19994568210206]
In this work, we propose an in-context learning-based method aimed at improving sentence embeddings performance. Our approach involves adapting the previous prompt-based representation method for autoregressive models. By scaling model size, we find scaling to more than tens of billion parameters harms the performance on semantic textual similarity tasks.
arXiv Detail & Related papers (2023-07-31T13:26:03Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.