KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
- URL: http://arxiv.org/abs/2506.20923v5
- Date: Tue, 14 Oct 2025 01:50:35 GMT
- Title: KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
- Authors: Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang
- Abstract summary: We propose KaLM-Embedding-V2, a series of versatile and compact embedding models. For the model architecture, we implement the models at a compact 0.5B size with simple mean-pooling to produce fixed-length embeddings. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation.
- Score: 63.13906424204078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Large Language Model (LLM)-based text embedding models primarily focus on data scaling or synthesis, while leaving training techniques and data quality under-explored, thereby constraining performance. In this work, we propose KaLM-Embedding-V2, a series of versatile and compact embedding models, systematically incentivizing advanced embedding capability in LLMs through superior training techniques and high-quality data. For the model architecture, we implement the models at a compact 0.5B size with simple mean-pooling to produce fixed-length embeddings, and remove the causal attention mask to enable fully bidirectional representation learning. For training techniques, we propose a progressive multi-stage training pipeline: pre-training on weakly supervised large-scale datasets, fine-tuning on supervised high-quality datasets, and contrastive distillation with fine-grained soft signals, integrated with focal-style reweighting and online hard-negative mixing to emphasize difficult samples and enrich hard negatives, respectively. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation to improve both performance and generalization, leveraging task-specific instructions, hard-negative mining, and example-based multi-class labeling to ensure high quality. Combining these techniques, our KaLM-Embedding-V2 series achieves state-of-the-art performance on the Massive Text Embedding Benchmark, outperforming models of comparable size and rivaling models 3-26x larger, setting a new standard for versatile and compact embedding models under 1B parameters.
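The abstract names the core mechanisms without giving formulas. The PyTorch sketch below illustrates how mean-pooling over a bidirectional encoder and an InfoNCE-style loss with focal-style reweighting and online hard-negative mixing might fit together; the function names, the interpolation form of the mixing, and the exact focal weighting are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden_states, attention_mask):
    """Mean-pool bidirectional token states into one fixed-length embedding."""
    mask = attention_mask.unsqueeze(-1).float()              # (B, T, 1)
    return (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

def focal_infonce(q, p, negs, tau=0.05, gamma=2.0, mix_ratio=0.2):
    """InfoNCE with focal-style reweighting and online hard-negative mixing.

    q: (B, D) query embeddings; p: (B, D) positives; negs: (B, K, D) mined
    hard negatives. gamma and mix_ratio are illustrative hyperparameters.
    """
    q, p = F.normalize(q, dim=-1), F.normalize(p, dim=-1)
    negs = F.normalize(negs, dim=-1)
    B, K, D = negs.shape

    # Online hard-negative mixing (assumed form): interpolate random pairs
    # of existing hard negatives to enrich the negative pool on the fly.
    n_mix = max(1, int(K * mix_ratio))
    i = torch.randint(K, (B, n_mix, 1)).expand(-1, -1, D)
    j = torch.randint(K, (B, n_mix, 1)).expand(-1, -1, D)
    lam = torch.rand(B, n_mix, 1)
    mixed = lam * negs.gather(1, i) + (1 - lam) * negs.gather(1, j)
    negs = torch.cat([negs, F.normalize(mixed, dim=-1)], dim=1)

    pos_sim = (q * p).sum(-1, keepdim=True)                  # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", q, negs)            # (B, K + n_mix)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau
    log_p = F.log_softmax(logits, dim=1)[:, 0]               # log P(positive)

    # Focal-style reweighting: (1 - P)^gamma vanishes for easy pairs, so
    # gradient mass concentrates on the difficult samples.
    weight = (1.0 - log_p.exp()).pow(gamma)
    return (-weight * log_p).mean()
```

The focal weight shrinks toward zero for pairs the model already ranks correctly, which is one way to realize "emphasize difficult samples"; interpolating existing negatives is one cheap way to grow the negative pool online without extra mining.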
Related papers
- Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition [12.731093427395985]
ADAMAB is an efficient embedding calibration framework for few-shot pattern recognition. Our experiments demonstrate the superior performance of ADAMAB, with up to 40% accuracy improvement.
arXiv Detail & Related papers (2026-02-22T23:39:21Z) - DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing [67.77471070868852]
DeepGen 1.0 is a lightweight 5B unified model for image generation and editing. It is trained on only 50M samples, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
arXiv Detail & Related papers (2026-02-12T17:44:24Z) - Bagging-Based Model Merging for Robust General Text Embeddings [73.51674133699196]
General-purpose text embedding models underpin a wide range of NLP and information retrieval applications. We present a systematic study of multi-task training for text embeddings from two perspectives: data scheduling and model merging. We propose Bagging-based rObust mOdel Merging (BOOM), which trains multiple embedding models on sampled subsets and merges them into a single model.
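The blurb states the mechanism (train on sampled subsets, merge into one model) but not the merge rule. A minimal sketch, assuming the merge is a uniform parameter average of models fine-tuned from a shared base checkpoint; BOOM's actual merge rule may differ:

```python
import copy
import torch

def boom_merge(models):
    """Merge bagged embedding models by uniformly averaging their parameters.

    Assumes every model was fine-tuned from the same base checkpoint on a
    different sampled data subset, so the weights are directly averageable.
    """
    merged = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            param.copy_(torch.stack(
                [dict(m.named_parameters())[name] for m in models]).mean(0))
    return merged
```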
arXiv Detail & Related papers (2026-02-05T15:45:08Z) - QZhou-Embedding Technical Report [16.213081669689185]
Built upon the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework comprising specialized data transformation and training strategies. Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance.
arXiv Detail & Related papers (2025-08-29T13:47:22Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation [81.36747103102459]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Current state-of-the-art methods focus on training innovative architectural designs on confined datasets. We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
arXiv Detail & Related papers (2025-01-16T18:59:46Z) - DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data [61.62554324594797]
We propose DreamMask, which explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data. In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods. For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
arXiv Detail & Related papers (2025-01-03T19:00:00Z) - KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model [27.25688303240741]
KaLM-Embedding is a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance.
arXiv Detail & Related papers (2025-01-02T03:17:51Z) - VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We study the potential for building universal embeddings capable of handling a wide range of downstream tasks. We build a series of VLM2Vec models on SoTA VLMs like Phi-3.5-V and LLaVA-1.6 and evaluate them on MMEB's evaluation split. Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models.
arXiv Detail & Related papers (2024-10-07T16:14:05Z) - NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z) - POINTS: Improving Your Vision-language Model with Affordable Strategies [28.611705477757454]
We train a robust baseline model using the latest advancements in vision-language models.
We filter pre-training data using perplexity, selecting the lowest perplexity data for training.
During visual instruction tuning, we use model soup across different datasets when adding more datasets yields only marginal improvements.
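The perplexity filter described in this entry reduces to a few lines. A minimal sketch, assuming a hypothetical helper `lm_score` that returns the average per-token negative log-likelihood of a text under a reference language model (so perplexity = exp(nll)):

```python
import math

def perplexity_filter(docs, lm_score, keep_frac=0.5):
    """Keep only the lowest-perplexity fraction of pre-training documents.

    `lm_score` is a hypothetical scoring helper, not from the paper: it
    returns average per-token negative log-likelihood under a reference LM.
    """
    scored = sorted(docs, key=lambda d: math.exp(lm_score(d)))
    return scored[: int(len(scored) * keep_frac)]
```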
arXiv Detail & Related papers (2024-09-07T13:41:37Z) - NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models [38.41524186248607]
We introduce NV-Embed, incorporating architectural designs, training procedures, and curated datasets. For the model architecture, we propose a latent attention layer to obtain pooled embeddings. For the training algorithm, we introduce a two-stage contrastive instruction-tuning method.
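A rough sketch of what a latent attention pooling layer could look like: token states attend over a trainable latent array, and the attended outputs are masked-mean-pooled into a single embedding. The dimensions, head count, and query/key roles below are assumptions; NV-Embed's exact layer may differ.

```python
import torch
import torch.nn as nn

class LatentAttentionPooling(nn.Module):
    """Pool token states by attending over a trainable latent array.

    Token hidden states act as queries against learned latent keys/values;
    the outputs are masked-mean-pooled. Sizes here are placeholders.
    """
    def __init__(self, dim=1024, n_latents=512, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, hidden_states, attention_mask):
        kv = self.latents.unsqueeze(0).expand(hidden_states.size(0), -1, -1)
        out, _ = self.attn(hidden_states, kv, kv)   # tokens query the latents
        mask = attention_mask.unsqueeze(-1).float()
        return (out * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
```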
arXiv Detail & Related papers (2024-05-27T17:59:45Z) - GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning [0.0]
GISTEmbed is a novel strategy that enhances in-batch negative selection during contrastive training through a guide model.
Benchmarked against the Massive Text Embedding Benchmark (MTEB), GISTEmbed showcases consistent performance improvements across various model sizes.
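The guide-model mechanism lends itself to a short sketch: a frozen guide scores every in-batch candidate, and candidates it ranks at or above the annotated positive are masked out as likely false negatives. The margin form below is an assumption, not GISTEmbed's exact rule.

```python
import torch
import torch.nn.functional as F

def guided_negative_mask(guide_q, guide_d, margin=0.0):
    """Mask in-batch candidates a frozen guide model flags as false negatives.

    guide_q, guide_d: (B, D) query/document embeddings from the guide model.
    Candidate j stays a negative for query i only if the guide scores it
    strictly below the annotated positive (minus an optional margin).
    """
    sim = F.normalize(guide_q, dim=-1) @ F.normalize(guide_d, dim=-1).t()
    pos = sim.diagonal().unsqueeze(1)      # guide's score for the true pair
    keep = sim < pos - margin              # True => treat as a valid negative
    keep.fill_diagonal_(True)              # always keep the positive itself
    return keep
```

The mask would then suppress flagged candidates in the contrastive loss, e.g. `logits.masked_fill(~keep, float('-inf'))` before the softmax.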
arXiv Detail & Related papers (2024-02-26T18:55:15Z) - Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models [39.65879784788677]
We introduce a novel training-data selection method based on the learning percentage of the samples.
We assert that current language models possess the capability to autonomously select high-quality training data.
Our approach offers a more efficient alternative to existing training-data selection methods.
arXiv Detail & Related papers (2024-02-16T03:39:37Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
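Dynamic batch loading as described reduces to a small weight-update rule over domains. A minimal sketch, assuming a multiplicative (exponentiated-gradient-style) update driven by each domain's excess loss over its reference; the paper's exact update may differ:

```python
import numpy as np

def update_domain_weights(weights, losses, ref_losses, lr=1.0):
    """Re-weight sampling across domains by their excess loss.

    Domains whose current loss exceeds their reference loss are up-weighted
    multiplicatively, so the next batch draws more data from them.
    """
    excess = np.maximum(np.asarray(losses) - np.asarray(ref_losses), 0.0)
    new_w = np.asarray(weights) * np.exp(lr * excess)
    return new_w / new_w.sum()
```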
arXiv Detail & Related papers (2023-10-10T15:13:30Z)