Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings
- URL: http://arxiv.org/abs/2509.12892v1
- Date: Tue, 16 Sep 2025 09:48:11 GMT
- Title: Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings
- Authors: Shiyu Li, Yang Tang, Ruijie Liu, Shi-Zhe Chen, Xi Chen
- Abstract summary: Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. With only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and the Chinese MTEB (as of May 19, 2025).
- Score: 25.724646707322986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, an approach limited by the data and training gaps between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs to LLM pretraining to bridge the data gap. Building on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with a token-level loss, embedding models use a bidirectional mask with a sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism that gradually transitions between these two types of masks, enabling the model to learn more comprehensive representations. On top of this, we propose a dynamic hard negative mining method that exposes the model to progressively more difficult negative examples throughout training. Despite being intuitive in design and having only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and the Chinese MTEB (as of May 19, 2025).
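To make the soft-masking idea above concrete, here is a minimal PyTorch sketch. It assumes an additive attention-bias parameterization and a linear annealing schedule; both are our assumptions, since the abstract does not specify the exact form:

```python
import math
import torch

def soft_attention_bias(seq_len: int, alpha: float) -> torch.Tensor:
    """Additive attention bias interpolating between a causal mask
    (alpha = 1.0: future positions get -inf) and a bidirectional mask
    (alpha = 0.0: zero bias everywhere)."""
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    # Giving future positions pre-softmax weight (1 - alpha) is equivalent
    # to an additive log-domain bias of log(1 - alpha).
    fill = float("-inf") if alpha >= 1.0 else math.log(1.0 - alpha)
    return torch.zeros(seq_len, seq_len).masked_fill(future, fill)

def alpha_schedule(step: int, total_steps: int) -> float:
    """Linear anneal from fully causal (1.0) to fully bidirectional (0.0)."""
    return max(0.0, 1.0 - step / total_steps)
```

Adding soft_attention_bias(seq_len, alpha_schedule(step, total_steps)) to the attention scores before the softmax starts training fully causal (matching the pretrained LLM) and ends fully bidirectional (matching the embedding objective).

The dynamic hard negative mining can likewise be pictured as periodically re-scoring a candidate pool with the current encoder, so that negatives get harder as the model improves. The refresh policy and the false-negative margin below are assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def mine_hard_negatives(query_emb: torch.Tensor, cand_embs: torch.Tensor,
                        pos_idx: int, k: int = 8, margin: float = 0.05) -> torch.Tensor:
    """Return indices of the k hardest negatives for one query. Re-running
    this with fresh embeddings every few thousand steps lets the negative
    difficulty track the current model."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), cand_embs)  # (N,)
    pos_sim = sims[pos_idx].clone()
    sims[pos_idx] = float("-inf")                  # never sample the positive
    sims[sims > pos_sim + margin] = float("-inf")  # drop likely false negatives
    return torch.topk(sims, k).indices
```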
Related papers
- Multimodal LLMs as Customized Reward Models for Text-to-Image Generation [60.164968941945645]
We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives. LLaVA-Reward directly utilizes the hidden states of multimodal large language models (MLLMs). We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking.
arXiv Detail & Related papers (2025-07-28T23:52:53Z)
- LangBridge: Interpreting Image as a Combination of Language Embeddings [64.36674412359778]
LangBridge is a novel adapter that explicitly maps visual tokens to linear combinations of text embeddings. Our results demonstrate that a LangBridge pre-trained on Qwen2-0.5B can be directly applied to larger models such as LLaMA3-8B or Qwen2.5-14B while maintaining competitive performance. (A toy sketch of this mapping appears after the list.)
arXiv Detail & Related papers (2025-03-25T07:24:27Z)
- Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
- FDLLM: A Dedicated Detector for Black-Box LLMs Fingerprinting [20.07438999071414]
Large Language Models (LLMs) are rapidly transforming the landscape of digital content creation. We present FD-Dataset, a comprehensive bilingual fingerprinting benchmark comprising 90,000 text samples from 20 well-known proprietary and open-source LLMs. We also present FDLLM, a novel fingerprinting method that leverages parameter-efficient Low-Rank Adaptation (LoRA) to fine-tune a foundation model.
arXiv Detail & Related papers (2025-01-27T13:18:40Z)
- Multi-Grained Patch Training for Efficient LLM-based Recommendation [40.5721110129484]
Large Language Models (LLMs) have emerged as a new paradigm for recommendation by converting interacted item history into language modeling. We propose PatchRec, a multi-grained patch training method consisting of two stages: Patch Pre-training, which familiarizes LLMs with aggregated embeddings (patches), and Patch Fine-tuning, which enables LLMs to capture time-aware significance in interaction history. (A toy sketch of this aggregation appears after the list.)
arXiv Detail & Related papers (2025-01-25T05:30:58Z)
- Text-like Encoding of Collaborative Information in Large Language Models for Recommendation [58.87865271693269]
We introduce BinLLM, a novel method to seamlessly integrate collaborative information with Large Language Models for Recommendation (LLMRec).
BinLLM converts collaborative embeddings from external models into binary sequences.
BinLLM provides options to compress the binary sequence using dot-decimal notation to avoid excessively long lengths. (A toy sketch of this encoding appears after the list.)
arXiv Detail & Related papers (2024-06-05T12:45:25Z)
- LLM Attributor: Interactive Visual Attribution for LLM Generation [29.116016627864095]
LLM Attributor is a Python library that provides interactive visualizations for training data attribution of large language models.
Our library offers a new way to quickly attribute an LLM's text generation to training data points.
arXiv Detail & Related papers (2024-04-01T13:16:34Z)
- Making Large Language Models A Better Foundation For Dense Retrieval [19.38740248464456]
Dense retrieval needs to learn discriminative text embeddings to represent the semantic relationship between query and document.
It may benefit from the use of large language models (LLMs), given their strong capability for semantic understanding.
We propose LLaRA (LLM adapted for dense RetrievAl), which works as a post-hoc adaptation of LLMs for dense retrieval.
arXiv Detail & Related papers (2023-12-24T15:10:35Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs [50.17767479660832]
Vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to 'understand' the image input.
We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware.
arXiv Detail & Related papers (2023-07-13T17:51:58Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
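For the LangBridge entry above, here is a minimal sketch of what mapping visual tokens to linear combinations of text embeddings could look like. The linear-projection-plus-softmax weighting and all dimensions are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class LangBridgeAdapter(nn.Module):
    """Toy adapter in the spirit of LangBridge: each visual token is
    re-expressed as a convex combination of the LLM's (frozen)
    text-embedding rows."""
    def __init__(self, vis_dim: int, text_embedding: torch.Tensor):
        super().__init__()
        vocab_size, _ = text_embedding.shape
        self.register_buffer("text_embedding", text_embedding)  # frozen (vocab, d_text)
        self.proj = nn.Linear(vis_dim, vocab_size)              # visual token -> mixing logits

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, n_tokens, vis_dim)
        weights = self.proj(vis_tokens).softmax(dim=-1)         # (batch, n, vocab)
        return weights @ self.text_embedding                    # (batch, n, d_text)
```

Because the output always lies in the span of the text-embedding rows, swapping in a larger LLM's embedding table is, in principle, all that transfer requires, which is consistent with the Qwen2-0.5B-to-LLaMA3-8B claim in the summary.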
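For the PatchRec entry, patch aggregation can be pictured as pooling consecutive item embeddings into single patch tokens for the LLM. Mean pooling and a fixed patch size are assumptions here, not the paper's exact method:

```python
import torch

def to_patches(item_embs: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Compress a long interaction history by mean-pooling every
    `patch_size` consecutive item embeddings into one 'patch' token."""
    n, d = item_embs.shape
    assert n % patch_size == 0, "pad or truncate the history first"
    return item_embs.view(n // patch_size, patch_size, d).mean(dim=1)
```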
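For the BinLLM entry, a toy version of the text-like encoding: binarize the collaborative embedding, then compress the bit string into dot-separated bytes in the spirit of dot-decimal notation. Sign thresholding and 8-bit grouping are assumptions, not the paper's exact recipe:

```python
import numpy as np

def embedding_to_dotdecimal(emb: np.ndarray) -> str:
    """Binarize a collaborative embedding by sign, then pack the bits
    into bytes and render them dot-separated (like an IPv4 address)
    so the resulting prompt text stays short."""
    bits = (emb > 0).astype(np.uint8)    # one bit per dimension
    pad = (-len(bits)) % 8
    bits = np.pad(bits, (0, pad))        # pad to a multiple of 8 bits
    octets = np.packbits(bits)           # 8 bits -> one byte value
    return ".".join(str(b) for b in octets)

# A 32-d embedding then becomes a short string such as "203.17.90.244",
# which can be spliced into an LLM prompt as ordinary text.
```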