CSPLADE: Learned Sparse Retrieval with Causal Language Models
- URL: http://arxiv.org/abs/2504.10816v3
- Date: Thu, 06 Nov 2025 19:04:06 GMT
- Title: CSPLADE: Learned Sparse Retrieval with Causal Language Models
- Authors: Zhichao Xu, Aosong Feng, Yijun Tian, Haibo Ding, Lin Lee Cheong,
- Abstract summary: We identify two challenges in training large language models (LLM) for Learned sparse retrieval (LSR)<n>We propose two corresponding techniques: (1) a lightweight adaptation training phase to eliminate training instability; (2) two model variants to enable bidirectional information.<n>With these techniques, we are able to train LSR models with 8B scale LLM, and achieve competitive retrieval performance with reduced index size.
- Score: 13.999080540889494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, dense retrieval has been the focus of information retrieval (IR) research. While effective, dense retrieval produces uninterpretable dense vectors, and suffers from the drawback of large index size. Learned sparse retrieval (LSR) has emerged as promising alternative, achieving competitive retrieval performance while also being able to leverage the classical inverted index data structure for efficient retrieval. However, limited works have explored scaling LSR beyond BERT scale. In this work, we identify two challenges in training large language models (LLM) for LSR: (1) training instability during the early stage of contrastive training; (2) suboptimal performance due to pre-trained LLM's unidirectional attention. To address these challenges, we propose two corresponding techniques: (1) a lightweight adaptation training phase to eliminate training instability; (2) two model variants to enable bidirectional information. With these techniques, we are able to train LSR models with 8B scale LLM, and achieve competitive retrieval performance with reduced index size. Furthermore, we are among the first to analyze the performance-efficiency tradeoff of LLM-based LSR model through the lens of model quantization. Our findings provide insights into adapting LLMs for efficient retrieval modeling.
Related papers
- LACONIC: Dense-Level Effectiveness for Scalable Sparse Retrieval via a Two-Phase Training Curriculum [73.82125917416067]
LACONIC is a family of learned sparse retrievers based on the Llama-3 architecture.<n>The 8B variant achieves a state-of-the-art 60.2 nDCG on the MTEB Retrieval benchmark, ranking 15th on the leaderboard.
arXiv Detail & Related papers (2026-01-04T22:42:20Z) - Language Ranker: A Lightweight Ranking framework for LLM Decoding [70.01564145836129]
This paper conceptualizes the decoding process as analogous to the ranking stage in recommendation pipelines.<n>Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses.<n> Experiments show that Language Ranker achieves performance comparable to large-scale reward models, while requiring only 0.5M additional parameters.
arXiv Detail & Related papers (2025-10-23T17:56:46Z) - LLMs as Sparse Retrievers:A Framework for First-Stage Product Search [103.70006474544364]
Product search is a crucial component of modern e-commerce platforms, with billions of user queries every day.<n>Sparse retrieval methods suffer from severe vocabulary mismatch issues, leading to suboptimal performance in product search scenarios.<n>With their potential for semantic analysis, large language models (LLMs) offer a promising avenue for mitigating vocabulary mismatch issues.<n>We propose PROSPER, a framework for PROduct search leveraging LLMs as SParsE Retrievers.
arXiv Detail & Related papers (2025-10-21T11:13:21Z) - Large Reasoning Embedding Models: Towards Next-Generation Dense Retrieval Paradigm [16.78399933831573]
We propose the Large Reasoning Embedding Model (LREM), which integrates reasoning processes into representation learning.<n>For difficult queries, LREM first conducts reasoning to achieve a deep understanding of the original query, and then produces a reasoning-augmented query embedding for retrieval.<n>This reasoning process effectively bridges the semantic gap between original queries and target items, significantly improving retrieval accuracy.
arXiv Detail & Related papers (2025-10-16T05:37:39Z) - Harnessing the Power of Reinforcement Learning for Language-Model-Based Information Retriever via Query-Document Co-Augmentation [35.70731674603417]
Large Language Models (LLMs) can be used to augment both user queries and corpus documents.<n>We present an LLM-based retriever empowered to augment both user queries and corpus documents.<n>Our approach significantly enhances LLM-based retrieval performance in both sparse and dense settings.
arXiv Detail & Related papers (2025-06-23T14:14:43Z) - ZeroSearch: Incentivize the Search Capability of LLMs without Searching [69.55482019211597]
We introduce ZeroSearch, a framework that incentivizes the capabilities of large language models to use a real search engine with simulated searches during training.<n>Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents.
arXiv Detail & Related papers (2025-05-07T17:30:22Z) - Direct Retrieval-augmented Optimization: Synergizing Knowledge Selection and Language Models [83.8639566087953]
We propose a direct retrieval-augmented optimization framework, named DRO, that enables end-to-end training of two key components.<n>DRO alternates between two phases: (i) document permutation estimation and (ii) re-weighted, progressively improving RAG components.<n>Our theoretical analysis reveals that DRO is analogous to policy-gradient methods in reinforcement learning.
arXiv Detail & Related papers (2025-05-05T23:54:53Z) - Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
We investigate how model size, training data scale, and inference-time compute jointly influence generative retrieval performance.
Our experiments show that n-gram-based methods demonstrate strong alignment with both training and inference scaling laws.
We find that LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval.
arXiv Detail & Related papers (2025-03-24T17:59:03Z) - R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
textbfR1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models.<n>Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start.<n>Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z) - Scaling Sparse and Dense Retrieval in Decoder-Only LLMs [20.173669986209024]
Scaling large language models (LLMs) has shown great potential for improving retrieval model performance.<n>Previous studies have mainly focused on dense retrieval trained with contrastive loss (CL)<n>Sparse retrieval models consistently outperform dense retrieval across both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks.
arXiv Detail & Related papers (2025-02-21T15:28:26Z) - Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains.<n>Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities.<n>We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z) - ScalingNote: Scaling up Retrievers with Large Language Models for Real-World Dense Retrieval [72.2676180980573]
Large Language Models (LLMs) have exhibited superior performance that can be leveraged for scaling up dense retrieval.
We propose ScalingNote, a two-stage method to exploit the scaling potential of LLMs for retrieval while maintaining online query latency.
Our two-stage scaling method outperforms end-to-end models and verifies the scaling law of dense retrieval with LLMs in industrial scenarios.
arXiv Detail & Related papers (2024-11-24T09:27:43Z) - Achieving Peak Performance for Large Language Models: A Systematic Review [0.0]
Large language models (LLMs) have achieved remarkable success in natural language processing (NLP)
As models grow into the trillion- parameter range, computational and memory costs increase significantly.
This makes it difficult for many researchers to access the resources needed to train or apply these models.
arXiv Detail & Related papers (2024-09-07T13:57:41Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - Take the Bull by the Horns: Hard Sample-Reweighted Continual Training
Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.