CASPER: Concept-integrated Sparse Representation for Scientific Retrieval
- URL: http://arxiv.org/abs/2508.13394v1
- Date: Mon, 18 Aug 2025 23:00:57 GMT
- Title: CASPER: Concept-integrated Sparse Representation for Scientific Retrieval
- Authors: Lam Thanh Do, Linh Van Nguyen, David Fu, Kevin Chen-Chuan Chang
- Abstract summary: We propose CASPER, a sparse retrieval model for scientific search that utilizes tokens and keyphrases as representation units. We show that CASPER can be effectively used for keyphrase generation tasks, achieving competitive performance with the established CopyRNN.
- Score: 17.680327408224237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exponential growth of scientific literature has made it increasingly difficult for researchers to keep up with the literature. In an attempt to alleviate this problem, we propose CASPER, a sparse retrieval model for scientific search that utilizes tokens and keyphrases as representation units (i.e. dimensions in the sparse embedding space), enabling it to represent queries and documents with research concepts and match them at both granular and conceptual levels. To overcome the lack of suitable training data, we propose mining training data by leveraging scholarly references (i.e. signals that capture how research concepts of papers are expressed in different settings), including titles, citation contexts, author-assigned keyphrases, and co-citations. CASPER outperforms strong dense and sparse retrieval baselines on eight scientific retrieval benchmarks. Moreover, we demonstrate that through simple post-processing, CASPER can be effectively used for the keyphrase generation tasks, achieving competitive performance with the established CopyRNN while producing more diverse keyphrases and being nearly four times faster.
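To make the idea of "tokens and keyphrases as representation units" concrete, here is a minimal sketch of a sparse vector whose dimensions are both tokens and keyphrases, scored with a dot product. It is not the CASPER model: CASPER learns its weights, whereas this sketch assumes raw token counts and a presence weight of 1 for keyphrases. The keyphrase vocabulary `KEYPHRASES` and the helper names `sparse_repr` and `score` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical keyphrase vocabulary (assumption, not from the paper).
KEYPHRASES = ("sparse retrieval", "keyphrase generation", "citation context")


def sparse_repr(text, keyphrases=KEYPHRASES):
    """Map text to a sparse {unit: weight} dict over tokens and keyphrases."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Token dimensions: weight is the raw count (a learned weight in practice).
    vec = Counter(f"tok:{t}" for t in tokens)
    # Keyphrase dimensions: weight 1 if the phrase occurs as whole words.
    padded = " " + " ".join(tokens) + " "
    for kp in keyphrases:
        if f" {kp} " in padded:
            vec[f"kp:{kp}"] = 1
    return dict(vec)


def score(query_vec, doc_vec):
    """Dot product over the dimensions the two sparse vectors share."""
    return sum(w * doc_vec.get(unit, 0) for unit, w in query_vec.items())


query = sparse_repr("sparse retrieval for science")
doc_a = sparse_repr("We study sparse retrieval models.")
doc_b = sparse_repr("Retrieval with sparse data")
print(score(query, doc_a), score(query, doc_b))
```

Both documents share the tokens "sparse" and "retrieval" with the query, but only the first contains the keyphrase "sparse retrieval", so it gains an extra match on the keyphrase dimension. This is the sense in which matching can happen at both the granular (token) and conceptual (keyphrase) levels.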
Related papers
- ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery [6.780086370528623]
We introduce ReSearch, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery. ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture. Experiments demonstrate that ReSearch consistently improves recall and ranking performance over baseline methods.
arXiv Detail & Related papers (2026-01-20T17:27:12Z)
- SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature [52.36039386997026]
We introduce SciRAG, an open-source framework for scientific literature exploration. We introduce three key innovations: (1) adaptive retrieval that flexibly alternates between sequential and parallel evidence gathering; (2) citation-aware symbolic reasoning that leverages citation graphs to organize and filter documents; and (3) outline-guided synthesis that plans, critiques, and refines answers to ensure coherence and transparent attribution.
arXiv Detail & Related papers (2025-11-18T11:09:19Z)
- PairSem: LLM-Guided Pairwise Semantic Matching for Scientific Document Retrieval [41.064644438540135]
Pairwise Semantic Matching (PairSem) is a framework that represents relevant semantics as entity-aspect pairs. Experiments on multiple datasets and retrievers demonstrate that PairSem significantly improves retrieval performance.
arXiv Detail & Related papers (2025-10-10T22:21:49Z)
- Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking [32.40639079110799]
SemRank is an effective and efficient paper retrieval framework. It combines query understanding with a concept-based semantic index. Experiments show that SemRank consistently improves the performance of various base retrievers.
arXiv Detail & Related papers (2025-05-27T22:49:18Z)
- Self-Compositional Data Augmentation for Scientific Keyphrase Generation [28.912937922090038]
We present a self-compositional data augmentation method for keyphrase generation.
We measure the relatedness of training documents based on their shared keyphrases, and combine similar documents to generate synthetic samples.
arXiv Detail & Related papers (2024-11-05T12:22:51Z)
- Taxonomy-guided Semantic Indexing for Academic Paper Search [51.07749719327668]
TaxoIndex is a semantic index framework for academic paper search.
It organizes key concepts from papers as a semantic index guided by an academic taxonomy.
It can be flexibly employed to enhance existing dense retrievers.
arXiv Detail & Related papers (2024-10-25T00:00:17Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- SimCKP: Simple Contrastive Learning of Keyphrase Representations [36.88517357720033]
We propose SimCKP, a simple contrastive learning framework that consists of two stages: 1) An extractor-generator that extracts keyphrases by learning context-aware phrase-level representations in a contrastive manner while also generating keyphrases that do not appear in the document; and 2) A reranker that adapts scores for each generated phrase by likewise aligning their representations with the corresponding document.
arXiv Detail & Related papers (2023-10-12T11:11:54Z)
- Retrieval Augmentation for Commonsense Reasoning: A Unified Approach [64.63071051375289]
We propose a unified framework of retrieval-augmented commonsense reasoning (called RACo).
Our proposed RACo can significantly outperform other knowledge-enhanced counterparts.
arXiv Detail & Related papers (2022-10-23T23:49:08Z)
- Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific paper summarization by utilizing the papers' citation graph.
We construct a novel scientific paper summarization dataset, Semantic Scholar Network (SSN), which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
- Keyphrase Extraction with Dynamic Graph Convolutional Networks and Diversified Inference [50.768682650658384]
Keyphrase extraction (KE) aims to summarize a set of phrases that accurately express a concept or a topic covered in a given document.
Recently, the Sequence-to-Sequence (Seq2Seq) based generative framework has been widely used for the KE task, and it has obtained competitive performance on various benchmarks.
In this paper, we propose to adopt the Dynamic Graph Convolutional Networks (DGCN) to solve the above two problems simultaneously.
arXiv Detail & Related papers (2020-10-24T08:11:23Z)
- A Joint Learning Approach based on Self-Distillation for Keyphrase Extraction from Scientific Documents [29.479331909227998]
Keyphrase extraction is the task of extracting a small set of phrases that best describe a document.
Most existing benchmark datasets for the task typically have limited numbers of annotated documents.
We propose a simple and efficient joint learning approach based on the idea of self-distillation.
arXiv Detail & Related papers (2020-10-22T18:36:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.