Improving Scientific Document Retrieval with Academic Concept Index
- URL: http://arxiv.org/abs/2601.00567v1
- Date: Fri, 02 Jan 2026 04:47:49 GMT
- Title: Improving Scientific Document Retrieval with Academic Concept Index
- Authors: Jeyun Lee, Junhyoung Lee, Wonbin Kweon, Bowen Jin, Yu Zhang, Susik Yoon, Dongha Lee, Hwanjo Yu, Jiawei Han, Seongku Kang
- Abstract summary: Adapting general-domain retrievers to scientific domains is challenging due to the scarcity of large-scale domain-specific relevance annotations. Recent approaches address these issues through two independent directions. We introduce an academic concept index, which extracts key concepts from papers and organizes them guided by an academic taxonomy.
- Score: 47.95234352955763
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Adapting general-domain retrievers to scientific domains is challenging due to the scarcity of large-scale domain-specific relevance annotations and the substantial mismatch in vocabulary and information needs. Recent approaches address these issues through two independent directions that leverage large language models (LLMs): (1) generating synthetic queries for fine-tuning, and (2) generating auxiliary contexts to support relevance matching. However, both directions overlook the diverse academic concepts embedded within scientific documents, often producing redundant or conceptually narrow queries and contexts. To address this limitation, we introduce an academic concept index, which extracts key concepts from papers and organizes them guided by an academic taxonomy. This structured index serves as a foundation for improving both directions. First, we enhance the synthetic query generation with concept coverage-based generation (CCQGen), which adaptively conditions LLMs on uncovered concepts to generate complementary queries with broader concept coverage. Second, we strengthen the context augmentation with concept-focused auxiliary contexts (CCExpand), which leverages a set of document snippets that serve as concise responses to the concept-aware CCQGen queries. Extensive experiments show that incorporating the academic concept index into both query generation and context augmentation leads to higher-quality queries, better conceptual alignment, and improved retrieval performance.
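The abstract describes CCQGen as adaptively conditioning an LLM on concepts not yet covered by previously generated queries. A minimal sketch of that coverage loop is below; the helper names (`covered_concepts`, `ccqgen`) and the coverage test (simple substring matching) are illustrative assumptions, not the authors' actual implementation, and the toy `generate_query` stands in for an LLM call.

```python
# Hypothetical sketch of concept coverage-based query generation (CCQGen).
# Coverage here is naive substring matching; the paper's method uses an LLM
# conditioned on the uncovered concepts.

def covered_concepts(query: str, concepts: set[str]) -> set[str]:
    """Return the concepts mentioned (case-insensitively) in a query."""
    q = query.lower()
    return {c for c in concepts if c.lower() in q}

def ccqgen(document_concepts: set[str], generate_query, max_queries: int = 5):
    """Generate queries iteratively, each conditioned on still-uncovered concepts."""
    queries, uncovered = [], set(document_concepts)
    while uncovered and len(queries) < max_queries:
        # e.g., an LLM prompt listing the uncovered concepts
        q = generate_query(sorted(uncovered))
        queries.append(q)
        uncovered -= covered_concepts(q, document_concepts)
    return queries

# Toy stand-in for the LLM: turn the first uncovered concept into a question.
demo = ccqgen({"dense retrieval", "query expansion"},
              lambda cs: f"What is {cs[0]}?")
```

Each new query removes the concepts it covers from the pool, so later queries are pushed toward the document's remaining, conceptually distinct content; this is the "complementary queries with broader concept coverage" behavior the abstract claims.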
Related papers
- MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation [17.382062394739588]
Large language models (LLMs) struggle with high-level conceptual understanding and holistic comprehension due to limited context windows. We introduce a multimodal knowledge graph-based RAG that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into the construction of knowledge graphs, the retrieval phase, and the answer generation process.
arXiv Detail & Related papers (2025-11-26T05:00:03Z) - Domain-Specific Data Generation Framework for RAG Adaptation [58.20906914537952]
Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning power of large language models with external retrieval to enable domain-grounded responses. We propose RAGen, a framework for generating domain-grounded question-answer-context (QAC) triples tailored to diverse RAG adaptation approaches.
arXiv Detail & Related papers (2025-10-13T09:59:49Z) - PairSem: LLM-Guided Pairwise Semantic Matching for Scientific Document Retrieval [41.064644438540135]
Pairwise Semantic Matching (PairSem) is a framework that represents relevant semantics as entity-aspect pairs. Experiments on multiple datasets and retrievers demonstrate that PairSem significantly improves retrieval performance.
arXiv Detail & Related papers (2025-10-10T22:21:49Z) - Query Expansion in the Age of Pre-trained and Large Language Models: A Comprehensive Survey [21.764997953030857]
Modern information retrieval must reconcile ambiguous queries with diverse and dynamic corpora. We organize recent work along four complementary dimensions: the point of injection, grounding and interaction, learning and alignment, and knowledge-graph integration. The survey compares traditional and neural QE across seven aspects and maps applications in web search, biomedicine, e-commerce, open-domain question answering/RAG, conversational and code search, and cross-lingual settings.
arXiv Detail & Related papers (2025-09-09T14:31:11Z) - Reasoning-enhanced Query Understanding through Decomposition and Interpretation [87.56450566014625]
ReDI is a Reasoning-enhanced approach for query understanding through Decomposition and Interpretation. We compiled a large-scale dataset of real-world complex queries from a major search engine. Experiments on BRIGHT and BEIR demonstrate that ReDI consistently surpasses strong baselines in both sparse and dense retrieval paradigms.
arXiv Detail & Related papers (2025-09-08T10:58:42Z) - Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering [51.7493726399073]
We present a discourse-aware hierarchical framework to enhance long document question answering. The framework involves three key innovations: specialized discourse parsing for lengthy documents, LLM-based enhancement of discourse relation nodes, and structure-guided hierarchical retrieval.
arXiv Detail & Related papers (2025-05-26T14:45:12Z) - Improving Scientific Document Retrieval with Concept Coverage-based Query Set Generation [49.29180578078616]
Concept Coverage-based Query set Generation (CCQGen) is a framework designed to generate a set of queries with comprehensive coverage of the document's concepts. We identify concepts not sufficiently covered by previous queries, and leverage them as conditions for subsequent query generation. This approach guides each new query to complement the previous ones, aiding in a thorough understanding of the document.
arXiv Detail & Related papers (2025-02-16T15:59:50Z) - Taxonomy-guided Semantic Indexing for Academic Paper Search [51.07749719327668]
TaxoIndex is a semantic index framework for academic paper search.
It organizes key concepts from papers as a semantic index guided by an academic taxonomy.
It can be flexibly employed to enhance existing dense retrievers.
arXiv Detail & Related papers (2024-10-25T00:00:17Z) - Inferring Scientific Cross-Document Coreference and Hierarchy with Definition-Augmented Relational Reasoning [7.086262532457526]
We present a novel method which generates context-dependent definitions of concept mentions by retrieving full-text literature.
We further generate relational definitions, which describe how two concept mentions are related or different, and design an efficient re-ranking approach to address the explosion involved in inferring links across papers.
arXiv Detail & Related papers (2024-09-23T15:20:27Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - A Linguistically Driven Framework for Query Expansion via Grammatical Constituent Highlighting and Role-Based Concept Weighting [0.0]
Concepts-of-Interest are recognized as the core concepts that represent the gist of the search goal.
The remaining query constituents which serve to specify the search goal and complete the query structure are classified as descriptive, relational or structural.
arXiv Detail & Related papers (2020-04-25T01:43:00Z)
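The last entry above classifies query constituents by role: Concepts-of-Interest at the core, with descriptive, relational, and structural constituents around them. A minimal sketch of how such roles could translate into term weights is shown below; the role labels follow the abstract, but the numeric weights and the `weight_query` helper are illustrative assumptions, not the paper's actual scheme.

```python
# Hypothetical sketch of role-based concept weighting for a query.
# Weights are made-up values for illustration only.

ROLE_WEIGHTS = {
    "core": 1.0,         # Concepts-of-Interest: the gist of the search goal
    "descriptive": 0.6,  # constituents that specify the search goal
    "relational": 0.4,   # constituents relating concepts to each other
    "structural": 0.2,   # constituents that complete the query structure
}

def weight_query(tagged_terms: list[tuple[str, str]]) -> dict[str, float]:
    """Map each (term, role) pair to a weight based on its grammatical role."""
    return {term: ROLE_WEIGHTS.get(role, 0.0) for term, role in tagged_terms}

weights = weight_query([("retrieval", "core"),
                        ("scientific", "descriptive"),
                        ("for", "structural")])
```

A retriever could then score documents by these weights instead of treating all query terms uniformly, so matches on core concepts dominate matches on structural filler.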
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.