SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM
- URL: http://arxiv.org/abs/2508.20514v2
- Date: Fri, 07 Nov 2025 02:06:48 GMT
- Title: SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM
- Authors: Pengjiang Li, Zaitian Wang, Xinhao Zhang, Ran Zhang, Lu Jiang, Pengfei Wang, Yuanchun Zhou,
- Abstract summary: We propose an advanced topic discovery method enhanced by large language models (LLMs) to improve scientific topic identification.<n> Specifically, we build a textual encoder to capture the content from scientific publications, including metadata, title, and abstract.<n>We then construct a space optimization module that integrates entropy-based sampling and triplet tasks guided by LLMs.<n>Experiments conducted on three real-world datasets demonstrate that SciTopic outperforms the state-of-the-art (SOTA) scientific topic discovery methods.
- Score: 19.949137890090814
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Topic discovery in scientific literature provides valuable insights for researchers to identify emerging trends and explore new avenues for investigation, facilitating easier scientific information retrieval. Many machine learning methods, particularly deep embedding techniques, have been applied to discover research topics. However, most existing topic discovery methods rely on word embedding to capture the semantics and lack a comprehensive understanding of scientific publications, struggling with complex, high-dimensional text relationships. Inspired by the exceptional comprehension of textual information by large language models (LLMs), we propose an advanced topic discovery method enhanced by LLMs to improve scientific topic identification, namely SciTopic. Specifically, we first build a textual encoder to capture the content from scientific publications, including metadata, title, and abstract. Next, we construct a space optimization module that integrates entropy-based sampling and triplet tasks guided by LLMs, enhancing the focus on thematic relevance and contextual intricacies between ambiguous instances. Then, we propose to fine-tune the textual encoder based on the guidance from the LLMs by optimizing the contrastive loss of the triplets, forcing the text encoder to better discriminate instances of different topics. Finally, extensive experiments conducted on three real-world datasets of scientific publications demonstrate that SciTopic outperforms the state-of-the-art (SOTA) scientific topic discovery methods, enabling researchers to gain deeper and faster insights.
Related papers
- Intelligent Scientific Literature Explorer using Machine Learning (ISLE) [0.797970449705065]
This paper presents an integrated system for scientific literature exploration that combines large-scale data acquisition, hybrid retrieval, semantic topic modeling, and heterogeneous knowledge graph construction.<n>The proposed framework contributes a foundation for AI-assisted scientific discovery.
arXiv Detail & Related papers (2025-12-14T16:54:24Z) - SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature [52.36039386997026]
We introduce SciRAG, an open-source framework for scientific literature exploration.<n>We introduce three key innovations: (1) adaptive retrieval that flexibly alternates between sequential and parallel evidence gathering; (2) citation-aware symbolic reasoning that leverages citation graphs to organize and filter documents; and (3) outline-guided synthesis that plans, critiques, and refines answers to ensure coherence and transparent attribution.
arXiv Detail & Related papers (2025-11-18T11:09:19Z) - Science Hierarchography: Hierarchical Organization of Science Literature [20.182213614072836]
We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure.<n>We develop a hybrid approach that combines efficient embedding-based clustering with LLM-based prompting.<n>Results show that our method improves interpretability and offers an alternative pathway for exploring scientific literature.
arXiv Detail & Related papers (2025-04-18T17:59:29Z) - SciPIP: An LLM-based Scientific Paper Idea Proposer [30.670219064905677]
We introduce SciPIP, an innovative framework designed to enhance the proposal of scientific ideas through improvements in both literature retrieval and idea generation.<n>Our experiments, conducted across various domains such as natural language processing and computer vision, demonstrate SciPIP's capability to generate a multitude of innovative and useful ideas.
arXiv Detail & Related papers (2024-10-30T16:18:22Z) - SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding [22.131371019641417]
Despite Large Language Models' success, they face challenges in scientific literature understanding.<n>We propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT)<n>We present a suite of LLMs: SciLitLLM, specialized in scientific literature understanding.
arXiv Detail & Related papers (2024-08-28T05:41:52Z) - pathfinder: A Semantic Framework for Literature Review and Knowledge Discovery in Astronomy [2.6952253149772996]
Pathfinder is a machine learning framework designed to enable literature review and knowledge discovery in astronomy.
Our framework couples advanced retrieval techniques with LLM-based synthesis to search astronomical literature by semantic context.
It addresses complexities of jargon, named entities, and temporal aspects through time-based and citation-based weighting schemes.
arXiv Detail & Related papers (2024-08-02T20:05:24Z) - SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions [52.35520385083425]
We present SciDMT, an enhanced and expanded corpus for scientific mention detection.
The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mention annotations in the format of in-text span, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes.
arXiv Detail & Related papers (2024-06-20T22:03:21Z) - A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z) - Scientific Large Language Models: A Survey on Biological & Chemical Domains [47.97810890521825]
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension.
The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines.
As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration.
arXiv Detail & Related papers (2024-01-26T05:33:34Z) - SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap.
We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections.
This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
arXiv Detail & Related papers (2024-01-24T14:23:12Z) - CitationIE: Leveraging the Citation Graph for Scientific Information
Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.