Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM-Guided Multi-Aspect Clustering
- URL: http://arxiv.org/abs/2509.19125v1
- Date: Tue, 23 Sep 2025 15:12:58 GMT
- Title: Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM-Guided Multi-Aspect Clustering
- Authors: Kun Zhu, Lizi Liao, Yuxuan Gu, Lei Huang, Xiaocheng Feng, Bing Qin
- Abstract summary: Existing taxonomy construction methods, leveraging unsupervised clustering or direct prompting of large language models, often lack coherence and granularity. We propose a novel context-aware hierarchical taxonomy generation framework that integrates LLM-guided multi-aspect encoding with dynamic clustering.
- Score: 59.54662810933882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid growth of scientific literature demands efficient methods to organize and synthesize research findings. Existing taxonomy construction methods, leveraging unsupervised clustering or direct prompting of large language models (LLMs), often lack coherence and granularity. We propose a novel context-aware hierarchical taxonomy generation framework that integrates LLM-guided multi-aspect encoding with dynamic clustering. Our method leverages LLMs to identify key aspects of each paper (e.g., methodology, dataset, evaluation) and generates aspect-specific paper summaries, which are then encoded and clustered along each aspect to form a coherent hierarchy. In addition, we introduce a new evaluation benchmark of 156 expert-crafted taxonomies encompassing 11.6k papers, providing the first naturally annotated dataset for this task. Experimental results demonstrate that our method significantly outperforms prior approaches, achieving state-of-the-art performance in taxonomy coherence, granularity, and interpretability.
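The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `llm_complete` is a placeholder for any chat-completion API, the prompt wording and helper names are hypothetical, and the paper's dynamic hierarchical clustering is simplified here to flat per-aspect k-means.
```python
# Minimal sketch: LLM-guided multi-aspect summarization, then per-aspect
# encoding and clustering. Prompts and helper names are illustrative.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

ASPECTS = ["methodology", "dataset", "evaluation"]  # aspects named in the abstract
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence encoder works here

def llm_complete(prompt: str) -> str:
    """Placeholder for a chat-completion call (OpenAI, local model, etc.)."""
    raise NotImplementedError

def aspect_summary(abstract: str, aspect: str) -> str:
    # Ask the LLM for a summary focused on one aspect of the paper.
    return llm_complete(
        f"Summarize the {aspect} of the following paper in one sentence.\n\n{abstract}"
    )

def cluster_by_aspect(abstracts: list[str], n_clusters: int = 5) -> dict[str, list[int]]:
    labels_per_aspect = {}
    for aspect in ASPECTS:
        summaries = [aspect_summary(a, aspect) for a in abstracts]
        embeddings = encoder.encode(summaries)           # aspect-specific vectors
        km = KMeans(n_clusters=n_clusters, n_init="auto").fit(embeddings)
        labels_per_aspect[aspect] = km.labels_.tolist()  # one partition per aspect
    return labels_per_aspect
```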
Related papers
- CE-GOCD: Central Entity-Guided Graph Optimization for Community Detection to Augment LLM Scientific Question Answering [36.76110608580489]
Large Language Models (LLMs) are increasingly used for question answering over scientific research papers. Existing retrieval augmentation methods often rely on isolated text chunks or concepts, but overlook deeper semantic connections between papers. We propose a method that augments LLMs' scientific question answering by explicitly modeling and leveraging semantic substructures within academic knowledge graphs.
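The abstract does not spell out CE-GOCD's graph optimization, so the snippet below only illustrates the underlying idea of retrieving semantically connected groups of papers, using the stock greedy-modularity community detection from networkx over a toy, hypothetical graph.
```python
# Not CE-GOCD itself: a generic community-detection pass over a toy paper
# graph, illustrating retrieval of semantically connected paper groups
# rather than isolated chunks. Edges here are hypothetical.
import networkx as nx
from networkx.algorithms import community

G = nx.Graph()
G.add_edges_from([
    ("paper_A", "paper_B"),  # e.g. a shared entity or citation link
    ("paper_B", "paper_C"),
    ("paper_D", "paper_E"),
])

# Each community is a frozenset of papers; a QA system could pull in a
# whole community once any one member matches the question.
for group in community.greedy_modularity_communities(G):
    print(sorted(group))
```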
arXiv Detail & Related papers (2026-01-29T13:53:44Z)
- SurveyG: A Multi-Agent LLM Framework with Hierarchical Citation Graph for Automated Survey Generation [4.512335376984058]
Large language models (LLMs) are increasingly adopted for automating survey paper generation. We propose SurveyG, an LLM-based agent framework that integrates a hierarchical citation graph. The graph is organized into three layers: Foundation, Development, and Frontier, capturing the evolution of research from seminal works to incremental advances and emerging directions.
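The abstract does not say how SurveyG assigns papers to the three layers, so the heuristic below (citation count plus recency, with made-up thresholds) is purely an assumption that illustrates the layering idea.
```python
# Illustrative layering of a citation graph into Foundation / Development /
# Frontier tiers. SurveyG's actual assignment criteria are not given in the
# abstract; this citation-and-recency heuristic is an assumption.
def assign_layer(year: int, citations: int,
                 old: int = 2018, seminal: int = 500) -> str:
    if citations >= seminal and year <= old:
        return "Foundation"   # heavily cited seminal work
    if year > 2023:
        return "Frontier"     # recent, emerging direction
    return "Development"      # incremental advances in between

papers = [("Attention Is All You Need", 2017, 100000),
          ("Some 2021 method", 2021, 300),
          ("Fresh 2025 preprint", 2025, 4)]
for title, year, cites in papers:
    print(title, "->", assign_layer(year, cites))
```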
arXiv Detail & Related papers (2025-10-09T03:14:20Z)
- Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization [49.61589046694085]
We propose a topic-guided reinforcement learning approach to improve content selection in multi-document summarization. We first show that explicitly prompting models with topic labels enhances the informativeness of the generated summaries.
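Since the reported gain comes from simply exposing topic labels in the prompt, such a prompt is easy to sketch; the wording below is illustrative, not the authors' template.
```python
# One way a topic-conditioned summarization prompt could be assembled;
# the wording is illustrative, not the authors' template.
def topic_guided_prompt(topic: str, documents: list[str]) -> str:
    joined = "\n\n---\n\n".join(documents)
    return (
        f"Topic: {topic}\n"
        "Summarize the documents below, covering only content relevant "
        f"to the topic above.\n\n{joined}"
    )

print(topic_guided_prompt("drug side effects", ["doc one ...", "doc two ..."]))
```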
arXiv Detail & Related papers (2025-09-11T21:01:54Z)
- A Hybrid AI Methodology for Generating Ontologies of Research Topics from Scientific Paper Corpora [6.384357773998868]
This paper presents Sci-OG, a semi-automated methodology for generating ontologies of research topics. We evaluate this approach against a range of alternative solutions using a dataset of 21,649 manually annotated semantic triples.
arXiv Detail & Related papers (2025-08-06T08:48:14Z)
- Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs). We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
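The core operation shared by the VQ variants the survey categorizes is a nearest-neighbor codebook lookup; the numpy-only sketch below shows it end to end with random stand-in data. Codebook collapse, one of the challenges named above, manifests as only a few of the indices ever being selected.
```python
# Core VQ operation: map each continuous vector to the index of its
# nearest codebook entry, yielding discrete token ids for an LLM pipeline.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))   # 512 discrete codes, 64-dim each
latents = rng.normal(size=(10, 64))     # encoder outputs to be tokenized

# Squared Euclidean distance from every latent to every code.
d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = d2.argmin(axis=1)              # discrete token ids, shape (10,)
quantized = codebook[tokens]            # vectors passed downstream
print(tokens)
```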
arXiv Detail & Related papers (2025-07-21T10:52:14Z)
- Science Hierarchography: Hierarchical Organization of Science Literature [20.182213614072836]
We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure. We develop a hybrid approach that combines efficient embedding-based clustering with LLM-based prompting. Results show that our method improves interpretability and offers an alternative pathway for exploring scientific literature.
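A minimal version of this hybrid recipe, assuming a sentence-transformers encoder and scikit-learn's agglomerative clustering, with a placeholder where the paper would call an LLM to name each node; this is not the authors' implementation.
```python
# Hybrid recipe sketch: agglomerative clustering over sentence embeddings
# gives the hierarchy; an LLM (placeholder below) would name each node.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

abstracts = ["a paper about protein folding ...",
             "a paper about LLM reasoning ...",
             "a paper about enzyme design ..."]
X = SentenceTransformer("all-MiniLM-L6-v2").encode(abstracts)

# One flat cut of the dendrogram; repeating with different n_clusters
# (or recursing into each cluster) yields the deeper levels.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

def name_cluster(texts: list[str]) -> str:
    """Placeholder for an LLM call that produces a short node label."""
    return "UNNAMED"

for c in sorted(set(labels)):
    members = [a for a, l in zip(abstracts, labels) if l == c]
    print(name_cluster(members), "<-", len(members), "papers")
```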
arXiv Detail & Related papers (2025-04-18T17:59:29Z)
- Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
arXiv Detail & Related papers (2025-03-28T14:08:40Z)
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
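Of the two families the framework covers, the encoding-based one is easy to sketch: represent each edit with a frozen encoder and train a light classifier on top. Everything below (the `[SEP]` pairing, the toy edit-intent labels) is illustrative, not the paper's exact setup.
```python
# Encoding-based classification sketch for edit intents: a frozen sentence
# encoder featurizes each (old, new) pair, logistic regression classifies.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

pairs = [("old sentence", "new sentence with a fixed typo"),
         ("old claim", "new claim with added evidence")]
labels = ["grammar", "content"]  # toy labels, not the paper's scheme

enc = SentenceTransformer("all-MiniLM-L6-v2")
# Encode the edit as "old [SEP] new" so the model sees both versions.
X = enc.encode([f"{o} [SEP] {n}" for o, n in pairs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```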
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- Taxonomy Tree Generation from Citation Graph [15.188580557890942]
HiGTL is a novel end-to-end framework guided by human-provided instructions or preferred topics. We develop a taxonomy node verbalization strategy that iteratively generates central concepts for each cluster. Experiments demonstrate that HiGTL effectively produces coherent, high-quality taxonomies.
arXiv Detail & Related papers (2024-10-02T13:02:03Z)
- Text Clustering as Classification with LLMs [9.128151647718251]
We propose a novel framework that reframes text clustering as a classification task by harnessing the in-context learning capabilities of Large Language Models (LLMs). By leveraging the advanced natural language understanding and generalization capabilities of LLMs, the proposed approach enables effective clustering with minimal human intervention. Experimental results on diverse datasets demonstrate that our framework achieves comparable or superior performance to state-of-the-art embedding-based clustering techniques.
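The reframing amounts to: instead of distance-based grouping, first ask the LLM to propose a label set from a sample of the texts, then classify every text against that set. In the sketch below, `llm` is a placeholder chat call and the prompts are hypothetical.
```python
# Clustering-as-classification sketch: (1) LLM proposes a label set from
# a sample, (2) LLM assigns every text to one label. Prompts illustrative.
def llm(prompt: str) -> str:
    raise NotImplementedError  # plug in any chat-completion API

def cluster_as_classification(texts: list[str]) -> dict[str, str]:
    sample = "\n".join(texts[:20])
    label_line = llm("Propose up to 5 short category labels, comma-"
                     f"separated, for these texts:\n{sample}")
    labels = [label.strip() for label in label_line.split(",")]
    assignments = {}
    for t in texts:
        assignments[t] = llm(
            f"Classify the text into exactly one of {labels}.\n"
            f"Text: {t}\nAnswer with the label only."
        )
    return assignments
```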
arXiv Detail & Related papers (2024-09-30T16:57:34Z)
- Text Clustering with Large Language Model Embeddings [0.0]
The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms. Recent advancements in large language models (LLMs) have the potential to enhance this task. Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
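A common way to back such a comparison is to cluster the same documents under each embedding source and compare an intrinsic score such as silhouette; the synthetic matrices below stand in for two embedding pipelines.
```python
# Compare two embedding sources by clustering each and scoring with the
# silhouette metric. Synthetic data stands in for real embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def clustering_quality(X, k: int = 3) -> float:
    labels = KMeans(n_clusters=k, n_init="auto").fit_predict(X)
    return silhouette_score(X, labels)

rng = np.random.default_rng(0)
X_a = rng.normal(size=(60, 32))                   # stand-in: weaker embeddings
X_b = X_a + np.repeat(np.eye(3, 32) * 4, 20, 0)   # stand-in: better-separated ones
print(clustering_quality(X_a), clustering_quality(X_b))
```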
arXiv Detail & Related papers (2024-03-22T11:08:48Z)
- Incremental hierarchical text clustering methods: a review [49.32130498861987]
This study aims to analyze various hierarchical and incremental clustering techniques.
The main contribution of this research is the organization and comparison of the techniques used by studies published between 2010 and 2018 that aimed at clustering text documents.
arXiv Detail & Related papers (2023-12-12T22:27:29Z)