Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents
- URL: http://arxiv.org/abs/2601.08841v1
- Date: Fri, 19 Dec 2025 20:17:34 GMT
- Title: Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents
- Authors: Mihael Arcan
- Abstract summary: We explore how structured knowledge, specifically subject-predicate-object triples, can enhance the clustering and classification of scientific papers. We propose a modular pipeline that combines unsupervised clustering and supervised classification over multiple document representations. Our results show that full abstract text yields the most coherent clusters, but that hybrid representations incorporating triples consistently improve classification performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing volume and complexity of scientific literature demand robust methods for organizing and understanding research documents. In this study, we explore how structured knowledge, specifically subject-predicate-object triples, can enhance the clustering and classification of scientific papers. We propose a modular pipeline that combines unsupervised clustering and supervised classification over multiple document representations: raw abstracts, extracted triples, and hybrid formats that integrate both. Using a filtered arXiv corpus, we extract relational triples from abstracts and construct four text representations, which we embed using four state-of-the-art transformer models: MiniLM, MPNet, SciBERT, and SPECTER. We evaluate the resulting embeddings with KMeans, GMM, and HDBSCAN for unsupervised clustering, and fine-tune classification models for arXiv subject prediction. Our results show that full abstract text yields the most coherent clusters, but that hybrid representations incorporating triples consistently improve classification performance, reaching up to 92.6% accuracy and 0.925 macro-F1. We also find that lightweight sentence encoders (MiniLM, MPNet) outperform domain-specific models (SciBERT, SPECTER) in clustering, while SciBERT excels in structured-input classification. These findings highlight the complementary benefits of combining unstructured text with structured knowledge, offering new insights into knowledge-infused representations for semantic organization of scientific documents.
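As a rough illustration of the pipeline the abstract describes, the sketch below builds hybrid document representations (abstract text concatenated with linearized triples) and clusters them with KMeans. It is a minimal sketch, not the authors' implementation: the toy corpus, the hand-written triples, and the linearization format are assumptions, and TF-IDF stands in for the transformer embeddings (MiniLM, MPNet, SciBERT, SPECTER) used in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus: abstracts paired with (subject, predicate, object) triples.
# In the paper, triples are extracted automatically from arXiv abstracts;
# these hand-written examples are placeholders.
docs = [
    ("We study graph neural networks for molecules.",
     [("graph neural networks", "applied to", "molecules")]),
    ("Transformers improve machine translation quality.",
     [("transformers", "improve", "machine translation")]),
    ("Graph neural networks predict properties of molecules.",
     [("graph neural networks", "predict", "molecule properties")]),
    ("Neural machine translation uses attention mechanisms.",
     [("attention", "used in", "machine translation")]),
]

def linearize(triples):
    # Flatten each triple into a pseudo-sentence; one plausible hybrid format.
    return " ".join(" ".join(t) for t in triples)

# Hybrid representation: raw abstract followed by its linearized triples.
hybrid = [f"{abstract} {linearize(triples)}" for abstract, triples in docs]

# TF-IDF stands in here for the sentence-transformer embeddings of the paper.
embeddings = TfidfVectorizer().fit_transform(hybrid)

# Unsupervised clustering of the embedded documents.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # one cluster id per document
```

Swapping the TF-IDF step for a real sentence encoder (and the KMeans step for GMM or HDBSCAN) recovers the other configurations the paper compares.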
Related papers
- Deep Taxonomic Networks for Unsupervised Hierarchical Prototype Discovery [5.300910554558862]
Existing methods often tie the structure to the number of classes and underutilize the rich prototype information available at intermediate hierarchical levels. We introduce deep taxonomic networks, a novel deep latent variable approach designed to bridge these gaps.
arXiv Detail & Related papers (2025-09-28T03:13:32Z) - Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM-Guided Multi-Aspect Clustering [59.54662810933882]
Existing taxonomy construction methods, leveraging unsupervised clustering or direct prompting of large language models, often lack coherence and granularity. We propose a novel context-aware hierarchical taxonomy generation framework that integrates LLM-guided multi-aspect encoding with dynamic clustering.
arXiv Detail & Related papers (2025-09-23T15:12:58Z) - HERCULES: Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization [0.0]
HERCULES is an algorithm and Python package designed for hierarchical k-means clustering of diverse data types. It generates semantically rich titles and descriptions for clusters at each level of the hierarchy. An interactive visualization tool facilitates thorough analysis and understanding of the clustering results.
arXiv Detail & Related papers (2025-06-24T20:22:00Z) - How Compositional Generalization and Creativity Improve as Diffusion Models are Trained [82.08869888944324]
How many samples do generative models need in order to learn composition rules? What signal in the data is exploited to learn those rules? We discuss connections between the hierarchical clustering mechanism we introduce here and the renormalization group in physics.
arXiv Detail & Related papers (2025-02-17T18:06:33Z) - Information-Theoretic Generative Clustering of Documents [24.56214029342293]
We present generative clustering (GC) for clustering a set of documents, $\mathrm{X}$. Because large language models (LLMs) provide probability distributions, the similarity between two documents can be rigorously defined. We show GC achieves state-of-the-art performance, outperforming any previous clustering method, often by a large margin.
arXiv Detail & Related papers (2024-12-18T06:21:21Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - Empowering Interdisciplinary Research with BERT-Based Models: An Approach Through SciBERT-CNN with Topic Modeling [0.0]
This paper introduces a novel approach using the SciBERT model and CNNs to systematically categorize academic abstracts.
The CNN uses convolution and pooling to enhance feature extraction and reduce dimensionality.
arXiv Detail & Related papers (2024-04-16T05:21:47Z) - Text Clustering with Large Language Model Embeddings [0.0]
The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms. Recent advancements in large language models (LLMs) have the potential to enhance this task. Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
arXiv Detail & Related papers (2024-03-22T11:08:48Z) - Group Collaborative Learning for Co-Salient Object Detection [152.67721740487937]
We present a novel group collaborative learning framework (GCoNet) capable of detecting co-salient objects in real time (16ms).
Extensive experiments on three challenging benchmarks, i.e., CoCA, CoSOD3k, and Cosal2015, demonstrate that our simple GCoNet outperforms 10 cutting-edge models and achieves the new state-of-the-art.
arXiv Detail & Related papers (2021-03-15T13:16:03Z) - Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.