Clustering scientific publications: lessons learned through experiments with a real citation network
- URL: http://arxiv.org/abs/2505.18180v1
- Date: Thu, 15 May 2025 14:27:53 GMT
- Title: Clustering scientific publications: lessons learned through experiments with a real citation network
- Authors: Vu Thi Huong, Thorsten Koch,
- Abstract summary: Graph-based clustering methods, such as spectral, Louvain, and Leiden algorithms, are frequently utilized due to their capacity to model citation networks.<n>This study evaluates the performance of these clustering algorithms on a citation graph comprising approx. 700,000 papers and 4.6 million citations extracted from Web of Science.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustering scientific publications can reveal underlying research structures within bibliographic databases. Graph-based clustering methods, such as spectral, Louvain, and Leiden algorithms, are frequently utilized due to their capacity to effectively model citation networks. However, their performance may degrade when applied to real-world data. This study evaluates the performance of these clustering algorithms on a citation graph comprising approx. 700,000 papers and 4.6 million citations extracted from Web of Science. The results show that while scalable methods like Louvain and Leiden perform efficiently, their default settings often yield poor partitioning. Meaningful outcomes require careful parameter tuning, especially for large networks with uneven structures, including a dense core and loosely connected papers. These findings highlight practical lessons about the challenges of large-scale data, method selection and tuning based on specific structures of bibliometric clustering tasks.
Related papers
- Author-Specific Linguistic Patterns Unveiled: A Deep Learning Study on Word Class Distributions [0.0]
This study investigates author-specific word class distributions using part-of-speech (POS) tagging and bigram analysis.<n>By leveraging deep neural networks, we classify literary authors based on POS tag vectors and bigram frequency matrices derived from their works.
arXiv Detail & Related papers (2025-01-17T09:43:49Z) - Text Clustering with Large Language Model Embeddings [0.0]
The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms.<n>Recent advancements in large language models (LLMs) have the potential to enhance this task.<n>Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
arXiv Detail & Related papers (2024-03-22T11:08:48Z) - XAI for Self-supervised Clustering of Wireless Spectrum Activity [0.5809784853115825]
We propose a methodology for explaining deep clustering, self-supervised learning architectures.
For the representation learning part, our methodology employs Guided Backpropagation to interpret the regions of interest of the input data.
For the clustering part, the methodology relies on Shallow Trees to explain the clustering result.
Finally, a data-specific visualizations part enables connection for each of the clusters to the input data trough the relevant features.
arXiv Detail & Related papers (2023-05-17T08:56:43Z) - On Learning the Structure of Clusters in Graphs [3.8073142980733]
In many real-world applications, we find that the clusters have a significant high-level structure.
This is often overlooked in the design and analysis of graph clustering algorithms.
This thesis addresses the natural question of whether the structure of clusters can be learned efficiently.
arXiv Detail & Related papers (2022-12-29T15:26:19Z) - Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z) - Hub-aware Random Walk Graph Embedding Methods for Classification [44.99833362998488]
We propose two novel graph embedding algorithms based on random walks that are specifically designed for the node classification problem.
The proposed methods are experimentally evaluated by analyzing the classification performance of three classification algorithms trained on embeddings of real-world networks.
arXiv Detail & Related papers (2022-09-15T20:41:18Z) - Deep Graph Learning for Anomalous Citation Detection [55.81334139806342]
We propose a novel deep graph learning model, namely GLAD (Graph Learning for Anomaly Detection), to identify anomalies in citation networks.
Within the GLAD framework, we propose an algorithm called CPU (Citation PUrpose) to discover the purpose of citation based on citation texts.
arXiv Detail & Related papers (2022-02-23T09:05:28Z) - Effective and Efficient Graph Learning for Multi-view Clustering [173.8313827799077]
We propose an effective and efficient graph learning model for multi-view clustering.
Our method exploits the view-similar between graphs of different views by the minimization of tensor Schatten p-norm.
Our proposed algorithm is time-economical and obtains the stable results and scales well with the data size.
arXiv Detail & Related papers (2021-08-15T13:14:28Z) - Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z) - Higher-Order Spectral Clustering of Directed Graphs [8.997952791113232]
Clustering is an important topic in algorithms, and has a number of applications in machine learning, computer vision, statistics, and several other research disciplines.
We present a nearly-linear time algorithm for digraph clustering, and show that our proposed algorithm can be implemented in sublinear time under reasonable assumptions.
arXiv Detail & Related papers (2020-11-10T13:06:37Z) - Be More with Less: Hypergraph Attention Networks for Inductive Text
Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated their promising results on this canonical task.
Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interaction between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z) - Structured Graph Learning for Clustering and Semi-supervised
Classification [74.35376212789132]
We propose a graph learning framework to preserve both the local and global structure of data.
Our method uses the self-expressiveness of samples to capture the global structure and adaptive neighbor approach to respect the local structure.
Our model is equivalent to a combination of kernel k-means and k-means methods under certain condition.
arXiv Detail & Related papers (2020-08-31T08:41:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.