Automated Research Article Classification and Recommendation Using NLP and ML
- URL: http://arxiv.org/abs/2510.05495v1
- Date: Tue, 07 Oct 2025 01:24:35 GMT
- Title: Automated Research Article Classification and Recommendation Using NLP and ML
- Authors: Shadikur Rahman, Hasibul Karim Shanto, Umme Ayman Koana, Syed Muhammad Danish,
- Abstract summary: This paper presents an automated framework for research article classification and recommendation.<n>We use a large-scale arXiv.org dataset spanning more than three decades.<n>To complement classification, we incorporate a recommendation module based on the cosine similarity of vectorized articles.
- Score: 0.5486463492959637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the digital era, the exponential growth of scientific publications has made it increasingly difficult for researchers to efficiently identify and access relevant work. This paper presents an automated framework for research article classification and recommendation that leverages Natural Language Processing (NLP) techniques and machine learning. Using a large-scale arXiv.org dataset spanning more than three decades, we evaluate multiple feature extraction approaches (TF--IDF, Count Vectorizer, Sentence-BERT, USE, Mirror-BERT) in combination with diverse machine learning classifiers (Logistic Regression, SVM, Na\"ive Bayes, Random Forest, Gradient Boosted Trees, and k-Nearest Neighbour). Our experiments show that Logistic Regression with TF--IDF consistently yields the best classification performance, achieving an accuracy of 69\%. To complement classification, we incorporate a recommendation module based on the cosine similarity of vectorized articles, enabling efficient retrieval of related research papers. The proposed system directly addresses the challenge of information overload in digital libraries and demonstrates a scalable, data-driven solution to support literature discovery.
Related papers
- Blog Data Showdown: Machine Learning vs Neuro-Symbolic Models for Gender Classification [0.0]
This study presents a comparative analysis of the widely used machine learning algorithms, namely Support Vector Machines (SVM), Naive Bayes (NB), Logistic Regression (LR), AdaBoost, XGBoost, and an SVM variant (SVM_R) with neuro-symbolic AI (NeSy)<n>The experimental results show that the use of the NeSy approach matched strong results despite a limited dataset.
arXiv Detail & Related papers (2025-12-18T15:53:10Z) - Comparison of Machine Learning Models to Classify Documents on Digital Development [0.0]
This study employs a publicly available document database on worldwide digital development interventions categorised under twelve areas.<n>The study concludes that the amount of data is not the sole factor affecting the performance; features like similarity within classes and dissimilarity among classes are also crucial.
arXiv Detail & Related papers (2025-10-01T09:53:28Z) - MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [48.73595915402094]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic.<n>Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - Enriched BERT Embeddings for Scholarly Publication Classification [0.13654846342364302]
The NSLP 2024 FoRC Task I addresses this challenge organized as a competition.
The goal is to develop a classifier capable of predicting one of 123 predefined classes from the Open Research Knowledge Graph (ORKG) taxonomy of research fields for a given article.
arXiv Detail & Related papers (2024-05-07T09:05:20Z) - Empowering Interdisciplinary Research with BERT-Based Models: An Approach Through SciBERT-CNN with Topic Modeling [0.0]
This paper introduces a novel approach using the SciBERT model and CNNs to systematically categorize academic abstracts.
The CNN uses convolution and pooling to enhance feature extraction and reduce dimensionality.
arXiv Detail & Related papers (2024-04-16T05:21:47Z) - Tuning Traditional Language Processing Approaches for Pashto Text
Classification [0.0]
The main aim of this study is to establish a Pashto automatic text classification system.
This study compares several models containing both statistical and neural network machine learning techniques.
This research obtained average testing accuracy rate 94% using classification algorithm and TFIDF feature extraction method.
arXiv Detail & Related papers (2023-05-04T22:57:45Z) - Application of Transformers based methods in Electronic Medical Records:
A Systematic Literature Review [77.34726150561087]
This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks.
arXiv Detail & Related papers (2023-04-05T22:19:42Z) - Autoregressive Search Engines: Generating Substrings as Document
Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z) - Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.