Advances in Pre-trained Language Models for Domain-Specific Text Classification: A Systematic Review
- URL: http://arxiv.org/abs/2510.17892v1
- Date: Sat, 18 Oct 2025 22:46:53 GMT
- Title: Advances in Pre-trained Language Models for Domain-Specific Text Classification: A Systematic Review
- Authors: Zhyar Rzgar K. Rostam, Gábor Kertész
- Abstract summary: Large language models (LLMs) have achieved remarkable success in NLP. Their accuracy can suffer in domain-specific contexts due to specialized vocabulary, unique grammatical structures, and imbalanced data distributions. We investigate the utilization of pre-trained language models (PLMs) for domain-specific text classification.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The exponential increase in scientific literature and online information necessitates efficient methods for extracting knowledge from textual data. Natural language processing (NLP) plays a crucial role in addressing this challenge, particularly in text classification tasks. While large language models (LLMs) have achieved remarkable success in NLP, their accuracy can suffer in domain-specific contexts due to specialized vocabulary, unique grammatical structures, and imbalanced data distributions. In this systematic literature review (SLR), we investigate the utilization of pre-trained language models (PLMs) for domain-specific text classification. We systematically review 41 articles published between 2018 and January 2024, adhering to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement. This review methodology involved rigorous inclusion criteria and a multi-step selection process employing AI-powered tools. We delve into the evolution of text classification techniques and differentiate between traditional and modern approaches. We emphasize transformer-based models and explore the challenges and considerations associated with using LLMs for domain-specific text classification. Furthermore, we categorize existing research based on various PLMs and propose a taxonomy of techniques used in the field. To validate our findings, we conducted a comparative experiment involving BERT, SciBERT, and BioBERT in biomedical sentence classification. Finally, we present a comparative study on the performance of LLMs in text classification tasks across different domains. In addition, we examine recent advancements in PLMs for domain-specific text classification and offer insights into future directions and limitations in this rapidly evolving domain.
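A minimal sketch of such a comparative run, assuming Hugging Face Transformers and `datasets`-style splits with `text` and `label` columns; the hyperparameters, label count, and dataset are placeholders, not the authors' exact setup:

```python
# Sketch: fine-tune and evaluate several BERT variants on one
# sentence-classification dataset. Checkpoint names are public Hub IDs;
# everything else (epochs, batch size, num_labels) is an assumption.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINTS = {
    "BERT": "bert-base-uncased",
    "SciBERT": "allenai/scibert_scivocab_uncased",
    "BioBERT": "dmis-lab/biobert-base-cased-v1.1",
}

def fine_tune_and_eval(checkpoint, train_ds, eval_ds, num_labels=2):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=128)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=f"runs/{checkpoint.split('/')[-1]}",
            num_train_epochs=3,
            per_device_train_batch_size=16),
        train_dataset=train_ds.map(tokenize, batched=True),
        eval_dataset=eval_ds.map(tokenize, batched=True),
        tokenizer=tokenizer,  # default collator then pads dynamically
    )
    trainer.train()
    return trainer.evaluate()

# results = {name: fine_tune_and_eval(ckpt, train_ds, eval_ds)
#            for name, ckpt in CHECKPOINTS.items()}
```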
Related papers
- How Do LLM-Generated Texts Impact Term-Based Retrieval Models? [76.92519309816008]
This paper investigates the influence of large language models (LLMs) on term-based retrieval models.
Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes.
Our study further explores whether term-based retrieval models demonstrate source bias, concluding that these models prioritize documents whose term distributions closely correspond to those of the queries.
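A Zipf slope can be estimated with a least-squares fit of log frequency against log rank; a minimal sketch (whitespace tokenization and the rank cutoff are simplifying assumptions, not the paper's protocol):

```python
import numpy as np
from collections import Counter

def zipf_slope(tokens, max_rank=1000):
    # Term frequencies sorted from most to least frequent.
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True),
                     dtype=float)[:max_rank]
    ranks = np.arange(1, len(freqs) + 1)
    # Slope of the rank/frequency line in log-log space; closer to 0 is
    # "smoother", more negative is "steeper".
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

print(zipf_slope("the cat sat on the mat and the dog sat by the mat".split()))
```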
arXiv Detail & Related papers (2025-08-25T06:43:27Z)
- Large Language Models for Healthcare Text Classification: A Systematic Review [4.8342038441006805]
Large Language Models (LLMs) have fundamentally transformed approaches to Natural Language Processing (NLP).
In healthcare, accurate and cost-efficient text classification is crucial, whether for clinical notes analysis, diagnosis coding, or any other task.
Numerous studies have been conducted to leverage LLMs for automated healthcare text classification.
arXiv Detail & Related papers (2025-03-03T04:16:13Z)
- The Text Classification Pipeline: Starting Shallow going Deeper [4.97309503788908]
The past decade has seen deep learning revolutionize text classification.
English is the predominant language of focus, despite studies involving Arabic, Chinese, Hindi, and others.
This work integrates traditional and contemporary text mining methodologies, fostering a holistic understanding of text classification.
arXiv Detail & Related papers (2024-12-30T23:01:19Z)
- Fine-Tuning Large Language Models for Scientific Text Classification: A Comparative Study [0.42970700836450487]
Large Language Models (LLMs) based on transformer architectures have shown significant success in natural language processing (NLP) tasks.
However, general-purpose LLMs often struggle with domain-specific content, such as scientific texts, due to unique challenges like specialized vocabulary and imbalanced data.
In this study, we fine-tune four state-of-the-art LLMs (BERT, SciBERT, BioBERT, and BlueBERT) on three datasets to evaluate their performance in scientific text classification.
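For the imbalanced-data issue this study highlights, one common remedy is inverse-frequency class weighting of the training loss; a minimal PyTorch sketch with made-up class counts:

```python
import torch
import torch.nn as nn

# Hypothetical label distribution: class 2 is heavily under-represented.
label_counts = torch.tensor([900.0, 80.0, 20.0])
weights = label_counts.sum() / (len(label_counts) * label_counts)
loss_fn = nn.CrossEntropyLoss(weight=weights)  # rare classes weigh more

logits = torch.randn(4, 3)            # a dummy batch of model outputs
targets = torch.tensor([0, 2, 1, 0])  # gold labels
print(loss_fn(logits, targets))
```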
arXiv Detail & Related papers (2024-11-27T18:58:53Z)
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
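The two families the framework covers can be sketched side by side; the models and the three-way label set below are illustrative stand-ins, not the paper's EIC setup:

```python
import torch
from transformers import (AutoModelForSeq2SeqLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

text = "Reordered two paragraphs to improve the flow of the argument."

# Encoding-based: a classification head over the encoder's pooled output.
# (The head here is randomly initialized; in practice it is fine-tuned.)
enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)
with torch.no_grad():
    logits = enc_model(**enc_tok(text, return_tensors="pt")).logits
print("encoding-based label id:", logits.argmax(-1).item())

# Generation-based: the model writes the label out as text.
gen_tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
gen_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
prompt = f"Classify the edit intent as grammar, clarity, or structure: {text}"
out = gen_model.generate(**gen_tok(prompt, return_tensors="pt"),
                         max_new_tokens=5)
print("generation-based label:",
      gen_tok.decode(out[0], skip_special_tokens=True))
```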
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- SciPrompt: Knowledge-augmented Prompting for Fine-grained Categorization of Scientific Topics [2.3742710594744105]
We introduce SciPrompt, a framework designed to automatically retrieve scientific topic-related terms for low-resource text classification tasks.
Our method outperforms state-of-the-art, prompt-based fine-tuning methods on scientific text classification tasks under few and zero-shot settings.
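Prompt-based classification of this kind usually scores label words inside a cloze template with a masked LM; a bare-bones sketch in that spirit (SciPrompt additionally retrieves and filters domain terms, which is omitted here):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
template = "This paper is about [MASK]. {text}"
label_words = ["physics", "biology", "chemistry"]  # hypothetical verbalizer

text = "We study protein folding dynamics in living cells."
# Restrict the masked-LM's predictions to the label words, pick the best.
preds = fill(template.format(text=text), targets=label_words)
best = max(preds, key=lambda p: p["score"])
print(best["token_str"], round(best["score"], 4))
```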
arXiv Detail & Related papers (2024-10-02T18:45:04Z)
- AHAM: Adapt, Help, Ask, Model -- Harvesting LLMs for literature mining [3.8384235322772864]
We present the AHAM methodology and a metric that guides the domain-specific adaptation of the BERTopic topic modeling framework.
By utilizing the LLaMa2 generative language model, we generate topic definitions via one-shot learning.
For inter-topic similarity evaluation, we leverage metrics from language generation and translation processes.
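BERTopic ships a text-generation representation hook that matches this recipe; a small sketch with a lightweight generator standing in for LLaMa2 (the prompt and corpus are illustrative, not AHAM's one-shot setup):

```python
from bertopic import BERTopic
from bertopic.representation import TextGeneration
from sklearn.datasets import fetch_20newsgroups
from transformers import pipeline

# Stand-in generator; AHAM uses LLaMa2 with a one-shot prompt instead.
generator = pipeline("text2text-generation", model="google/flan-t5-small")
prompt = "Give a short topic label for these documents: [DOCUMENTS]"
rep_model = TextGeneration(generator, prompt=prompt)

# BERTopic needs a corpus of at least a few hundred documents.
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:1000]
topic_model = BERTopic(representation_model=rep_model)
topics, _ = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```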
arXiv Detail & Related papers (2023-12-25T18:23:03Z)
- A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning.
This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021.
We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
arXiv Detail & Related papers (2020-08-02T00:09:03Z)
- Deep Learning Based Text Classification: A Comprehensive Review [75.8403533775179]
We provide a review of more than 150 deep learning based models for text classification developed in recent years.
We also provide a summary of more than 40 popular datasets widely used for text classification.
arXiv Detail & Related papers (2020-04-06T02:00:30Z)
- Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
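The core observation is easy to reproduce: mean-pool token embeddings from a pre-trained encoder and cluster them (this sketch substitutes BERT and k-means for the paper's exact models and clustering method):

```python
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "The patient was administered 5 mg of warfarin daily.",  # medical
    "Dosage was titrated according to the INR results.",     # medical
    "The court dismissed the appeal for lack of standing.",  # legal
    "The defendant waived the right to a jury trial.",       # legal
]

with torch.no_grad():
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)   # ignore padding positions
    emb = (hidden * mask).sum(1) / mask.sum(1)     # mean pooling

# Sentences from the same domain should land in the same cluster.
print(KMeans(n_clusters=2, n_init=10).fit_predict(emb.numpy()))
```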
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [64.22926988297685]
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
In this paper, we explore the landscape of introducing transfer learning techniques for NLP by a unified framework that converts all text-based language problems into a text-to-text format.
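In this format every task, classification included, becomes a string-to-string mapping; with an original T5 checkpoint (pre-trained on the GLUE mixture with task prefixes) this works out of the box:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Sentiment classification expressed as text-to-text: prefix + sentence in,
# label word out. "sst2 sentence:" is one of T5's pre-training prefixes.
inputs = tok("sst2 sentence: this movie was surprisingly good",
             return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=3)
print(tok.decode(out[0], skip_special_tokens=True))  # expected: "positive"
```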
arXiv Detail & Related papers (2019-10-23T17:37:36Z)