The Application of Active Query K-Means in Text Classification
- URL: http://arxiv.org/abs/2107.07682v1
- Date: Fri, 16 Jul 2021 03:06:35 GMT
- Title: The Application of Active Query K-Means in Text Classification
- Authors: Yukun Jiang
- Abstract summary: Active learning is a state-of-the-art machine learning approach for dealing with an abundance of unlabeled data.
In this research, traditional unsupervised k-means clustering is first modified into a semi-supervised version.
A novel attempt then further extends the algorithm to an active learning scenario with Penalized Min-Max selection.
Tested on a Chinese news dataset, the method shows a consistent increase in accuracy while lowering training cost.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Active learning is a state-of-the-art machine learning approach for
dealing with an abundance of unlabeled data. In the field of Natural Language
Processing, it is typically costly and time-consuming to have all the data
annotated. This inefficiency motivates our application of active learning to
text classification. In this research, traditional unsupervised k-means
clustering is first modified into a semi-supervised version. Then, in a novel
attempt, the algorithm is further extended to an active learning scenario with
Penalized Min-Max selection, so as to make a limited number of queries that
yield more stable initial centroids. This method utilizes both the interactive
query results from users and the underlying distance representation. Tested on
a Chinese news dataset, the method shows a consistent increase in accuracy
while lowering training cost.
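The abstract describes the method only at a high level, so the sketch below is a minimal illustration of one plausible reading, not the paper's implementation: the function names (penalized_min_max_queries, seeded_kmeans), the density-based penalty term, the penalty_weight parameter, and the toy data are all assumptions, since the abstract does not specify the exact penalty or update rules.

```python
import numpy as np

def penalized_min_max_queries(X, k, penalty_weight=1.0, seed=0):
    """Pick k points to ask the user about: each new query point is far
    from the points already chosen (min-max), minus a penalty so that
    isolated outliers are not selected as seeds (assumed penalty form)."""
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    penalty = dists.mean(axis=1)              # assumed outlier penalty: mean distance to all points
    chosen = [int(rng.integers(len(X)))]      # first query chosen at random
    while len(chosen) < k:
        d_min = dists[:, chosen].min(axis=1)  # distance to nearest already-chosen point
        score = d_min - penalty_weight * penalty
        score[chosen] = -np.inf               # never re-query the same point
        chosen.append(int(np.argmax(score)))
    return chosen

def seeded_kmeans(X, query_idx, query_labels, n_iter=100):
    """Semi-supervised k-means: initial centroids are the means of the
    queried points, grouped by the labels the user returned."""
    classes = sorted(set(query_labels))
    centroids = np.stack([
        X[[i for i, y in zip(query_idx, query_labels) if y == c]].mean(axis=0)
        for c in classes])
    for _ in range(n_iter):
        assign = np.linalg.norm(X[:, None] - centroids[None], axis=-1).argmin(axis=1)
        new = np.stack([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(len(classes))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return assign, centroids

# Toy demo (hypothetical data): three 2-D Gaussian blobs, with the
# interactive user simulated by ground-truth labels.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in [(0, 0), (3, 3), (0, 3)]])
truth = np.repeat([0, 1, 2], 50)
idx = penalized_min_max_queries(X, k=3)
assign, centroids = seeded_kmeans(X, idx, [int(truth[i]) for i in idx])
```

In this reading, the penalty keeps the min-max rule from spending the limited query budget on outliers, so the queried points, and hence the initial centroids, land near distinct dense regions; the number of clusters recovered equals the number of distinct labels the queries return.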
Related papers
- Employing Sentence Space Embedding for Classification of Data Stream from Fake News Domain [0.24999074238880487]
This paper presents an approach to natural language data stream classification using the sentence space method.
It allows deep convolutional networks designed for image classification to be used for the task of recognizing fake news from text data.
Based on the real-life Fakeddit dataset, the proposed approach was compared with state-of-the-art algorithms for data stream classification.
arXiv Detail & Related papers (2024-07-15T15:23:21Z)
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models [3.546617486894182]
We introduce HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks.
Results show that it outperforms the reproduced self-training approaches and reaches classification results comparable to previous experiments for three out of four datasets.
arXiv Detail & Related papers (2024-06-13T15:06:11Z)
- Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization [57.38123229553157]
This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems.
We focus on achieving language adaptation using minimal labeled and unlabeled data.
Experimental results show that our framework is able to synthesize intelligible speech in unseen languages with only 4 utterances of labeled data and 15 minutes of unlabeled data.
arXiv Detail & Related papers (2024-01-23T21:55:34Z)
- Cache & Distil: Optimising API Calls to Large Language Models [82.32065572907125]
Large-scale deployment of generative AI tools often depends on costly API calls to a Large Language Model (LLM) to fulfil user queries.
To curtail the frequency of these calls, one can employ a smaller language model -- a student.
This student gradually gains proficiency in independently handling an increasing number of user requests.
arXiv Detail & Related papers (2023-10-20T15:01:55Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- TAAL: Test-time Augmentation for Active Learning in Medical Image Segmentation [7.856339385917824]
This paper proposes Test-time Augmentation for Active Learning (TAAL), a novel semi-supervised active learning approach for segmentation.
Our results on a publicly-available dataset of cardiac images show that TAAL outperforms existing baseline methods in both fully-supervised and semi-supervised settings.
arXiv Detail & Related papers (2023-01-16T22:19:41Z)
- Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media [0.0]
We present a novel pool-based active learning method for training on a large unlabeled corpus with minimum annotation cost.
Our proposed method does not have any parameters to be tuned, making it dataset-independent.
Our method achieves a higher performance in comparison to the state-of-the-art active learning strategies.
arXiv Detail & Related papers (2022-01-28T19:19:03Z)
- ZeroBERTo -- Leveraging Zero-Shot Text Classification by Topic Modeling [57.80052276304937]
This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task.
We show that ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12% in the F1 score in the FolhaUOL dataset.
arXiv Detail & Related papers (2022-01-04T20:08:17Z)
- Bayesian active learning for production, a systematic study and a reusable library [85.32971950095742]
In this paper, we analyse the main drawbacks of current active learning techniques.
We do a systematic study on the effects of the most common issues of real-world datasets on the deep active learning process.
We derive two techniques that can speed up the active learning loop: partial uncertainty sampling and a larger query size.
arXiv Detail & Related papers (2020-06-17T14:51:11Z)
- Rényi Entropy Bounds on the Active Learning Cost-Performance Tradeoff [27.436483977171328]
Semi-supervised classification studies how to combine the statistical knowledge of the often abundant unlabeled data with the often limited labeled data in order to maximize overall classification accuracy.
In this paper, we initiate the non-asymptotic analysis of the optimal policy for semi-supervised classification with actively obtained labeled data.
We provide the first characterization of the jointly optimal active learning and semi-supervised classification policy, in terms of the cost-performance tradeoff driven by the label query budget and overall classification accuracy.
arXiv Detail & Related papers (2020-02-05T22:38:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.