Related papers: Utilizing Weak Supervision To Generate Indonesian Conservation Dataset

Utilizing Weak Supervision To Generate Indonesian Conservation Dataset

URL: http://arxiv.org/abs/2310.11258v2
Date: Tue, 24 Oct 2023 12:01:14 GMT
Title: Utilizing Weak Supervision To Generate Indonesian Conservation Dataset
Authors: Mega Fransiska, Diah Pitaloka, Saripudin, Satrio Putra, Lintang Sutawika
Abstract summary: Weak supervision has emerged as a promising approach for rapid and large-scale dataset creation. This paper aims to show how such an approach can be utilized to build an Indonesian NLP dataset from conservation news text.
Score: 3.357014575278386
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Weak supervision has emerged as a promising approach for rapid and large-scale dataset creation in response to the increasing demand for accelerated NLP development. By leveraging labeling functions, weak supervision allows practitioners to generate datasets quickly by creating learned label models that produce soft-labeled datasets. This paper aims to show how such an approach can be utilized to build an Indonesian NLP dataset from conservation news text. We construct two types of datasets: multi-class classification and sentiment classification. We then provide baseline experiments using various pretrained language models. These baseline results demonstrate test performances of 59.79% accuracy and 55.72% F1-score for sentiment classification, 66.87% F1-score-macro, 71.5% F1-score-micro, and 83.67% ROC-AUC for multi-class classification. Additionally, we release the datasets and labeling functions used in this work for further research and exploration.

Related papers

Advancing Scientific Text Classification: Fine-Tuned Models with Dataset Expansion and Hard-Voting [0.0]
BERT, SciBERT, BioBERT, and BlueBERT are fine-tuned on the Web of Science (WoS-46985) dataset for scientific text classification. We augment the dataset by executing seven targeted queries in the WoS database, retrieving 1,000 articles per category aligned with WoS-46985's main classes. Fine-tuning on the expanded dataset with dynamic learning rates and early stopping significantly boosts classification accuracy.
arXiv Detail & Related papers (2025-04-26T21:06:49Z)
A Small Claims Court for the NLP: Judging Legal Text Classification Strategies With Small Datasets [0.0]
This paper investigates the best strategies for optimizing the use of a small labeled dataset and large amounts of unlabeled data. We use the records of demands to a Brazilian Public Prosecutor's Office aiming to assign the descriptions in one of the subjects. The best result was obtained with Unsupervised Data Augmentation (UDA), which jointly uses BERT, data augmentation, and strategies of semi-supervised learning.
arXiv Detail & Related papers (2024-09-09T18:10:05Z)
Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels. By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data. The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
A Self Supervised StyleGAN for Image Annotation and Classification with Extremely Limited Labels [35.43549147657739]
We propose SS-StyleGAN, a self-supervised approach for image annotation and classification suitable for extremely small annotated datasets. We show that the proposed method attains strong classification results using small labeled datasets of sizes 50 and even 10.
arXiv Detail & Related papers (2023-12-26T09:46:50Z)
Is margin all you need? An extensive empirical study of active learning on tabular data [66.18464006872345]
We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark. Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including current state-of-art.
arXiv Detail & Related papers (2022-10-07T21:18:24Z)
Active Transfer Prototypical Network: An Efficient Labeling Algorithm for Time-Series Data [1.7205106391379026]
This paper proposes a novel Few-Shot Learning (FSL)-based AL framework, which addresses the trade-off problem by incorporating a Prototypical Network (ProtoNet) in the AL iterations. This framework was validated on UCI HAR/HAPT dataset and a real-world braking maneuver dataset. The learning performance significantly surpasses traditional AL algorithms on both datasets, achieving 90% classification accuracy with 10% and 5% labeling effort, respectively.
arXiv Detail & Related papers (2022-09-28T16:14:40Z)
AutoGeoLabel: Automated Label Generation for Geospatial Machine Learning [69.47585818994959]
We evaluate a big data processing pipeline to auto-generate labels for remote sensing data. We utilize the big geo-data platform IBM PAIRS to dynamically generate such labels in dense urban areas.
arXiv Detail & Related papers (2022-01-31T20:02:22Z)
Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media [0.0]
We present a novel pool-based active learning method for the training of large unlabeled corpus with minimum annotation cost. Our proposed method does not have any parameters to be tuned, making it dataset-independent. Our method achieves a higher performance in comparison to the state-of-the-art active learning strategies.
arXiv Detail & Related papers (2022-01-28T19:19:03Z)
The Word is Mightier than the Label: Learning without Pointillistic Labels using Data Programming [11.536162323162099]
Most advanced supervised Machine Learning (ML) models rely on vast amounts of point-by-point labelled training examples. Hand-labelling vast amounts of data may be tedious, expensive, and error-prone.
arXiv Detail & Related papers (2021-08-24T19:11:28Z)
Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts. We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data. We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
SLADE: A Self-Training Framework For Distance Metric Learning [75.54078592084217]
We present a self-training framework, SLADE, to improve retrieval performance by leveraging additional unlabeled data. We first train a teacher model on the labeled data and use it to generate pseudo labels for the unlabeled data. We then train a student model on both labels and pseudo labels to generate final feature embeddings.
arXiv Detail & Related papers (2020-11-20T08:26:10Z)
Classify and Generate Reciprocally: Simultaneous Positive-Unlabelled Learning and Conditional Generation with Extra Data [77.31213472792088]
The scarcity of class-labeled data is a ubiquitous bottleneck in many machine learning problems. We address this problem by leveraging Positive-Unlabeled(PU) classification and the conditional generation with extra unlabeled data. We present a novel training framework to jointly target both PU classification and conditional generation when exposed to extra data.
arXiv Detail & Related papers (2020-06-14T08:27:40Z)
Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training. We experimentally verify that the new dataset can significantly improve the ability of the learned FER model. To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.