ZeroBERTo -- Leveraging Zero-Shot Text Classification by Topic Modeling
- URL: http://arxiv.org/abs/2201.01337v1
- Date: Tue, 4 Jan 2022 20:08:17 GMT
- Title: ZeroBERTo -- Leveraging Zero-Shot Text Classification by Topic Modeling
- Authors: Alexandre Alcoforado, Thomas Palmeira Ferraz, Rodrigo Gerber, Enzo
Bustos, André Seidel Oliveira, Bruno Miguel Veloso, Fabio Levy Siqueira,
Anna Helena Reali Costa
- Abstract summary: This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task.
We show that ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12% in the F1 score on the FolhaUOL dataset.
- Score: 57.80052276304937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional text classification approaches often require a good amount of
labeled data, which is difficult to obtain, especially in restricted domains or
less widespread languages. This lack of labeled data has led to the rise of
low-resource methods, which assume low data availability in natural language
processing. Among them, zero-shot learning stands out, which consists of
learning a classifier without any previously labeled data. The best results
reported with this approach use language models such as Transformers, but
suffer from two problems: high execution time and the inability to handle long texts as
input. This paper proposes a new model, ZeroBERTo, which leverages an
unsupervised clustering step to obtain a compressed data representation before
the classification task. We show that ZeroBERTo has better performance for long
inputs and shorter execution time, outperforming XLM-R by about 12% in the F1
score on the FolhaUOL dataset. Keywords: Low-Resource NLP, Unlabeled data,
Zero-Shot Learning, Topic Modeling, Transformers.
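The two-stage idea described in the abstract (compress the corpus with an unsupervised clustering step, then classify) can be sketched in miniature. This is not the authors' implementation: ZeroBERTo itself builds on Transformer embeddings and a proper topic model, whereas the toy below substitutes bag-of-words vectors, a tiny two-cluster spherical k-means, and label-description matching for the zero-shot labeling step; all documents, label descriptions, and helper names are illustrative.

```python
# Toy sketch of a ZeroBERTo-style pipeline (not the authors' code):
# 1) represent documents (here: bag-of-words instead of Transformer embeddings),
# 2) cluster them (here: a tiny spherical 2-means stands in for the topic model),
# 3) zero-shot-label each *cluster* by similarity to a label description.
import math
from collections import Counter

def vectorize(text, vocab):
    counts = Counter(text.lower().split())
    return [counts.get(w, 0) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def two_means(vecs, iters=5):
    # farthest-first init: seed with the first doc, then the least similar one
    c0 = list(vecs[0])
    c1 = list(min(vecs[1:], key=lambda v: cosine(v, c0)))
    centroids = [c0, c1]
    for _ in range(iters):
        assign = [max((0, 1), key=lambda c: cosine(v, centroids[c])) for v in vecs]
        for c in (0, 1):
            members = [v for v, a in zip(vecs, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

docs = [
    "the team won the match and scored two goals",
    "the striker scored a late goal in the final match",
    "the government passed a new tax law in congress",
    "senators debated the law before the congress vote",
]
# zero-shot: class descriptions stand in for labeled examples
label_desc = {"sports": "match goal team scored", "politics": "law congress vote tax"}

vocab = sorted({w for d in docs for w in d.lower().split()}
               | {w for desc in label_desc.values() for w in desc.split()})
vecs = [vectorize(d, vocab) for d in docs]
assign, centroids = two_means(vecs)

# label each cluster once, then propagate the label to its members
cluster_label = {
    c: max(label_desc,
           key=lambda name: cosine(centroids[c], vectorize(label_desc[name], vocab)))
    for c in (0, 1)
}
print([cluster_label[a] for a in assign])  # ['sports', 'sports', 'politics', 'politics']
```

The design point the sketch tries to show is that the (expensive) zero-shot classifier runs once per cluster rather than once per document, which is what shortens execution time on large corpora.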
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Leveraging Weakly Annotated Data for Hate Speech Detection in Code-Mixed Hinglish: A Feasibility-Driven Transfer Learning Approach with Large Language Models [0.0]
Hate speech detection in code-mixed low-resource languages is an active problem area where the use of Large Language Models has proven beneficial.
In this study, we have compiled a dataset of 100 YouTube comments, and weakly labelled them for coarse and fine-grained misogyny classification in code-mixed Hinglish.
Out of all the approaches, zero-shot classification using the Bidirectional Auto-Regressive Transformers (BART) large model and few-shot prompting using Generative Pre-trained Transformer-3 (ChatGPT-3) achieve the best results.
arXiv Detail & Related papers (2024-03-04T15:27:49Z)
- STENCIL: Submodular Mutual Information Based Weak Supervision for Cold-Start Active Learning [1.9116784879310025]
We present STENCIL, which improves overall accuracy by 10%-18% and rare-class F-1 score by 17%-40% on multiple text classification datasets over common active learning methods within the class-imbalanced cold-start setting.
arXiv Detail & Related papers (2024-02-21T01:54:58Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Zero-Shot Text Classification with Self-Training [8.68603153534916]
We show that fine-tuning the zero-shot classifier on its most confident predictions leads to significant performance gains across a wide range of text classification tasks.
Self-training adapts the zero-shot model to the task at hand.
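The confidence-based selection step that drives this kind of self-training can be illustrated with a minimal, hypothetical sketch (the threshold, data, and helper name are made up; the paper itself fine-tunes a full entailment-based zero-shot classifier on its pseudo-labels):

```python
# Hypothetical sketch of one self-training round: keep only the zero-shot
# classifier's most confident predictions as pseudo-labels for fine-tuning.
def select_confident(predictions, threshold=0.9):
    """predictions: list of (text, predicted_label, confidence)."""
    return [(text, label) for text, label, conf in predictions if conf >= threshold]

preds = [
    ("great movie", "positive", 0.97),
    ("terrible plot", "negative", 0.95),
    ("it was fine", "positive", 0.55),  # too uncertain; dropped
]
pseudo_labeled = select_confident(preds)
print(pseudo_labeled)  # [('great movie', 'positive'), ('terrible plot', 'negative')]
```

In a full loop, the model would be fine-tuned on `pseudo_labeled`, re-run over the unlabeled pool, and the selection repeated.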
arXiv Detail & Related papers (2022-10-31T17:55:00Z)
- Beyond prompting: Making Pre-trained Language Models Better Zero-shot Learners by Clustering Representations [24.3378487252621]
We show that zero-shot text classification can be improved simply by clustering texts in the embedding spaces of pre-trained language models.
Our approach achieves an average of 20% absolute improvement over prompt-based zero-shot learning.
arXiv Detail & Related papers (2022-10-29T16:01:51Z)
- Language Models in the Loop: Incorporating Prompting into Weak Supervision [11.10422546502386]
We propose a new strategy for applying large pre-trained language models to novel tasks when labeled training data is limited.
Instead of applying the model in a typical zero-shot or few-shot fashion, we treat the model as the basis for labeling functions in a weak supervision framework.
arXiv Detail & Related papers (2022-05-04T20:42:40Z)
- Prompt Consistency for Zero-Shot Task Generalization [118.81196556175797]
In this paper, we explore methods to utilize unlabeled data to improve zero-shot performance.
Specifically, we take advantage of the fact that multiple prompts can be used to specify a single task, and propose to regularize prompt consistency.
Our approach outperforms the state-of-the-art zero-shot learner, T0, on 9 out of 11 datasets across 4 NLP tasks by up to 10.6 absolute points in terms of accuracy.
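A minimal sketch of what such a prompt-consistency regularizer can look like, assuming the idea is to penalize disagreement between the output distributions the model produces under paraphrased prompts for the same input (the symmetric-KL form and all values below are illustrative, not the paper's exact loss):

```python
# Hypothetical prompt-consistency penalty: average symmetric KL divergence
# between the class distributions obtained under different prompts.
import math

def kl(p, q, eps=1e-9):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(dists):
    """dists: one class-probability vector per prompt, same input."""
    pairs = [(p, q) for i, p in enumerate(dists) for q in dists[i + 1:]]
    return sum(kl(p, q) + kl(q, p) for p, q in pairs) / len(pairs)

# Prompts that agree closely yield a small penalty...
low = consistency_loss([[0.7, 0.2, 0.1], [0.65, 0.25, 0.1]])
# ...while disagreeing prompts yield a large one.
high = consistency_loss([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
print(low < high)  # True
```

Minimizing this penalty on unlabeled inputs pushes the model toward prompt-invariant predictions, which is the regularization effect the summary describes.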
arXiv Detail & Related papers (2022-04-29T19:18:37Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and are proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.