Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text
- URL: http://arxiv.org/abs/2409.02078v1
- Date: Tue, 3 Sep 2024 17:26:17 GMT
- Title: Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text
- Authors: Michael Burnham, Kayla Kahn, Ryan Yank Wang, Rachel X. Peng
- Abstract summary: Large language models can annotate documents without supervised training, an ability known as zero-shot learning.
This paper introduces the Political DEBATE language models for zero-shot and few-shot classification of political documents.
We release the PolNLI dataset used to train these models -- a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social scientists quickly adopted large language models due to their ability to annotate documents without supervised training, an ability known as zero-shot learning. However, due to their compute demands, cost, and often proprietary nature, these models are often at odds with replication and open science standards. This paper introduces the Political DEBATE (DeBERTa Algorithm for Textual Entailment) language models for zero-shot and few-shot classification of political documents. These models are not only as good as, or better than, state-of-the-art large language models at zero- and few-shot classification, but are orders of magnitude more efficient and completely open source. By training the models on a simple random sample of 10-25 documents, they can outperform supervised classifiers trained on hundreds or thousands of documents, as well as state-of-the-art generative models with complex, engineered prompts. Additionally, we release the PolNLI dataset used to train these models -- a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
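The classification approach described above frames each labeling task as textual entailment: the document is the premise, and each candidate label is turned into a hypothesis; the label whose hypothesis the NLI model most strongly entails wins. A minimal sketch of that framing is below. The scoring function here is a toy stand-in for illustration; a real pipeline would obtain entailment scores from a DeBERTa-style NLI model (e.g. via a zero-shot-classification pipeline), and the hypothesis template is an assumption, not the paper's exact wording.

```python
from typing import Callable

def zero_shot_classify(
    document: str,
    labels: list[str],
    entailment_score: Callable[[str, str], float],
    template: str = "This text is about {}.",
) -> str:
    """Pair the document (premise) with one hypothesis per candidate
    label and return the label whose hypothesis scores highest."""
    scores = {
        label: entailment_score(document, template.format(label))
        for label in labels
    }
    return max(scores, key=scores.get)

# Toy entailment scorer for illustration only: keyword overlap.
# A real implementation would return an NLI model's entailment logit.
def toy_score(premise: str, hypothesis: str) -> float:
    words = set(premise.lower().split())
    return sum(w.strip(".").lower() in words for w in hypothesis.split())

doc = "The senator introduced a bill to cut taxes on small businesses."
print(zero_shot_classify(doc, ["taxes", "immigration"], toy_score))  # → taxes
```

Because every task reduces to the same premise/hypothesis interface, one trained entailment model can serve arbitrarily many classification tasks, which is what lets a single model cover the 800+ tasks in PolNLI.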
Related papers
- Small Language Models are Good Too: An Empirical Study of Zero-Shot Classification
We benchmark language models from 77M to 40B parameters using different architectures and scoring functions.
Our findings reveal that small models can effectively classify texts, getting on par with or surpassing their larger counterparts.
This research underscores the notion that bigger isn't always better, suggesting that resource-efficient small models may offer viable solutions for specific data classification challenges.
arXiv Detail & Related papers (2024-04-17T07:10:28Z) - Language Models for Text Classification: Is In-Context Learning Enough?
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
We propose a new framework that alleviates manual effort by replacing human labeling with natural language interactions.
Our framework eliminates the need for crowd-sourced annotations.
Our trained models outperform traditional Agile Modeling as well as state-of-the-art zero-shot classification models.
arXiv Detail & Related papers (2024-03-05T03:34:11Z) - NLLB-CLIP -- train performant multilingual image retrieval model on a budget
We present NLLB-CLIP - CLIP model with a text encoder from the NLLB model.
We used an automatically created dataset of 106,246 good-quality images with captions in 201 languages.
We show that NLLB-CLIP is comparable in quality to state-of-the-art models and significantly outperforms them on low-resource languages.
arXiv Detail & Related papers (2023-09-04T23:26:11Z) - Zero-Shot Learners for Natural Language Understanding via a Unified Multiple Choice Perspective
Zero-shot learning aims to train a model on a given task such that it can address new learning tasks without any additional training.
Our approach converts zero-shot learning into multiple-choice tasks, avoiding problems in commonly used large-scale generative models such as FLAN.
Our approach shows state-of-the-art performance on several benchmarks and produces satisfactory results on tasks such as natural language inference and text classification.
arXiv Detail & Related papers (2022-10-16T17:24:06Z) - ZeroBERTo -- Leveraging Zero-Shot Text Classification by Topic Modeling
This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task.
We show that ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12% in F1 score on the FolhaUOL dataset.
arXiv Detail & Related papers (2022-01-04T20:08:17Z) - Extracting Training Data from Large Language Models
This paper demonstrates that an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.
We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data.
arXiv Detail & Related papers (2020-12-14T18:39:09Z) - A Large-Scale Chinese Short-Text Conversation Dataset
We present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues).
The quality of our dataset is ensured by a rigorous, rule-based data cleaning pipeline.
We also release pre-training dialogue models which are trained on LCCC-base and LCCC-large respectively.
arXiv Detail & Related papers (2020-08-10T08:12:49Z) - SPECTER: Document-level Representation Learning using Citation-informed Transformers
SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z) - Training Keyword Spotters with Limited and Synthesized Speech Data
We show that a model which detects 10 keywords when trained on only synthetic speech is equivalent to a model trained on over 500 real examples.
We also show that a model without our speech embeddings would need to be trained on over 4000 real examples to reach the same accuracy.
arXiv Detail & Related papers (2020-01-31T07:50:42Z)