Retrieval-based Text Selection for Addressing Class-Imbalanced Data in
Classification
- URL: http://arxiv.org/abs/2307.14899v2
- Date: Thu, 9 Nov 2023 19:39:23 GMT
- Title: Retrieval-based Text Selection for Addressing Class-Imbalanced Data in
Classification
- Authors: Sareh Ahmadi, Aditya Shah, Edward Fox
- Abstract summary: This paper addresses the problem of selecting a set of texts for annotation in text classification using retrieval methods.
An additional challenge is dealing with binary categories that have a small number of positive instances, reflecting severe class imbalance.
We introduce an effective method for selecting a small set of texts for annotation and building high-quality classifiers.
- Score: 0.6650227510403052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of selecting a set of texts for
annotation in text classification using retrieval methods when there are limits
on the number of annotations due to constraints on human resources. An
additional challenge addressed is dealing with binary categories that have a
small number of positive instances, reflecting severe class imbalance. In our
situation, where annotation occurs over a long time period, the selection of
texts to be annotated can be made in batches, with previous annotations guiding
the choice of the next set. To address these challenges, the paper proposes
leveraging SHAP to construct a quality set of queries for Elasticsearch and
semantic search, to try to identify optimal sets of texts for annotation that
will help with class imbalance. The approach is tested on sets of cue texts
describing possible future events, constructed by participants involved in
studies aimed to help with the management of obesity and diabetes. We introduce
an effective method for selecting a small set of texts for annotation and
building high-quality classifiers. We integrate vector search, semantic search,
and machine learning classifiers to yield a good solution. Our experiments
demonstrate improved F1 scores for the minority classes in binary
classification.
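The selection loop described in the abstract can be sketched in miniature: take the most important tokens for the minority class (SHAP values from the trained classifier in the paper; toy importance scores here), build a keyword query from them, and rank the unannotated texts by how well they match, as an Elasticsearch bool/should query would server-side. All names, scores, and texts below are illustrative stand-ins, not the paper's actual pipeline.

```python
# Hypothetical sketch of SHAP-guided query construction for batch selection.
# Toy importance scores stand in for SHAP values of a minority-class model.

def build_query_terms(importance, k=3):
    """Pick the k tokens with the highest importance for the positive class."""
    return [t for t, _ in sorted(importance.items(), key=lambda kv: -kv[1])[:k]]

def rank_candidates(texts, terms):
    """Score each unlabeled text by keyword overlap (an Elasticsearch
    bool/should query would compute this server-side); best matches first."""
    scored = [(sum(t in text.lower().split() for t in terms), text) for text in texts]
    return [text for score, text in sorted(scored, key=lambda s: -s[0]) if score > 0]

importance = {"relapse": 0.9, "diet": 0.7, "weight": 0.6, "the": 0.01}
unlabeled = [
    "worried about a relapse in my diet",
    "the weather was nice today",
    "tracking my weight every morning",
]
terms = build_query_terms(importance)
batch = rank_candidates(unlabeled, terms)  # texts to send for annotation next
```

In the paper's setting this ranking would be combined with semantic (vector) search and repeated per batch, with each round of annotations refreshing the classifier and hence the query terms.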
Related papers
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners in Large Language Models [66.32043210237768]
This paper introduces an influence-driven selective annotation method.
It aims to minimize annotation costs while improving the quality of in-context examples.
Experiments confirm the superiority of the proposed method on various benchmarks.
arXiv Detail & Related papers (2023-10-16T22:53:54Z)
- Prefer to Classify: Improving Text Classifiers via Auxiliary Preference Learning [76.43827771613127]
In this paper, we investigate task-specific preferences between pairs of input texts as a new alternative way for such auxiliary data annotation.
We propose a novel multi-task learning framework, called prefer-to-classify (P2C), which can enjoy the cooperative effect of learning both the given classification task and the auxiliary preferences.
arXiv Detail & Related papers (2023-06-08T04:04:47Z)
- Task-Specific Embeddings for Ante-Hoc Explainable Text Classification [6.671252951387647]
We propose an alternative training objective in which we learn task-specific embeddings of text.
Our proposed objective learns embeddings such that all texts that share the same target class label should be close together.
We present extensive experiments which show that the benefits of ante-hoc explainability and incremental learning come at no cost in overall classification accuracy.
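An objective that pulls same-class texts together naturally supports nearest-class-centroid prediction at inference time. The snippet below is only a minimal illustration of that idea with toy 2-D vectors, not the paper's training code or model.

```python
# Nearest-class-centroid prediction over toy 2-D "embeddings" in which
# texts sharing a label cluster together (illustrative values only).

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def nearest_class(x, centroids):
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda c: sqdist(x, centroids[c]))

embeddings = {"pos": [[1.0, 1.0], [1.2, 0.8]], "neg": [[-1.0, -1.0], [-0.8, -1.2]]}
centroids = {label: centroid(vs) for label, vs in embeddings.items()}
label = nearest_class([0.9, 1.1], centroids)  # falls nearest the "pos" centroid
```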
arXiv Detail & Related papers (2022-11-30T19:56:25Z)
- Classifiers are Better Experts for Controllable Text Generation [63.17266060165098]
We show that the proposed method significantly outperforms recent PPLM, GeDi, and DExperts on PPL and sentiment accuracy based on the external classifier of generated texts.
At the same time, it is easier to implement and tune, and has significantly fewer restrictions and requirements.
arXiv Detail & Related papers (2022-05-15T12:58:35Z)
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
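"Learning to quantify" means estimating class prevalence in a dataset rather than labeling individual items. The simplest baseline, classify-and-count, is shown below with hypothetical predictions; LeQua itself evaluates far more sophisticated methods.

```python
# "Classify and count": the simplest quantification baseline, estimating
# the prevalence of the positive class from a classifier's hard predictions.

def classify_and_count(predictions):
    """Estimated positive prevalence = fraction of items predicted positive."""
    return sum(predictions) / len(predictions)

# Hypothetical hard predictions (1 = positive) over an unlabeled test set.
prevalence = classify_and_count([1, 0, 0, 1, 0, 0, 0, 1])  # 3/8 = 0.375
```

Under class imbalance this naive estimate inherits the classifier's bias, which is precisely why dedicated quantification methods exist.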
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
- Conical Classification For Computationally Efficient One-Class Topic Determination [0.0]
We propose a Conical classification approach to identify documents that relate to a particular topic.
We show in our analysis that our approach has higher predictive power on our datasets, and is also faster to compute.
arXiv Detail & Related papers (2021-10-31T01:27:12Z)
- OPAD: An Optimized Policy-based Active Learning Framework for Document Content Analysis [6.159771892460152]
We propose OPAD, a novel framework using a reinforcement-learning policy for active learning in content detection tasks for documents.
The framework learns the acquisition function to decide the samples to be selected while optimizing performance metrics.
We show superior performance of the proposed OPAD framework on active learning for various tasks related to document understanding.
arXiv Detail & Related papers (2021-10-01T07:40:56Z)
- Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve Micro F1-score by 7% over current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z)
- Rank over Class: The Untapped Potential of Ranking in Natural Language Processing [8.637110868126546]
We argue that many tasks which are currently addressed using classification are in fact being shoehorned into a classification mould.
We propose a novel end-to-end ranking approach consisting of a Transformer network responsible for producing representations for a pair of text sequences.
In an experiment on a heavily-skewed sentiment analysis dataset, converting ranking results to classification labels yields an approximately 22% improvement over state-of-the-art text classification.
arXiv Detail & Related papers (2020-09-10T22:18:57Z)
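The ranking-to-classification conversion mentioned in the last entry can be reduced to a simple rule: texts the ranking model scores above a reference anchor get the positive label. The scores and threshold below are illustrative stand-ins for the outputs of the paper's Transformer ranker.

```python
# Sketch of converting ranking scores into binary classification labels:
# a text ranked above a neutral anchor is labeled positive (toy scores).

def labels_from_ranking(scores, anchor_score):
    """A text ranked above the anchor gets the positive label (1)."""
    return [1 if s > anchor_score else 0 for s in scores]

# Hypothetical ranking scores for four texts against one anchor text.
labels = labels_from_ranking([0.92, 0.10, 0.75, 0.33], anchor_score=0.5)
```

On heavily skewed data this decouples the decision threshold from class frequencies, which is one intuition for why ranking can help with imbalance.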
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.