A Comparison of Approaches for Imbalanced Classification Problems in the
Context of Retrieving Relevant Documents for an Analysis
- URL: http://arxiv.org/abs/2205.01600v1
- Date: Tue, 3 May 2022 16:22:42 GMT
- Title: A Comparison of Approaches for Imbalanced Classification Problems in the
Context of Retrieving Relevant Documents for an Analysis
- Authors: Sandra Wankmüller
- Abstract summary: The study compares query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning.
Results show that query expansion techniques and topic model-based classification rules in most studied settings tend to decrease rather than increase retrieval performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the first steps in many text-based social science studies is to
retrieve documents that are relevant for the analysis from large corpora of
otherwise irrelevant documents. The conventional approach in social science to
address this retrieval task is to apply a set of keywords and to consider those
documents to be relevant that contain at least one of the keywords. But the
application of incomplete keyword lists risks drawing biased inferences. More
complex and costly methods such as query expansion techniques, topic
model-based classification rules, and active as well as passive supervised
learning could have the potential to more accurately separate relevant from
irrelevant documents and thereby reduce the potential size of bias. Yet,
whether applying these more expensive approaches increases retrieval
performance compared to keyword lists at all, and if so, by how much, is
unclear as a comparison of these approaches is lacking. This study closes this
gap by comparing these methods across three retrieval tasks associated with a
data set of German tweets (Linder, 2017), the Social Bias Inference Corpus
(SBIC) (Sap et al., 2020), and the Reuters-21578 corpus (Lewis, 1997). Results
show that query expansion techniques and topic model-based classification rules
in most studied settings tend to decrease rather than increase retrieval
performance. Active supervised learning, however, if applied on a not too small
set of labeled training instances (e.g. 1,000 documents), reaches a
substantially higher retrieval performance than keyword lists.
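The keyword-list baseline described in the abstract (a document counts as relevant if it contains at least one keyword) can be illustrated with a minimal sketch. The toy documents, the keyword list, the relevance labels, and the function names `keyword_retrieve` and `precision_recall_f1` are all assumptions made for illustration; they are not taken from the study, and its actual evaluation setup may differ.

```python
import re


def keyword_retrieve(documents, keywords):
    """Return the indices of documents that contain at least one keyword.

    Matching is case-insensitive and on whole words (an assumption made
    for this sketch).
    """
    patterns = [
        re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for keyword in keywords
    ]
    return {
        index
        for index, document in enumerate(documents)
        if any(pattern.search(document) for pattern in patterns)
    }


def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for a set of retrieved indices."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical toy corpus: two relevant and two irrelevant documents.
docs = [
    "Refugees arrive at the border",       # relevant, matched
    "Asylum seekers wait for a decision",  # relevant, missed by the list
    "New phone released today",            # irrelevant, not matched
    "The border collie won the dog show",  # irrelevant, but matches "border"
]
relevant = {0, 1}

retrieved = keyword_retrieve(docs, ["refugees", "border"])
precision, recall, f1 = precision_recall_f1(retrieved, relevant)
print(sorted(retrieved), precision, recall, f1)
```

In this toy case the incomplete keyword list misses one relevant document and picks up one irrelevant document, which is the kind of retrieval error the abstract warns can bias downstream inferences.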
Related papers
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
- Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$^2$).
GR$^2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$^2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z)
- ExcluIR: Exclusionary Neural Information Retrieval [74.08276741093317]
We present ExcluIR, a set of resources for exclusionary retrieval.
The evaluation benchmark includes 3,452 high-quality exclusionary queries.
The training set contains 70,293 exclusionary queries, each paired with a positive document and a negative document.
arXiv Detail & Related papers (2024-04-26T09:43:40Z)
- Improving Topic Relevance Model by Mix-structured Summarization and LLM-based Data Augmentation [16.170841777591345]
In most social search scenarios such as Dianping, modeling search relevance always faces two challenges.
We first take the query concatenated with the query-based summary and the document summary without the query as the input of the topic relevance model.
Then, we utilize the language understanding and generation abilities of a large language model (LLM) to rewrite and generate queries from the queries and documents in existing training data.
arXiv Detail & Related papers (2024-04-03T10:05:47Z)
- Query Expansion Using Contextual Clue Sampling with Language Models [69.51976926838232]
We propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context.
Our lexical matching based approach achieves a similar top-5/top-20 retrieval accuracy and higher top-100 accuracy compared with the well-established dense retrieval model DPR.
For end-to-end QA, the reader model also benefits from our method and achieves the highest Exact-Match score against several competitive baselines.
arXiv Detail & Related papers (2022-10-13T15:18:04Z)
- CODER: An efficient framework for improving retrieval through
COntextualized Document Embedding Reranking [11.635294568328625]
We present a framework for improving the performance of a wide class of retrieval models at minimal computational cost.
It utilizes precomputed document representations extracted by a base dense retrieval method.
It incurs a negligible computational overhead on top of any first-stage method at run time, allowing it to be easily combined with any state-of-the-art dense retrieval method.
arXiv Detail & Related papers (2021-12-16T10:25:26Z)
- Out-of-Category Document Identification Using Target-Category Names as
Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve the Micro F1-score by 7% over current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z)
- Efficient Clustering from Distributions over Topics [0.0]
We present an approach that relies on the results of a topic modeling algorithm over documents in a collection as a means to identify smaller subsets of documents where the similarity function can be computed.
This approach has proved to obtain promising results when identifying similar documents in the domain of scientific publications.
arXiv Detail & Related papers (2020-12-15T10:52:19Z)