Computer-Assisted Creation of Boolean Search Rules for Text
Classification in the Legal Domain
- URL: http://arxiv.org/abs/2112.05807v1
- Date: Fri, 10 Dec 2021 19:53:41 GMT
- Title: Computer-Assisted Creation of Boolean Search Rules for Text
Classification in the Legal Domain
- Authors: Hannes Westermann, Jaromir Savelka, Vern R. Walker, Kevin D. Ashley,
Karim Benyekhlef
- Abstract summary: We develop an interactive environment called CASE which exploits word co-occurrence to guide human annotators in selection of relevant search terms.
The system seamlessly facilitates iterative evaluation and improvement of the classification rules.
We evaluate classifiers created with our CASE system on 4 datasets, and compare the results to machine learning methods.
- Score: 0.5249805590164901
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a method of building strong, explainable
classifiers in the form of Boolean search rules. We developed an interactive
environment called CASE (Computer Assisted Semantic Exploration) which exploits
word co-occurrence to guide human annotators in selection of relevant search
terms. The system seamlessly facilitates iterative evaluation and improvement
of the classification rules. The process enables the human annotators to
leverage the benefits of statistical information while incorporating their
expert intuition into the creation of such rules. We evaluate classifiers
created with our CASE system on 4 datasets, and compare the results to machine
learning methods, including SKOPE rules, Random forest, Support Vector Machine,
and fastText classifiers. The results drive the discussion on trade-offs
between superior compactness, simplicity, and intuitiveness of the Boolean
search rules versus the better performance of state-of-the-art machine learning
models for text classification.
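To make the idea of a Boolean search rule classifier concrete, here is a minimal sketch of the general rule form the abstract describes: a document is labeled positive when it matches some required terms and none of the excluded terms. The specific terms and the `any_of`/`none_of` structure below are illustrative assumptions, not taken from the CASE system itself.

```python
import re

# A hypothetical Boolean search rule of the general shape the paper describes:
# positive iff (term1 OR term2) AND NOT term3.
# The terms are illustrative placeholders, not rules from the CASE study.
RULE = {
    "any_of": {"lease", "tenant"},
    "none_of": {"criminal"},
}

def tokenize(text):
    """Lowercase alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def classify(text, rule=RULE):
    """True if the document contains any 'any_of' term
    and none of the 'none_of' terms."""
    tokens = tokenize(text)
    return bool(tokens & rule["any_of"]) and not (tokens & rule["none_of"])

docs = [
    "The tenant disputed the lease termination.",
    "The criminal case involved a tenant.",
    "An unrelated contract dispute.",
]
print([classify(d) for d in docs])  # [True, False, False]
```

A rule like this is trivially inspectable and editable by a domain expert, which is the compactness/intuitiveness advantage the abstract weighs against the stronger raw performance of statistical classifiers.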
Related papers
- DISCERN: Decoding Systematic Errors in Natural Language for Text Classifiers [18.279429202248632]
We introduce DISCERN, a framework for interpreting systematic biases in text classifiers using language explanations.
DISCERN iteratively generates precise natural language descriptions of systematic errors by employing an interactive loop between two large language models.
We show that users can interpret systematic biases more effectively (by over 25% relative) and efficiently when described through language explanations as opposed to cluster exemplars.
arXiv Detail & Related papers (2024-10-29T17:04:55Z)
- Bisimulation Learning [55.859538562698496]
We compute finite bisimulations of state transition systems with large, possibly infinite state space.
Our technique yields faster verification results than alternative state-of-the-art tools in practice.
arXiv Detail & Related papers (2024-05-24T17:11:27Z)
- RulePrompt: Weakly Supervised Text Classification with Prompting PLMs and Self-Iterative Logical Rules [30.239044569301534]
Weakly supervised text classification (WSTC) has attracted increasing attention due to its applicability in classifying a mass of texts.
We propose a prompting PLM-based approach named RulePrompt for the WSTC task, consisting of a rule mining module and a rule-enhanced pseudo label generation module.
Our approach yields interpretable category rules, proving its advantage in disambiguating easily-confused categories.
arXiv Detail & Related papers (2024-03-05T12:50:36Z)
- Hierarchical Indexing for Retrieval-Augmented Opinion Summarization [60.5923941324953]
We propose a method for unsupervised abstractive opinion summarization that combines the attributability and scalability of extractive approaches with the coherence and fluency of Large Language Models (LLMs).
Our method, HIRO, learns an index structure that maps sentences to a path through a semantically organized discrete hierarchy.
At inference time, we populate the index and use it to identify and retrieve clusters of sentences containing popular opinions from input reviews.
arXiv Detail & Related papers (2024-03-01T10:38:07Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- Understanding and Mitigating Classification Errors Through Interpretable Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors.
We propose to discover those patterns of tokens that distinguish correct and erroneous predictions.
We show that our method, Premise, performs well in practice.
arXiv Detail & Related papers (2023-11-18T00:24:26Z)
- Prompt Algebra for Task Composition [131.97623832435812]
We consider Visual Language Models with prompt tuning as our base classifier.
We propose constrained prompt tuning to improve performance of the composite classifier.
On UTZappos it improves classification accuracy over the best base model by 8.45% on average.
arXiv Detail & Related papers (2023-06-01T03:20:54Z)
- A Meta-Learning Algorithm for Interrogative Agendas [3.0969191504482247]
We focus on formal concept analysis (FCA), a standard knowledge representation formalism, to express interrogative agendas.
Several FCA-based algorithms have already been in use for standard machine learning tasks such as classification and outlier detection.
In this paper, we propose a meta-learning algorithm to construct a good interrogative agenda explaining the data.
arXiv Detail & Related papers (2023-01-04T22:09:36Z)
- Perturbations and Subpopulations for Testing Robustness in Token-Based Argument Unit Recognition [6.502694770864571]
Argument Unit Recognition and Classification aims at identifying argument units from text and classifying them as pro or against.
One of the design choices that need to be made when developing systems for this task is what the unit of classification should be: segments of tokens or full sentences.
Previous research suggests that fine-tuning language models on the token-level yields more robust results for classifying sentences compared to training on sentences directly.
We reproduce the study that originally made this claim and further investigate what exactly token-based systems learned better compared to sentence-based ones.
arXiv Detail & Related papers (2022-09-29T13:44:28Z)
- Classifying Scientific Publications with BERT -- Is Self-Attention a Feature Selection Method? [0.0]
We investigate the self-attention mechanism of BERT in a fine-tuning scenario for the classification of scientific articles.
We observe how self-attention focuses on words that are highly related to the domain of the article.
We compare and evaluate the subset of the most attended words with feature selection methods normally used for text classification.
arXiv Detail & Related papers (2021-01-20T13:22:26Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.