Data-Centric Machine Learning in the Legal Domain
- URL: http://arxiv.org/abs/2201.06653v1
- Date: Mon, 17 Jan 2022 23:05:14 GMT
- Title: Data-Centric Machine Learning in the Legal Domain
- Authors: Hannes Westermann, Jaromir Savelka, Vern R. Walker, Kevin D. Ashley,
Karim Benyekhlef
- Abstract summary: This paper explores how changes in a data set influence the measured performance of a model.
Using three publicly available data sets from the legal domain, we investigate how changes to their size, the train/test splits, and the human labelling accuracy impact the performance.
The observed effects are surprisingly pronounced, especially when the per-class performance is considered.
- Score: 0.2624902795082451
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning research typically starts with a fixed data set created
early in the process. The focus of the experiments is finding a model and
training procedure that result in the best possible performance in terms of
some selected evaluation metric. This paper explores how changes in a data set
influence the measured performance of a model. Using three publicly available
data sets from the legal domain, we investigate how changes to their size, the
train/test splits, and the human labelling accuracy impact the performance of a
trained deep learning classifier. We assess the overall performance (weighted
average) as well as the per-class performance. The observed effects are
surprisingly pronounced, especially when the per-class performance is
considered. We investigate how "semantic homogeneity" of a class, i.e., the
proximity of sentences in a semantic embedding space, influences the difficulty
of its classification. The presented results have far-reaching implications for
efforts related to data collection and curation in the field of AI & Law. The
results also indicate that enhancements to a data set could be considered,
alongside the advancement of the ML models, as an additional path for
increasing classification performance on various tasks in AI & Law. Finally, we
discuss the need for an established methodology to assess the potential effects
of data set properties.
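The abstract's notion of "semantic homogeneity" — the proximity of a class's sentences in a semantic embedding space — can be made concrete as the mean pairwise cosine similarity of the class's sentence embeddings. The sketch below is a minimal illustration of that idea, not the paper's actual measure or pipeline; the embedding vectors and the toy data are placeholders.

```python
# Hypothetical sketch: scoring a class's "semantic homogeneity" as the mean
# pairwise cosine similarity of its sentence embeddings. In practice the
# embeddings would come from a sentence encoder; here they are synthetic.
import numpy as np

def class_homogeneity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity among a class's sentence embeddings."""
    # Normalize each embedding to unit length so dot products are cosines.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T  # full cosine-similarity matrix
    n = len(embeddings)
    # Average the off-diagonal entries (exclude each sentence with itself).
    return (sims.sum() - n) / (n * (n - 1))

# Toy example: a tightly clustered class vs. a scattered one.
rng = np.random.default_rng(0)
tight = rng.normal(loc=[1.0, 0.0, 0.0], scale=0.05, size=(20, 3))
loose = rng.normal(loc=0.0, scale=1.0, size=(20, 3))
print(class_homogeneity(tight) > class_homogeneity(loose))  # True
```

Under the paper's hypothesis, a class like `tight` (high homogeneity) would be easier for a classifier to learn than one like `loose`.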
Related papers
- How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z) - Word Matters: What Influences Domain Adaptation in Summarization? [43.7010491942323]
This paper investigates the fine-grained factors affecting domain adaptation performance.
We propose quantifying dataset learning difficulty as the learning difficulty of generative summarization.
Our experiments conclude that, when considering dataset learning difficulty, the cross-domain overlap and the performance gain in summarization tasks exhibit an approximate linear relationship.
arXiv Detail & Related papers (2024-06-21T02:15:49Z) - Scaling Laws for the Value of Individual Data Points in Machine Learning [55.596413470429475]
We introduce a new perspective by investigating scaling behavior for the value of individual data points.
We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes.
Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
arXiv Detail & Related papers (2024-05-30T20:10:24Z) - Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z) - Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
"influence scores" are used to identify important subsets of data.
In this paper, we explore the applicability of influence scores in language classification tasks.
arXiv Detail & Related papers (2023-11-27T20:19:22Z) - A classification performance evaluation measure considering data separability [6.751026374812737]
We propose a new separability measure--the rate of separability (RS)--based on the data coding rate.
We demonstrate the positive correlation between the proposed measure and recognition accuracy in a multi-task scenario constructed from a real dataset.
arXiv Detail & Related papers (2022-11-10T09:18:26Z) - Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees [1.5605040219256345]
We propose a method based on metrics from training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example.
We show results on detecting noisy labels in order to clean datasets, improving models' metrics in synthetic and real public datasets, as well as on an industry case in which we deployed a model based on the proposed solution.
arXiv Detail & Related papers (2022-10-20T15:02:49Z) - No Fear of Heterogeneity: Classifier Calibration for Federated Learning with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Classifier Calibration with Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated Gaussian mixture model.
Experimental results demonstrate that CCVR achieves state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z) - Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
Dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based active learning (AL) strategies are fairer in their decisions with respect to a protected class.
We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD.
arXiv Detail & Related papers (2021-04-14T14:20:22Z) - Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performances and achieving population level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z) - Revisiting Data Complexity Metrics Based on Morphology for Overlap and Imbalance: Snapshot, New Overlap Number of Balls Metrics and Singular Problems Prospect [9.666866159867444]
This research work focuses on revisiting complexity metrics based on data morphology.
Being based on ball coverage by classes, they are named after Overlap Number of Balls.
arXiv Detail & Related papers (2020-07-15T18:21:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.