Rissanen Data Analysis: Examining Dataset Characteristics via
Description Length
- URL: http://arxiv.org/abs/2103.03872v1
- Date: Fri, 5 Mar 2021 18:58:32 GMT
- Title: Rissanen Data Analysis: Examining Dataset Characteristics via
Description Length
- Authors: Ethan Perez, Douwe Kiela, Kyunghyun Cho
- Abstract summary: We introduce a method to determine if a certain capability helps to achieve an accurate model of given data.
Since minimum program length is uncomputable, we estimate the labels' minimum description length (MDL) as a proxy.
We call the method Rissanen Data Analysis (RDA) after the father of MDL.
- Score: 78.42578316883271
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a method to determine if a certain capability helps to achieve
an accurate model of given data. We view labels as being generated from the
inputs by a program composed of subroutines with different capabilities, and we
posit that a subroutine is useful if and only if the minimal program that
invokes it is shorter than the one that does not. Since minimum program length
is uncomputable, we instead estimate the labels' minimum description length
(MDL) as a proxy, giving us a theoretically-grounded method for analyzing
dataset characteristics. We call the method Rissanen Data Analysis (RDA) after
the father of MDL, and we showcase its applicability on a wide variety of
settings in NLP, ranging from evaluating the utility of generating subquestions
before answering a question, to analyzing the value of rationales and
explanations, to investigating the importance of different parts of speech, and to
uncovering dataset gender bias.
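
The core quantity behind RDA is a prequential (online-code) estimate of the labels' description length: the data are fed to a learner in blocks, and each block's labels are charged the code cost they incur under a model fit only on the earlier blocks. The sketch below illustrates this estimate and a with/without-capability comparison; the logistic-regression model, block schedule, and synthetic data are assumptions for illustration, not the authors' setup.

```python
# A minimal sketch of prequential (online-code) MDL estimation in the spirit of RDA:
# labels are "transmitted" block by block, and each block is charged the code cost
# it incurs under a model trained only on the preceding blocks. The model choice,
# block schedule, and toy data below are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

def prequential_codelength(X, y, n_blocks=8, n_classes=2, seed=0):
    """Total bits to transmit y given X with an online (prequential) code."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(y))
    X, y = X[order], y[order]
    bounds = np.linspace(0, len(y), n_blocks + 1).astype(int)
    total_bits = bounds[1] * np.log2(n_classes)            # first block: uniform code
    for i in range(1, n_blocks):
        train, test = slice(0, bounds[i]), slice(bounds[i], bounds[i + 1])
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        probs = model.predict_proba(X[test])
        cols = np.searchsorted(model.classes_, y[test])     # align labels with proba columns
        total_bits += -np.log2(probs[np.arange(len(cols)), cols] + 1e-12).sum()
    return total_bits

# Toy comparison: does adding a hypothetical "capability" feature shorten the
# description length of the labels? (The feature and data are synthetic.)
rng = np.random.RandomState(0)
n = 2000
signal = rng.randn(n)                                       # latent variable behind the label
X_base = rng.randn(n, 5)                                    # baseline inputs: uninformative
capability = (signal + 0.3 * rng.randn(n)).reshape(-1, 1)   # noisy view of the latent signal
y = (signal > 0).astype(int)

bits_without = prequential_codelength(X_base, y)
bits_with = prequential_codelength(np.hstack([X_base, capability]), y)
print(f"MDL without capability: {bits_without:.0f} bits; with capability: {bits_with:.0f} bits")
```

In RDA terms, the capability is judged useful when the codelength with it is reliably lower than without, mirroring the shorter-program criterion stated in the abstract.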
Related papers
- Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data; a minimal number of available labeled data points are then assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- LLVM Static Analysis for Program Characterization and Memory Reuse Profile Estimation [0.0]
This paper presents an LLVM-based probabilistic static analysis method.
It accurately predicts different program characteristics and estimates the reuse distance profile of a program.
The results show that our approach predicts application characteristics accurately when compared against Byfl, another LLVM-based dynamic code analysis tool.
arXiv Detail & Related papers (2023-11-20T23:05:06Z)
- Probing for Labeled Dependency Trees [25.723591566201343]
DepProbe is a linear probe which can extract labeled and directed dependency parse trees from embeddings.
Across 13 languages, our proposed method identifies the best source treebank of the time.
arXiv Detail & Related papers (2022-03-24T10:21:07Z)
- Parallel feature selection based on the trace ratio criterion [4.30274561163157]
This work presents a novel parallel feature selection approach for classification, namely Parallel Feature Selection using Trace criterion (PFST).
Our method uses the trace criterion, a measure of class separability used in Fisher's Discriminant Analysis, to evaluate feature usefulness.
The experiments show that our method can produce a small set of features in a fraction of the time taken by the other methods under comparison.
arXiv Detail & Related papers (2022-03-03T10:50:33Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We conduct a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Reducing Confusion in Active Learning for Part-Of-Speech Tagging [100.08742107682264]
Active learning (AL) uses a data selection algorithm to select useful training samples to minimize annotation cost.
We study the problem of selecting instances which maximally reduce the confusion between particular pairs of output tags.
Our proposed AL strategy outperforms other AL strategies by a significant margin.
arXiv Detail & Related papers (2020-11-02T06:24:58Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- Subtask Analysis of Process Data Through a Predictive Model [5.7668512557707166]
This paper develops a computationally efficient method for exploratory analysis of such process data.
The new approach segments a lengthy individual process into a sequence of short subprocesses to achieve complexity reduction.
We use the process data from PIAAC 2012 to demonstrate how exploratory analysis of process data can be done with the new approach.
arXiv Detail & Related papers (2020-08-29T21:11:01Z)
- Information-Theoretic Probing with Minimum Description Length [74.29846942213445]
We propose an alternative to the standard probes: information-theoretic probing with minimum description length (MDL).
With MDL probing, training a probe to predict labels is recast as teaching it to effectively transmit the data.
We show that these methods agree in results and are more informative and stable than the standard probes.
arXiv Detail & Related papers (2020-03-27T09:35:38Z)
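
For reference, the online (prequential) codelength used in MDL probing, and estimated empirically in the sketch after the RDA abstract above, can be stated compactly. The block boundaries t_1 < ... < t_S = N, the class count K, and the probe p_θ are assumed notation for this sketch, following the standard online-code form:

```latex
% Online codelength of labels y_{1:N} given representations x_{1:N}:
% the first t_1 labels are sent with a uniform code over K classes, and
% block i is encoded with a probe p_{\theta_{i-1}} trained on the first
% t_{i-1} examples.
\[
L_{\mathrm{online}}(y_{1:N} \mid x_{1:N})
  = t_1 \log_2 K
  \;-\; \sum_{i=2}^{S} \log_2 p_{\theta_{i-1}}\!\left( y_{t_{i-1}+1\,:\,t_i} \,\middle|\, x_{t_{i-1}+1\,:\,t_i} \right)
\]
```

A shorter online codelength means the representations make the labels easier to transmit, which is the same compression criterion RDA applies to dataset capabilities.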
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.