Rissanen Data Analysis: Examining Dataset Characteristics via
Description Length
- URL: http://arxiv.org/abs/2103.03872v1
- Date: Fri, 5 Mar 2021 18:58:32 GMT
- Title: Rissanen Data Analysis: Examining Dataset Characteristics via
Description Length
- Authors: Ethan Perez, Douwe Kiela, Kyunghyun Cho
- Abstract summary: We introduce a method to determine if a certain capability helps to achieve an accurate model of given data.
Since minimum program length is uncomputable, we estimate the labels' minimum description length (MDL) as a proxy.
We call the method Rissanen Data Analysis (RDA) after the father of MDL.
- Score: 78.42578316883271
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a method to determine if a certain capability helps to achieve
an accurate model of given data. We view labels as being generated from the
inputs by a program composed of subroutines with different capabilities, and we
posit that a subroutine is useful if and only if the minimal program that
invokes it is shorter than the one that does not. Since minimum program length
is uncomputable, we instead estimate the labels' minimum description length
(MDL) as a proxy, giving us a theoretically-grounded method for analyzing
dataset characteristics. We call the method Rissanen Data Analysis (RDA) after
the father of MDL, and we showcase its applicability on a wide variety of
settings in NLP, ranging from evaluating the utility of generating subquestions
before answering a question, to analyzing the value of rationales and
explanations, to investigating the importance of different parts of speech, and to
uncovering dataset gender bias.
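
The core quantity behind RDA is a prequential (online-code) estimate of the labels' description length: the data are fed to a learner in blocks, and each block's labels are charged the code cost they incur under a model fit only on the earlier blocks. The sketch below illustrates this estimate and a with/without-capability comparison; the logistic-regression model, block schedule, and synthetic data are assumptions for illustration, not the authors' setup.

```python
# A minimal sketch of prequential (online-code) MDL estimation in the spirit of RDA:
# labels are "transmitted" block by block, and each block is charged the code cost
# it incurs under a model trained only on the preceding blocks. The model choice,
# block schedule, and toy data below are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

def prequential_codelength(X, y, n_blocks=8, n_classes=2, seed=0):
    """Total bits to transmit y given X with an online (prequential) code."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(y))
    X, y = X[order], y[order]
    bounds = np.linspace(0, len(y), n_blocks + 1).astype(int)
    total_bits = bounds[1] * np.log2(n_classes)            # first block: uniform code
    for i in range(1, n_blocks):
        train, test = slice(0, bounds[i]), slice(bounds[i], bounds[i + 1])
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        probs = model.predict_proba(X[test])
        cols = np.searchsorted(model.classes_, y[test])     # align labels with proba columns
        total_bits += -np.log2(probs[np.arange(len(cols)), cols] + 1e-12).sum()
    return total_bits

# Toy comparison: does adding a hypothetical "capability" feature shorten the
# description length of the labels? (The feature and data are synthetic.)
rng = np.random.RandomState(0)
n = 2000
signal = rng.randn(n)                                       # latent variable behind the label
X_base = rng.randn(n, 5)                                    # baseline inputs: uninformative
capability = (signal + 0.3 * rng.randn(n)).reshape(-1, 1)   # noisy view of the latent signal
y = (signal > 0).astype(int)

bits_without = prequential_codelength(X_base, y)
bits_with = prequential_codelength(np.hstack([X_base, capability]), y)
print(f"MDL without capability: {bits_without:.0f} bits; with capability: {bits_with:.0f} bits")
```

In RDA terms, the capability is judged useful when the codelength with it is reliably lower than without, mirroring the shorter-program criterion stated in the abstract.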
Related papers
- Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data; a minimal number of available labeled data points are then assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- LLVM Static Analysis for Program Characterization and Memory Reuse Profile Estimation [0.0]
This paper presents an LLVM-based probabilistic static analysis method.
It accurately predicts different program characteristics and estimates the reuse distance profile of a program.
The results show that our approach predicts application characteristics accurately when compared against Byfl, another LLVM-based dynamic code analysis tool.
arXiv Detail & Related papers (2023-11-20T23:05:06Z)
- Probing for Labeled Dependency Trees [25.723591566201343]
DepProbe is a linear probe which can extract labeled and directed dependency parse trees from embeddings.
Across 13 languages, our proposed method identifies the best source treebank of the time.
arXiv Detail & Related papers (2022-03-24T10:21:07Z)
- Parallel feature selection based on the trace ratio criterion [4.30274561163157]
This work presents a novel parallel feature selection approach for classification, namely Parallel Feature Selection using Trace criterion (PFST).
Our method uses the trace criterion, a measure of class separability used in Fisher's Discriminant Analysis, to evaluate feature usefulness.
The experiments show that our method can produce a small set of features in a fraction of the time taken by the other methods under comparison.
arXiv Detail & Related papers (2022-03-03T10:50:33Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We conduct a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Reducing Confusion in Active Learning for Part-Of-Speech Tagging [100.08742107682264]
Active learning (AL) uses a data selection algorithm to select useful training samples to minimize annotation cost.
We study the problem of selecting instances which maximally reduce the confusion between particular pairs of output tags.
Our proposed AL strategy outperforms other AL strategies by a significant margin.
arXiv Detail & Related papers (2020-11-02T06:24:58Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- Subtask Analysis of Process Data Through a Predictive Model [5.7668512557707166]
This paper develops a computationally efficient method for exploratory analysis of such process data.
The new approach segments a lengthy individual process into a sequence of short subprocesses to achieve complexity reduction.
We use the process data from PIAAC 2012 to demonstrate how exploratory analysis of process data can be done with the new approach.
arXiv Detail & Related papers (2020-08-29T21:11:01Z)
- Information-Theoretic Probing with Minimum Description Length [74.29846942213445]
We propose an alternative to the standard probes: information-theoretic probing with minimum description length (MDL).
With MDL probing, training a probe to predict labels is recast as teaching it to effectively transmit the data.
We show that these methods agree in results and are more informative and stable than the standard probes.
arXiv Detail & Related papers (2020-03-27T09:35:38Z)
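
For reference, the online (prequential) codelength used in MDL probing, and estimated empirically in the sketch after the RDA abstract above, can be stated compactly. The block boundaries t_1 < ... < t_S = N, the class count K, and the probe p_θ are assumed notation for this sketch, following the standard online-code form:

```latex
% Online codelength of labels y_{1:N} given representations x_{1:N}:
% the first t_1 labels are sent with a uniform code over K classes, and
% block i is encoded with a probe p_{\theta_{i-1}} trained on the first
% t_{i-1} examples.
\[
L_{\mathrm{online}}(y_{1:N} \mid x_{1:N})
  = t_1 \log_2 K
  \;-\; \sum_{i=2}^{S} \log_2 p_{\theta_{i-1}}\!\left( y_{t_{i-1}+1\,:\,t_i} \,\middle|\, x_{t_{i-1}+1\,:\,t_i} \right)
\]
```

A shorter online codelength means the representations make the labels easier to transmit, which is the same compression criterion RDA applies to dataset capabilities.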
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.