Rissanen Data Analysis: Examining Dataset Characteristics via
Description Length
- URL: http://arxiv.org/abs/2103.03872v1
- Date: Fri, 5 Mar 2021 18:58:32 GMT
- Title: Rissanen Data Analysis: Examining Dataset Characteristics via
Description Length
- Authors: Ethan Perez, Douwe Kiela, Kyunghyun Cho
- Abstract summary: We introduce a method to determine if a certain capability helps to achieve an accurate model of given data.
Since minimum program length is uncomputable, we estimate the labels' minimum description length (MDL) as a proxy.
We call the method Rissanen Data Analysis (RDA) after Jorma Rissanen, the father of MDL.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a method to determine if a certain capability helps to achieve
an accurate model of given data. We view labels as being generated from the
inputs by a program composed of subroutines with different capabilities, and we
posit that a subroutine is useful if and only if the minimal program that
invokes it is shorter than the one that does not. Since minimum program length
is uncomputable, we instead estimate the labels' minimum description length
(MDL) as a proxy, giving us a theoretically-grounded method for analyzing
dataset characteristics. We call the method Rissanen Data Analysis (RDA) after
the father of MDL, and we showcase its applicability on a wide variety of
settings in NLP, ranging from evaluating the utility of generating subquestions
before answering a question, to analyzing the value of rationales and
explanations, to investigating the importance of different parts of speech, and
uncovering dataset gender bias.
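To make the comparison in the abstract concrete, here is a minimal sketch of one common way to estimate the description length of labels: prequential (online) coding, where each label is encoded with a model fit only on the examples seen before it. The abstract does not say which estimator the authors use, so the choice of prequential coding, the Laplace-smoothed count model, the toy XOR data, and every name below (`prequential_codelength`, `without_capability`, `with_capability`) are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def prequential_codelength(inputs, labels, num_classes):
    """Bits needed to send the labels one at a time, coding each label with a
    Laplace-smoothed count model fit only on the earlier (input, label) pairs."""
    counts = defaultdict(lambda: [0] * num_classes)
    total_bits = 0.0
    for x, y in zip(inputs, labels):
        seen = counts[x]
        # Probability the current model assigns to the true label.
        prob = (seen[y] + 1) / (sum(seen) + num_classes)
        total_bits += -math.log2(prob)
        # Only now does the model get to learn from this example.
        seen[y] += 1
    return total_bits


# Toy dataset (hypothetical): the label is the XOR of two bits.
pairs = [(i % 2, (i // 2) % 2) for i in range(200)]
labels = [a ^ b for a, b in pairs]

# Without the "capability", the model sees only the first bit.
without_capability = [a for a, _ in pairs]
# With it, a hypothetical subroutine also supplies the second bit.
with_capability = pairs

mdl_without = prequential_codelength(without_capability, labels, num_classes=2)
mdl_with = prequential_codelength(with_capability, labels, num_classes=2)

print(f"codelength without capability: {mdl_without:.1f} bits")
print(f"codelength with capability:    {mdl_with:.1f} bits")
```

On this toy data the codelength with the extra input is far smaller, which is the kind of signal the method reads as "the capability helps". A realistic use would replace the count model with a learned model such as a neural network and real dataset inputs.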