Interpretable Multi-dataset Evaluation for Named Entity Recognition
- URL: http://arxiv.org/abs/2011.06854v2
- Date: Wed, 9 Dec 2020 04:53:07 GMT
- Title: Interpretable Multi-dataset Evaluation for Named Entity Recognition
- Authors: Jinlan Fu, Pengfei Liu, Graham Neubig
- Abstract summary: We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
- Score: 110.64368106131062
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the proliferation of models for natural language processing tasks, it is
even harder to understand the differences between models and their relative
merits. Simply looking at differences between holistic metrics such as
accuracy, BLEU, or F1 does not tell us why or how particular methods perform
differently and how diverse datasets influence the model design choices. In
this paper, we present a general methodology for interpretable evaluation for
the named entity recognition (NER) task. The proposed evaluation method enables
us to interpret the differences in models and datasets, as well as the
interplay between them, identifying the strengths and weaknesses of current
systems. By making our analysis tool available, we make it easy for future
researchers to run similar analyses and drive progress in this area:
https://github.com/neulab/InterpretEval.
Related papers
- Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
- Revisiting Demonstration Selection Strategies in In-Context Learning [66.11652803887284]
Large language models (LLMs) have shown an impressive ability to perform a wide range of tasks using in-context learning (ICL).
In this work, we first revisit the factors contributing to this variance from both data and model aspects, and find that the choice of demonstration is both data- and model-dependent.
We propose a data- and model-dependent demonstration selection method, TopK + ConE, based on the assumption that the performance of a demonstration positively correlates with its contribution to the model's understanding of the test samples.
arXiv Detail & Related papers (2024-01-22T16:25:27Z)
- Interpretable Differencing of Machine Learning Models [20.99877540751412]
We formalize the problem of model differencing as one of predicting a dissimilarity function of two ML models' outputs.
A Joint Surrogate Tree (JST) is composed of two conjoined decision tree surrogates for the two models.
A JST provides an intuitive representation of differences and places the changes in the context of the models' decision logic.
arXiv Detail & Related papers (2023-06-10T16:15:55Z)
- An Additive Instance-Wise Approach to Multi-class Model Interpretation [53.87578024052922]
Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system.
Existing methods mainly focus on selecting explanatory input features, which follow either locally additive or instance-wise approaches.
This work exploits the strengths of both methods and proposes a global framework for learning local explanations simultaneously for multiple target classes.
arXiv Detail & Related papers (2022-07-07T06:50:27Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- IMACS: Image Model Attribution Comparison Summaries [16.80986701058596]
We introduce IMACS, a method that combines gradient-based model attributions with aggregation and visualization techniques.
IMACS extracts salient input features from an evaluation dataset, clusters them based on similarity, then visualizes differences in model attributions for similar input features.
We show how our technique can uncover behavioral differences caused by domain shift between two models trained on satellite images.
arXiv Detail & Related papers (2022-01-26T21:35:14Z)
- Multivariate Data Explanation by Jumping Emerging Patterns Visualization [78.6363825307044]
We present VAX (multiVariate dAta eXplanation), a new VA method to support the identification and visual interpretation of patterns in multivariate data sets.
Unlike existing similar approaches, VAX uses the concept of Jumping Emerging Patterns to identify and aggregate several diversified patterns, producing explanations through logic combinations of data variables.
arXiv Detail & Related papers (2021-06-21T13:49:44Z)
- Triplot: model agnostic measures and visualisations for variable importance in predictive models that take into account the hierarchical correlation structure [3.0036519884678894]
We propose new methods to support model analysis by exploiting the information about the correlation between variables.
We show how to analyze groups of variables (aspects) both when they are proposed by the user and when they should be determined automatically.
We also present a new type of model visualisation, the triplot, which exploits a hierarchical structure of variable grouping to produce a high-information-density model visualisation.
arXiv Detail & Related papers (2021-04-07T21:29:03Z)
- Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations [67.4375210552593]
We design experiments to understand an important but often ignored problem in visually grounded language generation.
Given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance?
We show that it is of paramount importance to report variance in experiments; that human-generated references could vary drastically in different datasets/tasks, revealing the nature of each task.
arXiv Detail & Related papers (2020-10-07T20:45:14Z)
- On the Ambiguity of Rank-Based Evaluation of Entity Alignment or Link Prediction Methods [27.27230441498167]
We take a closer look at the evaluation of two families of methods for enriching information from knowledge graphs: Link Prediction and Entity Alignment.
In particular, we demonstrate that none of the existing scores can reliably be used to compare results across different datasets.
We show that this leads to various problems in the interpretation of results, which may support misleading conclusions.
arXiv Detail & Related papers (2020-02-17T12:26:14Z)