Measuring the Measuring Tools: An Automatic Evaluation of Semantic
Metrics for Text Corpora
- URL: http://arxiv.org/abs/2211.16259v1
- Date: Tue, 29 Nov 2022 14:47:07 GMT
- Title: Measuring the Measuring Tools: An Automatic Evaluation of Semantic
Metrics for Text Corpora
- Authors: George Kour, Samuel Ackerman, Orna Raz, Eitan Farchi, Boaz Carmeli,
Ateret Anaby-Tavor
- Abstract summary: The ability to compare the semantic similarity between text corpora is important in a variety of natural language processing applications.
We propose a set of automatic and interpretable measures for assessing the characteristics of corpus-level semantic similarity metrics.
- Score: 5.254054636427663
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to compare the semantic similarity between text corpora is
important in a variety of natural language processing applications. However,
standard methods for evaluating these metrics have yet to be established. We
propose a set of automatic and interpretable measures for assessing the
characteristics of corpus-level semantic similarity metrics, allowing sensible
comparison of their behavior. We demonstrate the effectiveness of our
evaluation measures in capturing fundamental characteristics by evaluating them
on a collection of classical and state-of-the-art metrics. Our measures
revealed that recently developed metrics are becoming better at identifying
semantic distributional mismatch, while classical metrics are more sensitive to
perturbations at the surface-text level.
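The abstract does not spell out the individual evaluation measures, but the core idea of probing how a corpus-level similarity metric reacts to controlled perturbations can be sketched as follows. Everything in the sketch is a hypothetical placeholder (the word-dropout perturbation, the perturbation rates, and the toy vocabulary-overlap metric), not the measures proposed in the paper.

```python
import random
from typing import Callable, List

Corpus = List[str]

def drop_words(corpus: Corpus, rate: float, seed: int = 0) -> Corpus:
    """Perturb a corpus by randomly dropping a fraction of the words in each document."""
    rng = random.Random(seed)
    out = []
    for doc in corpus:
        kept = [w for w in doc.split() if rng.random() >= rate]
        out.append(" ".join(kept) if kept else doc)
    return out

def perturbation_response(metric: Callable[[Corpus, Corpus], float],
                          corpus: Corpus,
                          rates=(0.0, 0.1, 0.3, 0.5)) -> dict:
    """Record how a corpus-level distance metric reacts as the perturbation rate grows."""
    return {r: metric(corpus, drop_words(corpus, r)) for r in rates}

def vocab_jaccard_distance(a: Corpus, b: Corpus) -> float:
    """Toy surface-level metric: Jaccard distance between the two corpus vocabularies."""
    va = {w for doc in a for w in doc.split()}
    vb = {w for doc in b for w in doc.split()}
    return 1.0 - len(va & vb) / len(va | vb)

if __name__ == "__main__":
    corpus = ["the cat sat on the mat",
              "evaluating semantic metrics for text corpora is hard"]
    print(perturbation_response(vocab_jaccard_distance, corpus))
```

A metric that tracks surface form would be expected to climb steeply with the perturbation rate, while a metric capturing distributional semantics should respond more gradually; the paper's measures formalize comparisons of this kind.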
Related papers
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
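That comparison can be illustrated with a small, entirely invented example (not the paper's actual analysis): correlate per-system scores of several automatic metrics with each other and with a human rating column, then compare the two averages. All scores below are made-up numbers.

```python
import numpy as np
from itertools import combinations

# Hypothetical per-system scores for three automatic metrics and one human rating.
scores = {
    "metric_a": np.array([0.61, 0.55, 0.70, 0.48, 0.66]),
    "metric_b": np.array([0.58, 0.52, 0.73, 0.45, 0.69]),
    "metric_c": np.array([0.64, 0.57, 0.68, 0.50, 0.65]),
    "human":    np.array([0.40, 0.62, 0.55, 0.35, 0.71]),
}

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

auto = [k for k in scores if k != "human"]
metric_metric = [pearson(scores[a], scores[b]) for a, b in combinations(auto, 2)]
metric_human = [pearson(scores[a], scores["human"]) for a in auto]

print("mean metric-metric correlation:", np.mean(metric_metric))
print("mean metric-human  correlation:", np.mean(metric_human))
```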
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- MENLI: Robust Evaluation Metrics from Natural Language Inference [26.53850343633923]
Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks.
We develop evaluation metrics based on Natural Language Inference (NLI).
We show that our NLI-based metrics are much more robust to the attacks than the recent BERT-based metrics.
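The summary above does not reproduce MENLI's exact formulation, so the following is only a generic sketch of turning a pretrained NLI model into a reference-based score. The model choice (roberta-large-mnli), the entailment direction, and the use of the raw entailment probability are assumptions, not the paper's recipe.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # assumed off-the-shelf NLI model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

# Locate the entailment class from the model config instead of hardcoding an index.
ENTAIL_IDX = next(i for i, lbl in model.config.id2label.items()
                  if "entail" in lbl.lower())

def nli_score(reference: str, candidate: str) -> float:
    """P(reference entails candidate), used as a rough reference-based quality signal."""
    inputs = tokenizer(reference, candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return float(probs[0, ENTAIL_IDX])

print(nli_score("The cat sat on the mat.", "A cat is sitting on a mat."))
```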
arXiv Detail & Related papers (2022-08-15T16:30:14Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate the limitations of token-level matching.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that, in system-level correlation, our proposed metric with a model-based matching function outperforms all competing metrics.
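SMART's actual sentence matching function is not given in this summary; the sketch below substitutes a plain token-overlap F1 as the sentence similarity and aggregates best matches into soft precision and recall, purely to illustrate treating sentences rather than tokens as the unit of matching.

```python
import re

def split_sentences(text: str):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def token_f1(a: str, b: str) -> float:
    """Stand-in sentence similarity: F1 over the sets of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    overlap = len(ta & tb)
    if not ta or not tb or overlap == 0:
        return 0.0
    p, r = overlap / len(ta), overlap / len(tb)
    return 2 * p * r / (p + r)

def soft_sentence_match(candidate: str, reference: str) -> float:
    """Score each sentence against its best counterpart, then combine precision and recall."""
    cand, ref = split_sentences(candidate), split_sentences(reference)
    if not cand or not ref:
        return 0.0
    precision = sum(max(token_f1(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(token_f1(r, c) for c in cand) for r in ref) / len(ref)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(soft_sentence_match(
    "A man rides a bike. It is raining.",
    "It is raining heavily. A man is riding his bicycle."))
```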
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations [74.70957445600936]
Multiple metrics have been introduced to measure fairness in various natural language processing tasks.
These metrics can be roughly grouped into two categories: 1) extrinsic metrics for evaluating fairness in downstream applications and 2) intrinsic metrics for estimating fairness in upstream language representation models.
arXiv Detail & Related papers (2022-03-25T22:17:43Z)
- Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
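The exact definition of ACCUMULATED PREDICTION SENSITIVITY (including how per-feature sensitivities are weighted) is not given here; the following is a minimal finite-difference sketch of the underlying idea with a made-up logistic model, not the paper's estimator.

```python
import numpy as np

def prediction_sensitivity(predict, x, eps: float = 1e-3, weights=None) -> float:
    """Accumulate how much the prediction moves when each input feature is nudged.

    `weights` lets protected or otherwise important features count more; uniform by default.
    This is a finite-difference sketch, not the paper's exact estimator.
    """
    x = np.asarray(x, dtype=float)
    if weights is None:
        weights = np.ones_like(x) / x.size
    sens = np.empty_like(x)
    for i in range(x.size):
        x_plus = x.copy()
        x_plus[i] += eps
        sens[i] = abs(predict(x_plus) - predict(x)) / eps
    return float(np.dot(weights, sens))

# Toy model: logistic score over three features (illustration only).
w = np.array([2.0, -1.0, 0.5])
predict = lambda x: 1.0 / (1.0 + np.exp(-np.dot(w, x)))

print(prediction_sensitivity(predict, [0.3, 1.2, -0.7]))
```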
arXiv Detail & Related papers (2022-03-16T15:00:33Z)
- SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis [20.026835809227283]
We introduce "typicality", a new formulation of evaluation rooted in information theory.
We show how these decomposed dimensions of semantics and fluency provide greater system-level insight into captioner differences.
Our proposed metrics along with their combination, SMURF, achieve state-of-the-art correlation with human judgment when compared with other rule-based evaluation metrics.
arXiv Detail & Related papers (2021-06-02T19:58:20Z)
- LCEval: Learned Composite Metric for Caption Evaluation [37.2313913156926]
We propose a neural network-based learned metric to improve caption-level caption evaluation.
This paper investigates the relationship between different linguistic features and the caption-level correlation of the learned metrics.
Our proposed metric not only outperforms the existing metrics in terms of caption-level correlation, but also shows strong system-level correlation with human assessments.
arXiv Detail & Related papers (2020-12-24T06:38:24Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
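The thresholding procedure itself is not described in this summary; one hedged reading of the idea is to find the smallest metric gain at which human raters agree with the metric at some target rate. The system-pair deltas, human preference labels, and target rate below are invented for illustration.

```python
import numpy as np

# Hypothetical data: metric score difference (system B minus system A) for several
# system pairs, and whether human raters actually preferred system B.
metric_delta = np.array([0.1, 0.4, 0.8, 1.5, 2.3, 3.0, 4.2, 5.1])
human_prefers_b = np.array([0, 0, 1, 0, 1, 1, 1, 1], dtype=bool)

def smallest_reliable_delta(deltas, human, target=0.9):
    """Smallest metric improvement above which human agreement reaches the target rate."""
    for d in np.sort(deltas):
        mask = deltas >= d
        if human[mask].mean() >= target:
            return float(d)
    return None

print(smallest_reliable_delta(metric_delta, human_prefers_b))
```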
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.