Calibrated Interpretation: Confidence Estimation in Semantic Parsing
- URL: http://arxiv.org/abs/2211.07443v6
- Date: Thu, 6 Jul 2023 22:14:43 GMT
- Title: Calibrated Interpretation: Confidence Estimation in Semantic Parsing
- Authors: Elias Stengel-Eskin and Benjamin Van Durme
- Abstract summary: We investigate the calibration of popular generation models across four popular semantic parsing datasets.
We analyze factors associated with calibration error and release new confidence-based challenge splits of two parsing datasets.
- Score: 37.28245521206576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sequence generation models are increasingly being used to translate natural
language into programs, i.e. to perform executable semantic parsing. The fact
that semantic parsing aims to predict programs that can lead to executed
actions in the real world motivates developing safe systems. This in turn makes
measuring calibration -- a central component to safety -- particularly
important. We investigate the calibration of popular generation models across
four popular semantic parsing datasets, finding that it varies across models
and datasets. We then analyze factors associated with calibration error and
release new confidence-based challenge splits of two parsing datasets. To
facilitate the inclusion of calibration in semantic parsing evaluations, we
release a library for computing calibration metrics.
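The released library itself is not reproduced here, but a minimal sketch of expected calibration error (ECE), the standard metric in this space, illustrates the kind of computation such a library performs. Function and variable names below are illustrative assumptions, not the library's API.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Minimal ECE sketch: bin predictions by confidence and average
    the |accuracy - confidence| gap, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # include the right edge only for the final bin
        mask = (confidences >= lo) & ((confidences < hi) | (hi == 1.0))
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Example: sequence-level confidences vs. exact-match correctness
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 0]))
```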
Related papers
- A comprehensive review of classifier probability calibration metrics [0.0]
Probabilities or confidence values produced by AI and ML models often do not reflect their true accuracy.
Probability calibration metrics measure the discrepancy between confidence and accuracy.
arXiv Detail & Related papers (2025-04-25T11:44:44Z)
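For a concrete instance of the metrics the review above covers, the Brier score measures the confidence/accuracy discrepancy as the mean squared gap between predicted confidence and a 0/1 correctness outcome. This is a generic sketch, not code from the review.

```python
import numpy as np

def brier_score(confidences, correct):
    """Mean squared gap between predicted confidence and 0/1 outcome.
    0.0 is perfect; 0.25 matches an uninformative constant 0.5."""
    c = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)
    return float(np.mean((c - y) ** 2))

print(brier_score([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 0]))
```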
- Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation [26.580361841501514]
Vision-language models (VLMs) excel in various multimodal tasks but frequently suffer from poor calibration.
This miscalibration undermines user trust, especially when models confidently provide incorrect or fabricated information.
We propose a novel Confidence through Semantic Perturbation (CSP) framework to improve the calibration of verbalized confidence for object-centric queries.
arXiv Detail & Related papers (2025-04-21T04:01:22Z)
- Calibrating Long-form Generations from Large Language Models [34.72041258464477]
Large Language Models' (LLMs) confidence scores should align with the actual likelihood of their responses being correct.
Current confidence elicitation methods and calibration metrics rely on a binary true/false assessment of response correctness.
We introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores.
arXiv Detail & Related papers (2024-02-09T17:00:32Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Beyond Probability Partitions: Calibrating Neural Networks with Semantic Aware Grouping [45.09248880938502]
Research has shown that deep networks tend to be overly optimistic about their predictions, leading to an underestimation of prediction errors.
We propose a more generalized definition of calibration error called Partitioned Calibration Error (PCE).
We show that the relationship between model accuracy and calibration depends on the granularity of the partitioning function.
arXiv Detail & Related papers (2023-06-08T07:16:03Z)
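A hedged sketch of the partitioned idea behind PCE: measure the confidence/accuracy gap within each group of a partition and take a size-weighted average. The paper learns semantic-aware groupings; the grouping argument here is a stand-in supplied by the caller, and the names are illustrative.

```python
import numpy as np

def partitioned_calibration_error(confidences, correct, groups):
    """Size-weighted average of the |accuracy - confidence| gap within
    each partition. Standard ECE is the special case where the
    partitions are confidence bins."""
    c = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)
    g = np.asarray(groups)
    pce = 0.0
    for label in np.unique(g):
        mask = g == label
        pce += mask.mean() * abs(y[mask].mean() - c[mask].mean())
    return pce

# Two hypothetical semantic groups instead of confidence bins
print(partitioned_calibration_error(
    [0.9, 0.8, 0.6, 0.95], [1, 1, 0, 0], ["a", "a", "b", "b"]))
```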
- Calibration of Neural Networks [77.34726150561087]
This paper presents a survey of confidence calibration problems in the context of neural networks.
We analyze the problem statement, calibration definitions, and different approaches to evaluation.
Empirical experiments cover various datasets and models, comparing calibration methods according to different criteria.
arXiv Detail & Related papers (2023-03-19T20:27:51Z)
- On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration.
Model capacity, crop size, multi-scale testing, and prediction correctness all have an impact on calibration.
We propose a simple, unifying, and effective approach, namely selective scaling.
arXiv Detail & Related papers (2022-12-22T22:05:16Z)
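Selective scaling above builds on the standard temperature-scaling recipe. Below is a minimal sketch of plain temperature scaling fit on held-out logits; the selective variant, which applies scaling only to certain predictions, is not reproduced here, and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Find T > 0 minimizing the NLL of softmax(logits / T) on
    held-out data; a single scalar that leaves the argmax unchanged."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)

    def nll(t):
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

# Overconfident logits: the fitted T > 1 softens the probabilities
T = fit_temperature([[4.0, 0.0], [3.5, 0.0], [5.0, 0.0]], [0, 0, 1])
print(T)
```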
- Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application [63.10266319378212]
We propose a method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT).
We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers.
arXiv Detail & Related papers (2020-09-22T02:15:05Z)
- Calibrated neighborhood aware confidence measure for deep metric learning [0.0]
Deep metric learning has been successfully applied to problems in few-shot learning, image retrieval, and open-set classification.
Measuring the confidence of a deep metric learning model and identifying unreliable predictions is still an open challenge.
This paper focuses on defining a calibrated and interpretable confidence metric that closely reflects the model's classification accuracy.
arXiv Detail & Related papers (2020-06-08T21:05:38Z)
- Multivariate Confidence Calibration for Object Detection [7.16879432974126]
We present a novel framework to measure and calibrate biased confidence estimates of object detection methods.
Our approach makes it possible, for the first time, to obtain calibrated confidence estimates with respect to image location and box scale.
We show that our developed methods outperform state-of-the-art calibration models for the task of object detection.
arXiv Detail & Related papers (2020-04-28T14:17:41Z)
- Calibrating Structured Output Predictors for Natural Language Processing [8.361023354729731]
We propose a general calibration scheme for output entities of interest in neural-network based structured prediction models.
Our proposed method can be used with any binary class calibration scheme and a neural network model.
We show that our method outperforms current calibration techniques for named entity recognition, part-of-speech tagging, and question answering.
arXiv Detail & Related papers (2020-04-09T04:14:46Z)
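One example of a "binary class calibration scheme" that such a method could plug in is Platt scaling: a one-feature logistic regression from a raw confidence score to the probability of correctness. This is a generic scikit-learn sketch with made-up data, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Raw confidence scores for predicted entities, and whether each was correct
scores = np.array([0.2, 0.4, 0.6, 0.8, 0.9, 0.95]).reshape(-1, 1)
correct = np.array([0, 0, 1, 1, 1, 1])

# Platt scaling: logistic regression remaps raw scores
# to calibrated probabilities of correctness.
platt = LogisticRegression().fit(scores, correct)
print(platt.predict_proba(np.array([[0.7]]))[:, 1])
```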
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
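The transductive update described above can be sketched as a confidence-weighted mean over query embeddings. This toy version uses fixed confidence weights where the paper meta-learns them, and all names are illustrative.

```python
import numpy as np

def refine_prototype(prototype, query_embeddings, confidences):
    """Update a class prototype with a confidence-weighted mean of
    query embeddings; confidences act as soft pseudo-labels."""
    q = np.asarray(query_embeddings, dtype=float)
    w = np.asarray(confidences, dtype=float)
    weighted = (w[:, None] * q).sum(axis=0)
    return (prototype + weighted) / (1.0 + w.sum())

proto = np.array([1.0, 0.0])
queries = np.array([[0.9, 0.1], [0.8, 0.3]])
print(refine_prototype(proto, queries, confidences=[0.95, 0.60]))
```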