Granular Change Accuracy: A More Accurate Performance Metric for Dialogue State Tracking
- URL: http://arxiv.org/abs/2403.11123v1
- Date: Sun, 17 Mar 2024 07:07:44 GMT
- Title: Granular Change Accuracy: A More Accurate Performance Metric for Dialogue State Tracking
- Authors: Taha Aksu, Nancy F. Chen
- Abstract summary: We introduce a novel metric: Granular Change Accuracy (GCA).
GCA focuses on evaluating the predicted changes in dialogue state over the entire dialogue history.
We find that GCA effectively reduces biases arising from distribution uniformity and the positioning of errors.
- Score: 34.392893920243644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current metrics for evaluating Dialogue State Tracking (DST) systems exhibit three primary limitations. They: i) erroneously presume a uniform distribution of slots throughout the dialogue, ii) neglect to assign partial scores for individual turns, and iii) frequently overestimate or underestimate performance by repeatedly counting the models' successful or failed predictions. To address these shortcomings, we introduce a novel metric: Granular Change Accuracy (GCA). GCA focuses on evaluating the predicted changes in dialogue state over the entire dialogue history. Benchmarking reveals that GCA effectively reduces biases arising from distribution uniformity and the positioning of errors across turns, resulting in a more precise evaluation. Notably, we find that these biases are particularly pronounced when evaluating few-shot or zero-shot trained models, becoming even more evident as the model's error rate increases. Hence, GCA offers significant promise, particularly for assessing models trained with limited resources. Our GCA implementation is a useful addition to the pool of DST metrics.
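The abstract describes GCA's behaviour (score state changes once, per turn, over the whole dialogue history) but not its exact formula. The Python sketch below is a minimal, hedged reading of that change-based idea, not the authors' implementation: the helper names (`state_changes`, `granular_change_score`), the `<deleted>` sentinel, and the final averaging of change-level recall and precision are illustrative assumptions.
```python
from typing import Dict, List

DELETED = "<deleted>"  # sentinel marking a slot removed at this turn


def state_changes(prev: Dict[str, str], curr: Dict[str, str]) -> Dict[str, str]:
    """Slot-value pairs added, updated, or deleted between two turn states."""
    changes = {slot: value for slot, value in curr.items() if prev.get(slot) != value}
    changes.update({slot: DELETED for slot in prev if slot not in curr})
    return changes


def granular_change_score(gold_states: List[Dict[str, str]],
                          pred_states: List[Dict[str, str]]) -> float:
    """Balance recall and precision over per-turn state *changes*.

    Each slot change is scored once, at the turn where it occurs, so a value
    that persists over many turns is not counted again and again.
    """
    correct = missed = spurious = 0
    prev_gold: Dict[str, str] = {}
    prev_pred: Dict[str, str] = {}
    for gold, pred in zip(gold_states, pred_states):
        gold_delta = state_changes(prev_gold, gold)
        pred_delta = state_changes(prev_pred, pred)
        for slot, value in gold_delta.items():
            if pred_delta.get(slot) == value:
                correct += 1   # the predicted change matches the gold change
            else:
                missed += 1    # gold changed this slot, prediction did not match
        spurious += sum(1 for slot, value in pred_delta.items()
                        if gold_delta.get(slot) != value)
        prev_gold, prev_pred = gold, pred
    recall = correct / (correct + missed) if correct + missed else 1.0
    precision = correct / (correct + spurious) if correct + spurious else 1.0
    return (recall + precision) / 2
```
Under this sketch, a value the model predicts two turns late is penalised once as a missed change and once as a spurious one, rather than at every turn it is absent, which is the repeated-counting behaviour the abstract criticises in existing metrics.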
Related papers
- CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models [29.95434387343843]
We propose a unified framework that mitigates length bias through three components.
CoLD consistently reduces reward-length correlation, improves accuracy in step selection, and encourages more concise, logically valid reasoning.
arXiv Detail & Related papers (2025-07-21T15:07:59Z) - Reasoning Models Can be Easily Hacked by Fake Reasoning Bias [59.79548223686273]
We introduce THEATER, a comprehensive benchmark to evaluate Reasoning Theater Bias (RTB).
We investigate six bias types, including Simple Cues and Fake Chain-of-Thought.
We identify 'shallow reasoning' (plausible but flawed arguments) as the most potent form of RTB.
arXiv Detail & Related papers (2025-07-18T09:06:10Z) - Rethinking Metrics and Benchmarks of Video Anomaly Detection [12.500876355560184]
Video Anomaly Detection (VAD) aims to detect anomalies that deviate from expectation.
In this paper, we rethink VAD evaluation protocols through comprehensive experimental analyses.
We propose three novel evaluation methods to address these limitations.
arXiv Detail & Related papers (2025-05-25T08:09:42Z) - Bug Destiny Prediction in Large Open-Source Software Repositories through Sentiment Analysis and BERT Topic Modeling [3.481985817302898]
We leverage features available before a bug is resolved to enhance predictive accuracy.
Our methodology incorporates sentiment analysis to derive both an emotionality score and a sentiment classification.
Results demonstrate that sentiment analysis serves as a valuable predictor of a bug's eventual outcome.
arXiv Detail & Related papers (2025-04-22T15:18:14Z) - Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z) - TeLeS: Temporal Lexeme Similarity Score to Estimate Confidence in
End-to-End ASR [1.8477401359673709]
Class-probability-based confidence scores do not accurately represent the quality of overconfident ASR predictions.
We propose a novel Temporal-Lexeme Similarity (TeLeS) confidence score to train a Confidence Estimation Model (CEM).
We conduct experiments with ASR models trained in three languages, namely Hindi, Tamil, and Kannada, with varying training data sizes.
arXiv Detail & Related papers (2024-01-06T16:29:13Z) - Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on the stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z) - Bridging Precision and Confidence: A Train-Time Loss for Calibrating
Object Detection [58.789823426981044]
We propose a novel auxiliary loss formulation that aims to align the class confidence of bounding boxes with the accuracy of predictions.
Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error in both in-domain and out-of-domain scenarios.
arXiv Detail & Related papers (2023-03-25T08:56:21Z) - On the Efficacy of Generalization Error Prediction Scoring Functions [33.24980750651318]
Generalization error predictors (GEPs) aim to predict model performance on unseen distributions by deriving dataset-level error estimates from sample-level scores.
We rigorously study the effectiveness of popular scoring functions (confidence, local manifold smoothness, model agreement) independent of mechanism choice.
arXiv Detail & Related papers (2023-03-23T18:08:44Z) - Mismatch between Multi-turn Dialogue and its Evaluation Metric in
Dialogue State Tracking [15.54992415806844]
Dialogue state tracking (DST) aims to extract essential information from multi-turn dialogue situations.
We propose relative slot accuracy to complement existing metrics (a minimal sketch appears after this list).
This study also encourages reporting not only joint goal accuracy but also various complementary metrics in DST tasks.
arXiv Detail & Related papers (2022-03-07T04:07:36Z) - Newer is not always better: Rethinking transferability metrics, their
peculiarities, stability and performance [5.650647159993238]
Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular.
We show that the statistical problems with covariance estimation drive the poor performance of H-score.
We propose a correction and recommend measuring correlation performance against relative accuracy in such settings.
arXiv Detail & Related papers (2021-10-13T17:24:12Z) - FairCanary: Rapid Continuous Explainable Fairness [8.362098382773265]
We present Quantile Demographic Drift (QDD), a novel model bias quantification metric.
QDD is ideal for continuous monitoring scenarios and does not suffer from the statistical limitations of conventional threshold-based bias metrics.
We incorporate QDD into a continuous model monitoring system, called FairCanary, that reuses existing explanations computed for each individual prediction.
arXiv Detail & Related papers (2021-06-13T17:47:44Z) - Towards Quantifiable Dialogue Coherence Evaluation [126.55560816209756]
Quantifiable Dialogue Coherence Evaluation (QuantiDCE) is a novel framework aiming to train a quantifiable dialogue coherence metric.
QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning.
Experimental results show that the model trained by QuantiDCE presents stronger correlations with human judgements than the other state-of-the-art metrics.
arXiv Detail & Related papers (2021-06-01T14:11:17Z) - CoCo: Controllable Counterfactuals for Evaluating Dialogue State
Trackers [92.5628632009802]
We propose controllable counterfactuals (CoCo) to bridge the gap and evaluate dialogue state tracking (DST) models on novel scenarios.
CoCo generates novel conversation scenarios in two steps: (i) counterfactual goal generation at turn-level by dropping and adding slots followed by replacing slot values, and (ii) counterfactual conversation generation that is conditioned on (i) and consistent with the dialogue flow.
Human evaluations show that CoCo-generated conversations reflect the underlying user goal with more than 95% accuracy and are as human-like as the original conversations.
arXiv Detail & Related papers (2020-10-24T09:39:35Z)
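As a companion to the GCA sketch above, here is a minimal, hedged sketch of the relative slot accuracy idea from the "Mismatch between Multi-turn Dialogue and its Evaluation Metric in Dialogue State Tracking" entry: per-turn slot accuracy restricted to slots that actually occur in the gold or predicted state, so unmentioned ontology slots do not inflate the score. The function name and the simple turn-level averaging are assumptions for illustration; the cited paper defines the metric precisely.
```python
from typing import Dict, List


def relative_slot_accuracy(gold_states: List[Dict[str, str]],
                           pred_states: List[Dict[str, str]]) -> float:
    """Per-turn slot accuracy over slots that appear in gold or prediction."""
    turn_scores = []
    for gold, pred in zip(gold_states, pred_states):
        active = set(gold) | set(pred)  # only slots mentioned or predicted
        if not active:
            continue  # skip turns with nothing to score
        hits = sum(1 for slot in active if gold.get(slot) == pred.get(slot))
        turn_scores.append(hits / len(active))
    return sum(turn_scores) / len(turn_scores) if turn_scores else 1.0
```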