MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools
- URL: http://arxiv.org/abs/2504.20168v1
- Date: Mon, 28 Apr 2025 18:06:38 GMT
- Title: MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools
- Authors: Nishant Subramani, Jason Eisner, Justin Svegliato, Benjamin Van Durme, Yu Su, Sam Thomson
- Abstract summary: Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions. We propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools.
- Score: 54.63478102768333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logitLens and then computes similarity scores between each layer's generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying risk levels. Our code is open source, available at https://github.com/microsoft/mice_for_cats.
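Below is a minimal sketch of the MICE recipe as the abstract describes it: decode each intermediate layer with the logit lens, score each layer's decoding against the final output, and feed the per-layer scores to a learned probabilistic classifier. The model name, the token-overlap similarity, and the logistic-regression classifier here are illustrative assumptions, not the released implementation (see the linked repository for that).

```python
# Illustrative sketch of the MICE idea (not the authors' code).
# Assumes a HuggingFace Llama-style causal LM; "logit lens" decoding applies the
# final norm + LM head to intermediate hidden states; similarity is a crude
# token-overlap proxy; scikit-learn LogisticRegression stands in for the
# learned probabilistic classifier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model.eval()

def layer_similarity_features(prompt: str, final_output: str) -> list[float]:
    """Decode every intermediate layer via the logit lens and score its
    similarity to the final output; returns one feature per layer."""
    inputs = tok(prompt + final_output, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    final_ids = set(tok(final_output, add_special_tokens=False)["input_ids"])
    feats = []
    for hidden in out.hidden_states[1:]:                     # one tensor per layer
        logits = model.lm_head(model.model.norm(hidden))     # logit lens (Llama-style)
        layer_ids = set(logits.argmax(-1)[0].tolist())
        overlap = len(final_ids & layer_ids) / max(len(final_ids), 1)
        feats.append(overlap)                                # crude similarity proxy
    return feats

# Learned probabilistic classifier: per-layer features -> P(tool call is correct).
# X_train / y_train would come from a labeled set of decoded tool calls.
clf = LogisticRegression()
# clf.fit(X_train, y_train)
# confidence = clf.predict_proba([layer_similarity_features(prompt, output)])[0, 1]
```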
Related papers
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
- SMARTCAL: An Approach to Self-Aware Tool-Use Evaluation and Calibration [24.739131794947838]
We conduct a study on a family of state-of-the-art Large Language Models (LLMs) on three datasets with two mainstream tool-use frameworks. Our study reveals the tool-abuse behavior of LLMs, a tendency for models to misuse tools with overconfidence. We propose a novel approach, SMARTCAL, to mitigate the observed issues.
arXiv Detail & Related papers (2024-12-11T06:09:12Z)
- Large Language Models Must Be Taught to Know What They Don't Know [97.90008709512921]
We show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead.
We also investigate the mechanisms that enable reliable uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators.
arXiv Detail & Related papers (2024-06-12T16:41:31Z)
- Few-Shot Recalibration of Language Models [23.829795148520834]
We train a recalibration model that takes in a few unlabeled examples from any given slice and predicts a curve that remaps confidence scores to be more accurate for that slice.
Our trained model can recalibrate for an arbitrary new slice without using any labeled data from that slice.
Experiments show that our few-shot recalibrator consistently outperforms existing calibration methods.
arXiv Detail & Related papers (2024-03-27T06:25:40Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- On the Costs and Benefits of Adopting Lifelong Learning for Software Analytics -- Empirical Study on Brown Build and Risk Prediction [17.502553991799832]
This paper evaluates the use of lifelong learning (LL) for industrial use cases at Ubisoft.
LL is used to continuously build and maintain ML-based software analytics tools using an incremental learner that progressively updates the old model using new data.
arXiv Detail & Related papers (2023-05-16T21:57:16Z)
- Useful Confidence Measures: Beyond the Max Score [9.189382034558657]
We derive several confidence measures that depend on information beyond the maximum score.
We show that when models are evaluated on out-of-distribution data "out of the box", using only the maximum score to inform the confidence measure is highly suboptimal.
arXiv Detail & Related papers (2022-10-25T14:54:44Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence of each query sample in order to assign optimal weights to unlabeled queries; a minimal sketch of such a confidence-weighted update appears after this list.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
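For the last entry above, the following is a minimal sketch of what a confidence-weighted prototype update can look like in a few-shot episode. The confidence head, soft assignment, and weighting scheme are illustrative assumptions, not the paper's method.

```python
# Illustrative confidence-weighted prototype update for few-shot classification.
# The confidence network and weighting scheme are assumptions for exposition.
import torch
import torch.nn.functional as F

def update_prototypes(prototypes, query_feats, confidence_net):
    """prototypes: [num_classes, dim]; query_feats: [num_query, dim].
    confidence_net maps a query embedding to a scalar confidence logit."""
    # Soft-assign each unlabeled query to prototypes by (negative) distance.
    dists = torch.cdist(query_feats, prototypes)            # [num_query, num_classes]
    assign = F.softmax(-dists, dim=1)                        # soft class assignment
    conf = torch.sigmoid(confidence_net(query_feats))        # [num_query, 1]
    weights = assign * conf                                   # confidence-weighted assignment
    # New prototype: old prototype plus confidence-weighted mean of assigned queries.
    weighted_sum = weights.t() @ query_feats                 # [num_classes, dim]
    norm = weights.sum(dim=0, keepdim=True).t() + 1.0        # +1 keeps the old prototype's vote
    return (prototypes + weighted_sum) / norm

# Example with random tensors and a linear confidence head (all hypothetical):
protos = torch.randn(5, 64)                                  # 5-way episode, 64-d features
queries = torch.randn(30, 64)                                # 30 unlabeled queries
conf_head = torch.nn.Linear(64, 1)
new_protos = update_prototypes(protos, queries, conf_head)   # [5, 64]
```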