Perceived Confidence Scoring for Data Annotation with Zero-Shot LLMs
- URL: http://arxiv.org/abs/2502.07186v1
- Date: Tue, 11 Feb 2025 02:25:44 GMT
- Title: Perceived Confidence Scoring for Data Annotation with Zero-Shot LLMs
- Authors: Sina Salimian, Gias Uddin, Most Husne Jahan, Shaina Raza
- Abstract summary: We introduce Perceived Confidence Scoring (PCS), which evaluates an LLM's confidence in its classification of an input by leveraging Metamorphic Relations (MRs).
PCS significantly improves zero-shot accuracy for Llama-3-8B-Instruct (4.96%) and Mistral-7B-Instruct-v0.3 (10.52%), with Gemma-2-9b-it showing a 9.39% gain.
- Score: 2.4749083496491684
- Abstract: Zero-shot LLMs are now also used for textual classification tasks, e.g., sentiment/emotion detection of a given input such as a sentence or article. However, their performance can be suboptimal in such data annotation tasks. We introduce a novel technique, Perceived Confidence Scoring (PCS), that evaluates an LLM's confidence in its classification of an input by leveraging Metamorphic Relations (MRs). The MRs generate semantically equivalent yet textually mutated versions of the input. Following the principles of Metamorphic Testing (MT), the mutated versions are expected to receive annotation labels similar to those of the original input. By analyzing the consistency of LLM responses across these variations, PCS computes a confidence score based on the frequency of predicted labels. PCS can be used in both single-LLM and multi-LLM settings (e.g., majority voting). We introduce an algorithm, Perceived Differential Evolution (PDE), that determines the optimal weights assigned to the MRs and the LLMs for a classification task. Empirical evaluation shows PCS significantly improves zero-shot accuracy for Llama-3-8B-Instruct (4.96%) and Mistral-7B-Instruct-v0.3 (10.52%), with Gemma-2-9b-it showing a 9.39% gain. When combining all three models, PCS significantly outperforms majority voting by 7.75%.
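The abstract describes the scoring step in enough detail to sketch it. The following is a minimal, illustrative Python sketch, not the authors' implementation: `classify` stands in for a zero-shot LLM call, `mutate_fns` for the MRs, and `weights` for the per-variant values that PDE would tune; all of these names and the exact scoring formula are assumptions.

```python
from collections import defaultdict
from typing import Callable, Optional, Sequence


def pcs_confidence(
    text: str,
    mutate_fns: Sequence[Callable[[str], str]],   # metamorphic relations (MRs)
    classify: Callable[[str], str],               # zero-shot LLM labeler (hypothetical)
    weights: Optional[Sequence[float]] = None,    # one weight per variant (original + MRs)
) -> dict[str, float]:
    """Perceived-confidence score per label for a single input.

    Each MR produces a semantically equivalent mutation of `text`; under
    metamorphic testing the LLM is expected to assign it the same label as
    the original. A label's score is the (weighted) fraction of variants
    that received that label.
    """
    variants = [text] + [mr(text) for mr in mutate_fns]
    if weights is None:
        weights = [1.0] * len(variants)

    scores: dict[str, float] = defaultdict(float)
    total = sum(weights)
    for variant, w in zip(variants, weights):
        scores[classify(variant)] += w / total
    return dict(scores)


def pcs_label(text, mutate_fns, classify, weights=None) -> str:
    """Return the label with the highest perceived confidence."""
    scores = pcs_confidence(text, mutate_fns, classify, weights)
    return max(scores, key=scores.get)
```

In a multi-LLM setting, the per-label scores from each model could be summed with per-LLM weights before taking the argmax, which is the kind of weighted combination the abstract says PDE optimizes; the paper's exact formulation may differ.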
Related papers
- Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference [63.03859517284341]
An automatic evaluation framework aims to rank LLMs based on their alignment with human preferences.
An automatic LLM bencher consists of four components: the input set, the evaluation model, the evaluation type and the aggregation method.
arXiv Detail & Related papers (2024-12-31T17:46:51Z)
- LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.
We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.
In LLM2, the LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z)
- Bayesian Calibration of Win Rate Estimation with LLM Evaluators [20.588104799661014]
We propose two methods to improve the accuracy of win rate estimation using large language models (LLMs) as evaluators.
We empirically validate our methods on six datasets covering story generation, summarization, and instruction following tasks.
arXiv Detail & Related papers (2024-11-07T04:32:40Z)
- Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring [21.7782670140939]
Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments.
While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear.
This paper uncovers the grading rubrics that LLMs use to score students' written responses to science tasks and examines their alignment with human scores.
arXiv Detail & Related papers (2024-07-04T22:26:20Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- Interpreting Learned Feedback Patterns in Large Language Models [11.601799960959214]
We train probes to estimate the feedback signal implicit in the activations of a fine-tuned language model.
We compare these estimates to the true feedback, measuring how accurately the learned feedback patterns (LFPs) reflect the fine-tuning feedback.
We validate our probes by comparing the neural features they correlate with positive feedback inputs against the features GPT-4 describes and classifies as related to LFPs.
arXiv Detail & Related papers (2023-10-12T09:36:03Z)
- Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness [16.35655151252159]
We introduce BSDetector, a method for detecting bad and speculative answers from a pretrained Large Language Model.
Our uncertainty quantification technique works for any LLM accessible only via a black-box API.
arXiv Detail & Related papers (2023-08-30T17:53:25Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)