Harmonic LLMs are Trustworthy
- URL: http://arxiv.org/abs/2404.19708v2
- Date: Thu, 25 Jul 2024 16:16:46 GMT
- Title: Harmonic LLMs are Trustworthy
- Authors: Nicholas S. Kersting, Mohammad Rahman, Suchismitha Vedala, Yang Wang,
- Abstract summary: We introduce an intuitive method to test the robustness of any black-box LLM in real-time via its local deviation from harmoniticity, denoted as $\gamma$.
We measure $\gamma$ in 10 popular LLMs across thousands of queries in three objective domains: WebQA, ProgrammingQA, and TruthfulQA.
- Score: 3.8119386967826294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce an intuitive method to test the robustness (stability and explainability) of any black-box LLM in real-time via its local deviation from harmoniticity, denoted as $\gamma$. To the best of our knowledge this is the first completely model-agnostic and unsupervised method of measuring the robustness of any given response from an LLM, based upon the model itself conforming to a purely mathematical standard. To show general application and immediacy of results, we measure $\gamma$ in 10 popular LLMs (ChatGPT, Claude-2.1, Claude3.0, GPT-4, GPT-4o, Smaug-72B, Mixtral-8x7B, Llama2-7B, Mistral-7B and MPT-7B) across thousands of queries in three objective domains: WebQA, ProgrammingQA, and TruthfulQA. Across all models and domains tested, human annotation confirms that $\gamma \to 0$ indicates trustworthiness, and conversely searching higher values of $\gamma$ easily exposes examples of hallucination, a fact that enables efficient adversarial prompt generation through stochastic gradient ascent in $\gamma$. The low-$\gamma$ leaders among the models in the respective domains are GPT-4o, GPT-4, and Smaug-72B, providing evidence that mid-size open-source models can win out against large commercial models.
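The abstract names $\gamma$ but gives no formula for it. The sketch below illustrates only the mathematical standard it appeals to. Harmonic functions satisfy the mean-value property: the value at a point equals the average over any small circle around it. The sketch measures how far a function departs from that property at one point. The name `mean_value_deviation`, the circle sampling, and the `radius` and `n_points` defaults are illustrative assumptions, not the authors' definition of $\gamma$.

```python
import math
from typing import Callable, Sequence

# Hypothetical input type: any map from a 2-D point to a vector of floats.
Field = Callable[[float, float], Sequence[float]]


def mean_value_deviation(f: Field, x0: float, y0: float,
                         radius: float = 0.1, n_points: int = 8) -> float:
    """Distance between f(x0, y0) and the mean of f over n_points equally
    spaced points on a circle of the given radius around (x0, y0).

    The discrete mean approximates the continuous circle average used in the
    mean-value property. It is exact for harmonic polynomials of degree below
    n_points, so those give zero up to floating-point rounding. Non-harmonic
    functions give a nonzero deviation.
    """
    if n_points < 3:
        # Fewer than 3 points cannot cancel even the quadratic terms.
        raise ValueError("n_points must be at least 3")
    centre = list(f(x0, y0))
    totals = [0.0] * len(centre)
    for j in range(n_points):
        theta = 2.0 * math.pi * j / n_points
        value = f(x0 + radius * math.cos(theta), y0 + radius * math.sin(theta))
        for i, component in enumerate(value):
            totals[i] += component
    mean = [total / n_points for total in totals]
    return math.dist(centre, mean)


def harmonic_field(x: float, y: float) -> tuple[float, float]:
    # Real and imaginary parts of z**2 with z = x + iy; both are harmonic.
    return (x * x - y * y, 2.0 * x * y)


def radial_square(x: float, y: float) -> tuple[float]:
    # x**2 + y**2 has Laplacian 4, so it is not harmonic.
    return (x * x + y * y,)


if __name__ == "__main__":
    print(round(mean_value_deviation(harmonic_field, 1.0, 2.0), 9))  # 0.0
    print(round(mean_value_deviation(radial_square, 1.0, 2.0), 9))   # 0.01
```

For $x^2 + y^2$ the deviation equals the squared radius at every point, which shows how the measure grows away from harmonicity. Applying this to an LLM would also require mapping prompts to points and responses to vectors. The abstract does not say how the authors do that, so this sketch leaves that step out.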
Related papers
- RoBoN: Routed Online Best-of-n for Test-Time Scaling with Multiple LLMs [0.0]
Routed Online Best-of-$n$ is a sequential multi-LLM alternative to the prevailing single-model best-of-$n$.
Our results indicate that diversity across models can be exploited at inference to improve best-of-$n$ performance over any constituent model alone.
arXiv Detail & Related papers (2025-12-05T08:55:39Z)
- RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement.
In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns.
These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
arXiv Detail & Related papers (2025-11-27T07:20:52Z)
- Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification [1.2234742322758418]
This study quantifies how prompting strategies interact with large language models (LLMs) to automate the screening stage of systematic literature reviews.
We evaluate six LLMs (GPT-4o, GPT-4o-mini, DeepSeek-Chat-V3, Gemini-2.5-Flash, Claude-3.5-Haiku, Llama-4-Maverick) under five prompt types.
CoT-few-shot yields the most reliable precision-recall balance; zero-shot maximizes recall for high-sensitivity passes; and self-reflection underperforms due to over-inclusivity and instability across models.
arXiv Detail & Related papers (2025-10-17T16:53:09Z)
- LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought [18.539462131974215]
We propose Expert Tree-of-Thought (LETToT), a framework that leverages expert-derived reasoning structures.
Results demonstrate the effectiveness of our systematically optimized expert ToT with 4.99-14.15% relative quality gains over baselines.
arXiv Detail & Related papers (2025-08-15T07:37:12Z)
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks.
During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale.
It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
- Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification.
We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z)
- Can LLMs handle WebShell detection? Overcoming Detection Challenges with Behavioral Function-Aware Framework [11.613261852608062]
WebShell attacks, in which malicious scripts are injected into web servers, are a major cybersecurity threat.
This work is the first to explore the feasibility and limitations of Large Language Models for WebShell detection.
arXiv Detail & Related papers (2025-04-14T21:09:37Z)
- Large Language Model Confidence Estimation via Black-Box Access [30.490207799344333]
We explore the problem of estimating confidence for responses of large language models (LLMs) with only black-box or query access to them.
We propose a simple and generalizable framework in which we engineer novel features and train an (interpretable) model (viz. logistic regression) on these features to estimate the confidence.
We empirically demonstrate that our simple framework is effective in estimating confidence of Flan-ul2,-13b, Mistral-7b and GPT-4 on four benchmark Q&A tasks as well as Pegasus-large and BART-large on two benchmark summarization tasks.
arXiv Detail & Related papers (2024-06-01T02:08:44Z)
- RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [94.03511733306296]
We introduce RLAIF-V, a framework that aligns MLLMs in a fully open-source paradigm for super GPT-4V trustworthiness.
RLAIF-V maximally exploits the open-source feedback from two perspectives, including high-quality feedback data and online feedback learning algorithm.
Experiments show that RLAIF-V substantially enhances the trustworthiness of models without sacrificing performance on other tasks.
arXiv Detail & Related papers (2024-05-27T14:37:01Z)
- How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments [83.78240828340681]
We introduce GAMA($\gamma$)-Bench, a new framework for evaluating Large Language Models' Gaming Ability in Multi-Agent environments.
$\gamma$-Bench includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to assess LLMs' performance.
Results indicate GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought.
arXiv Detail & Related papers (2024-03-18T14:04:47Z)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models, such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
- Generation, Distillation and Evaluation of Motivational Interviewing-Style Reflections with a Foundational Language Model [2.33956825429387]
We present a method for distilling the generation of reflections from a Foundational Language Model into smaller models.
We first show that GPT-4, using zero-shot prompting, can generate reflections at near 100% success rate.
We also show that GPT-4 can help in the labor-intensive task of evaluating the quality of the distilled models.
arXiv Detail & Related papers (2024-02-01T22:54:31Z)
- Split and Merge: Aligning Position Biases in Large Language Model based Evaluators [23.38206418382832]
PORTIA is an alignment-based system designed to mimic human comparison strategies to calibrate position bias.
Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested.
It rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%.
arXiv Detail & Related papers (2023-09-29T14:38:58Z)
- Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning.
This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation.
We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency.
arXiv Detail & Related papers (2023-06-22T17:31:44Z)
- Ensemble of Averages: Improving Model Selection and Boosting Performance in Domain Generalization [63.28279815753543]
In Domain Generalization (DG) settings, models trained on a given set of training domains have notoriously chaotic performance on shifted test domains.
We first show that a simple protocol for averaging model parameters along the optimization path, starting early during training, significantly boosts domain generalization.
We show that an ensemble of independently trained models also has a chaotic behavior in the DG setting.
arXiv Detail & Related papers (2021-10-21T00:08:17Z)
- Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
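We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error, where $\gamma$ is the discount factor and $S$, $A$, $B$ denote the state space and the action spaces of the two agents.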
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)