Related papers: ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

URL: http://arxiv.org/abs/2508.16889v4
Date: Wed, 08 Oct 2025 21:13:00 GMT
Title: ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks
Authors: Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
Abstract summary: We present ObjexMT, a benchmark for objective extraction and metacognition. Given a multi-turn transcript, a model must output a one-sentence base objective and a self-reported confidence. Accuracy is scored by similarity to gold objectives, then thresholded once on 300 calibration items.
Score: 12.396822247035578
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM-as-a-Judge (LLMaaJ) enables scalable evaluation, yet we lack a decisive test of a judge's qualification: can it recover the hidden objective of a conversation and know when that inference is reliable? Large language models degrade with irrelevant or lengthy context, and multi-turn jailbreaks can scatter goals across turns. We present ObjexMT, a benchmark for objective extraction and metacognition. Given a multi-turn transcript, a model must output a one-sentence base objective and a self-reported confidence. Accuracy is scored by semantic similarity to gold objectives, then thresholded once on 300 calibration items ($\tau^\star = 0.66$; $F_1@\tau^\star = 0.891$). Metacognition is assessed with expected calibration error, Brier score, Wrong@High-Confidence (0.80 / 0.90 / 0.95), and risk-coverage curves. Across six models (gpt-4.1, claude-sonnet-4, Qwen3-235B-A22B-FP8, kimi-k2, deepseek-v3.1, gemini-2.5-flash) evaluated on SafeMTData_Attack600, SafeMTData_1K, and MHJ, kimi-k2 achieves the highest objective-extraction accuracy (0.612; 95% CI [0.594, 0.630]), while claude-sonnet-4 (0.603) and deepseek-v3.1 (0.599) are statistically tied. claude-sonnet-4 offers the best selective risk and calibration (AURC 0.242; ECE 0.206; Brier 0.254). Performance varies sharply across datasets (16-82% accuracy), showing that automated obfuscation imposes challenges beyond model choice. High-confidence errors remain: Wrong@0.90 ranges from 14.9% (claude-sonnet-4) to 47.7% (Qwen3-235B-A22B-FP8). ObjexMT therefore supplies an actionable test for LLM judges: when objectives are implicit, judges often misinfer them; exposing objectives or gating decisions by confidence is advisable. All experimental data are in the Supplementary Material and at https://github.com/hyunjun1121/ObjexMT_dataset.
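
The metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the equal-width 10-bin ECE, the reading of Wrong@High-Confidence as the share of wrong items among those at or above a confidence level, and the discrete AURC estimate (mean selective risk over all coverage levels) are assumptions made for illustration, and the paper's exact definitions may differ. Only the threshold value 0.66 is taken from the abstract.

```python
from typing import List, Sequence


def binarize(similarities: Sequence[float], tau: float = 0.66) -> List[int]:
    """Mark an extraction as correct (1) when its similarity reaches tau."""
    return [1 if s >= tau else 0 for s in similarities]


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(conf: Sequence[float],
                               correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE (assumed binning scheme)."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(conf)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += len(items) / total * abs(avg_conf - accuracy)
    return ece


def wrong_at_high_confidence(conf: Sequence[float],
                             correct: Sequence[int],
                             level: float) -> float:
    """Share of wrong items among those with confidence >= level (assumed definition)."""
    selected = [y for c, y in zip(conf, correct) if c >= level]
    if not selected:
        return 0.0  # no item reaches this confidence level
    return sum(1 - y for y in selected) / len(selected)


def aurc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Discrete area under the risk-coverage curve (assumed estimator).

    Items are ranked by confidence, highest first; the selective risk at
    coverage k/n is the error rate among the k most confident items.
    """
    ranked = sorted(zip(conf, correct), key=lambda pair: -pair[0])
    errors = 0
    risks = []
    for k, (_, y) in enumerate(ranked, start=1):
        errors += 1 - y
        risks.append(errors / k)
    return sum(risks) / len(risks)


# Toy data, invented for illustration only.
similarities = [0.9, 0.7, 0.5, 0.3]
confidences = [0.95, 0.9, 0.9, 0.2]
labels = binarize(similarities)  # threshold 0.66 from the abstract
print(labels)
print(brier_score(confidences, labels))
print(expected_calibration_error(confidences, labels))
print(wrong_at_high_confidence(confidences, labels, 0.90))
print(aurc(confidences, labels))
```

Lower values are better for all four scores. Gating decisions by confidence, as the abstract advises, corresponds to moving left along the risk-coverage curve: fewer items are answered, and they are answered at lower risk if confidence is informative.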

Related papers

EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge [8.50639201265868]
We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points.
arXiv Detail & Related papers (2026-01-14T04:26:43Z)
Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems [0.29465623430708904]
Uncalibrated scores can invert preferences, naive confidence intervals on uncalibrated scores achieve near-0% coverage, and importance-weighted estimators collapse under limited overlap. We introduce Causal Judge Evaluation, a framework that fixes all three failures.
arXiv Detail & Related papers (2025-12-11T22:16:24Z)
CLUE: Non-parametric Verification from Experience via Hidden-State Clustering [64.50919789875233]
We show that correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates.
arXiv Detail & Related papers (2025-10-02T02:14:33Z)
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z)
Tracing LLM Reasoning Processes with Strategic Games: A Framework for Planning, Revision, and Resource-Constrained Decision Making [38.75183725659772]
Large language models (LLMs) are increasingly used for tasks that require complex reasoning. We argue that measuring internal processes is essential for understanding model behavior and improving reliability. We introduce a framework that evaluates LLMs along three core dimensions: planning, revision, and resource-constrained decision making.
arXiv Detail & Related papers (2025-06-13T17:59:10Z)
LPASS: Linear Probes as Stepping Stones for vulnerability detection using compressed LLMs [0.0]
We show how Linear Probes can be used to estimate the performance of a compressed large language model. We also show their suitability for setting the cut-off point when applying layer pruning compression. Our approach, dubbed LPASS, is applied in BERT and Gemma for the detection of 12 of MITRE's Top 25 most dangerous vulnerabilities on 480k C/C++ samples.
arXiv Detail & Related papers (2025-05-30T10:37:14Z)
VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation [0.8087612190556891]
VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER. Our results show that current state-of-the-
arXiv Detail & Related papers (2025-05-26T01:20:44Z)
MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools [54.63478102768333]
Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions. We propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools.
arXiv Detail & Related papers (2025-04-28T18:06:38Z)
Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs [4.492376241514766]
Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. We present and evaluate a method to assess the robustness of LLM alignment.
arXiv Detail & Related papers (2025-01-27T22:13:05Z)
Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs) [0.5434005537854512]
This study explored the potential of large language models (LLMs) to automate OSCE evaluations using the Master Interview Rating Scale (MIRS). We compared the performance of four state-of-the-art LLMs in evaluating OSCE transcripts across all 28 items of the MIRS under the conditions of zero-shot, chain-of-thought (CoT), few-shot, and multi-step prompting.
arXiv Detail & Related papers (2025-01-21T04:05:45Z)
SummExecEdit: A Factual Consistency Benchmark in Summarization with Executable Edits [31.98028879922584]
We introduce SummExecEdit, a novel pipeline and benchmark to assess models on their ability to both detect factual errors and provide accurate explanations. The top-performing model, Claude3-Opus, achieves a joint detection and explanation score of only 0.49 in our benchmark. We identify four primary types of explanation errors, with 45.4% of them involving a focus on completely unrelated parts of the summary.
arXiv Detail & Related papers (2024-12-17T23:26:44Z)
Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios [49.53589774730807]
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. We reveal a response uncertainty phenomenon: twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue.
arXiv Detail & Related papers (2024-11-05T01:11:28Z)
LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering. We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z)
Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. Our research investigates the fragility of uncertainty estimation and explores potential attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z)
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests. First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics. Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
Flames: Benchmarking Value Alignment of LLMs in Chinese [86.73527292670308]
This paper proposes a value alignment benchmark named Flames. It encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values. Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames.
arXiv Detail & Related papers (2023-11-12T17:18:21Z)
LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.