EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
- URL: http://arxiv.org/abs/2601.09142v1
- Date: Wed, 14 Jan 2026 04:26:43 GMT
- Title: EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
- Authors: Shijian Ma, Yan Lin, Yi Yang
- Abstract summary: We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points.
- Score: 8.50639201265868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's Kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework leveraging a core insight: disagreement between frontier LLMs signals hard examples most valuable for training. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs 0.393) - evidence that disagreement mining acts as implicit regularization. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points and approaching frontier LLM performance at a fraction of inference cost.
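The multi-model annotation framework described above can be sketched as follows. This is a minimal illustration, not the paper's actual interface: the function names, callable signatures, and the 0/1/2 encoding of the three evasion levels are assumptions for the sketch.

```python
from typing import Callable, List, Tuple

Label = int  # evasion level, e.g. 0 = direct, 1 = partial, 2 = evasive (assumed encoding)

def annotate_with_disagreement_mining(
    samples: List[str],
    annotator_a: Callable[[str], Label],
    annotator_b: Callable[[str], Label],
    judge: Callable[[str, Label, Label], Label],
) -> Tuple[List[Tuple[str, Label]], List[Tuple[str, Label]]]:
    """Label samples with two strong annotators; escalate conflicts to a judge.

    Returns (consensus_set, judge_resolved_set). The judge-resolved boundary
    cases, where the annotators disagree, are the hard examples the paper
    argues are most valuable for training.
    """
    consensus, boundary = [], []
    for text in samples:
        a, b = annotator_a(text), annotator_b(text)
        if a == b:
            consensus.append((text, a))
        else:
            boundary.append((text, judge(text, a, b)))
    return consensus, boundary
```

A training set built this way mixes easy consensus labels with judge-resolved boundary cases; per the abstract, the latter improve generalization despite a higher training loss.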
Related papers
- Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models [50.99097734404912]
We show that RL-friendly models exhibit intra-class compactness and inter-class separation in their probability assignments to correct vs. incorrect responses. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains up to 5.9 points on AIME24.
arXiv Detail & Related papers (2026-01-11T13:34:44Z) - Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making [1.2691047660244335]
Large language models (LLMs) are increasingly deployed as autonomous decision agents in settings with asymmetric error costs. We propose a Bayesian, cost-aware multi-LLM orchestration framework that treats LLMs as approximate likelihood models.
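The core idea of treating LLM verdicts as approximate likelihoods can be sketched with a binary hypothesis. This is an illustrative reading of the abstract, not the paper's method: parameterizing each model by a sensitivity and specificity, and the cost thresholding rule, are assumptions of the sketch.

```python
from typing import Iterable

def posterior_after_votes(prior: float, votes: Iterable[int],
                          sens: float, spec: float) -> float:
    """Bayesian update of P(event) given binary LLM votes.

    Each LLM is treated as a noisy test with sensitivity `sens`
    (P(vote=1 | event)) and specificity `spec` (P(vote=0 | no event)).
    """
    p = prior
    for v in votes:
        like_event = sens if v else (1 - sens)
        like_none = (1 - spec) if v else spec
        num = p * like_event
        p = num / (num + (1 - p) * like_none)
    return p

def cost_aware_decision(p_event: float, cost_fn: float, cost_fp: float) -> str:
    """Act iff the expected cost of missing the event exceeds that of a false alarm."""
    return "act" if p_event * cost_fn > (1 - p_event) * cost_fp else "pass"
```

With asymmetric costs (say a missed event is 10x worse than a false alarm), the decision rule acts even at modest posteriors, which is the point of cost-aware orchestration.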
arXiv Detail & Related papers (2026-01-04T13:19:27Z) - EdgeJury: Cross-Reviewed Small-Model Ensembles for Truthful Question Answering on Serverless Edge Inference [0.0]
We present EdgeJury, a lightweight ensemble framework that improves truthfulness and robustness. On TruthfulQA (MC1), EdgeJury achieves 76.2% accuracy. On a 200-question adversarial EdgeCases set, EdgeJury yields +48.2% relative gains.
arXiv Detail & Related papers (2025-12-29T14:48:40Z) - RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement. In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
arXiv Detail & Related papers (2025-11-27T07:20:52Z) - Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People [81.63702981397408]
Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling).
arXiv Detail & Related papers (2025-10-23T17:57:28Z) - ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks [12.396822247035578]
We present ObjexMT, a benchmark for objective extraction and metacognition. Given a multi-turn transcript, a model must output a one-sentence base objective and a self-reported confidence. Accuracy is scored by similarity to gold objectives, then thresholded once on 300 calibration items.
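The scoring scheme above (similarity to a gold objective, thresholded once on calibration items) can be sketched as follows. Token-level Jaccard similarity is a stand-in for whatever similarity measure the paper actually uses; the grid search is likewise an illustrative assumption.

```python
from typing import List, Tuple

def jaccard_similarity(a: str, b: str) -> float:
    """Token-overlap similarity, used here as a placeholder metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def calibrate_threshold(pairs: List[Tuple[str, str]],
                        labels: List[bool]) -> float:
    """Pick the similarity cutoff that best matches human correct/incorrect
    judgments on a held-out calibration set (the paper uses 300 items)."""
    grid = [i / 20 for i in range(21)]
    def acc(t: float) -> float:
        return sum((jaccard_similarity(p, g) >= t) == y
                   for (p, g), y in zip(pairs, labels)) / len(labels)
    return max(grid, key=acc)
```

Once the threshold is fixed on the calibration items, every test prediction is scored as correct iff its similarity to the gold objective clears the cutoff.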
arXiv Detail & Related papers (2025-08-23T03:32:04Z) - Adversarial Preference Learning for Robust LLM Alignment [24.217309343426297]
Adversarial Preference Learning (APL) is an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model's intrinsic preference probabilities. Second, a conditional generative attacker that synthesizes input-specific adversarial variations.
arXiv Detail & Related papers (2025-05-30T09:02:07Z) - Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z) - Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios [49.53589774730807]
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. We reveal a response uncertainty phenomenon: twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue.
arXiv Detail & Related papers (2024-11-05T01:11:28Z) - Batch-in-Batch: a new adversarial training framework for initial perturbation and sample selection [9.241737058291823]
Adversarial training methods generate independent initial perturbations for adversarial samples from a simple uniform distribution.
We propose a simple yet effective training framework called Batch-in-Batch to enhance models.
We show that models trained within the BB framework consistently have higher adversarial accuracy across various adversarial settings.
arXiv Detail & Related papers (2024-06-06T13:34:43Z) - How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4V, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z) - G$^2$uardFL: Safeguarding Federated Learning Against Backdoor Attacks through Attributed Client Graph Clustering [116.4277292854053]
Federated Learning (FL) offers collaborative model training without data sharing.
FL is vulnerable to backdoor attacks, where poisoned model weights lead to compromised system integrity.
We present G$^2$uardFL, a protective framework that reinterprets the identification of malicious clients as an attributed graph clustering problem.
arXiv Detail & Related papers (2023-06-08T07:15:04Z) - Latent Imitator: Generating Natural Individual Discriminatory Instances for Black-Box Fairness Testing [45.183849487268496]
This paper proposes a framework named Latent Imitator (LIMI) to generate more natural individual discriminatory instances.
We first derive a surrogate linear boundary to approximate the decision boundary of the target model.
We then manipulate random latent vectors to the surrogate boundary with a one-step movement, and further conduct vector calculation to probe two potential discriminatory candidates.
arXiv Detail & Related papers (2023-05-19T11:29:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.