EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
- URL: http://arxiv.org/abs/2601.09142v1
- Date: Wed, 14 Jan 2026 04:26:43 GMT
- Title: EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
- Authors: Shijian Ma, Yan Lin, Yi Yang
- Abstract summary: We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points.
- Score: 8.50639201265868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's Kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework leveraging a core insight: disagreement between frontier LLMs signals hard examples most valuable for training. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs 0.393) - evidence that disagreement mining acts as implicit regularization. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points and approaching frontier LLM performance at a fraction of inference cost.
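The multi-model annotation framework described above can be sketched as follows. This is a minimal illustration, not the paper's actual interface: the function names, callable signatures, and the 0/1/2 encoding of the three evasion levels are assumptions for the sketch.

```python
from typing import Callable, List, Tuple

Label = int  # evasion level, e.g. 0 = direct, 1 = partial, 2 = evasive (assumed encoding)

def annotate_with_disagreement_mining(
    samples: List[str],
    annotator_a: Callable[[str], Label],
    annotator_b: Callable[[str], Label],
    judge: Callable[[str, Label, Label], Label],
) -> Tuple[List[Tuple[str, Label]], List[Tuple[str, Label]]]:
    """Label samples with two strong annotators; escalate conflicts to a judge.

    Returns (consensus_set, judge_resolved_set). The judge-resolved boundary
    cases, where the annotators disagree, are the hard examples the paper
    argues are most valuable for training.
    """
    consensus, boundary = [], []
    for text in samples:
        a, b = annotator_a(text), annotator_b(text)
        if a == b:
            consensus.append((text, a))
        else:
            boundary.append((text, judge(text, a, b)))
    return consensus, boundary
```

A training set built this way mixes easy consensus labels with judge-resolved boundary cases; per the abstract, the latter improve generalization despite a higher training loss.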
Related papers
- Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models [50.99097734404912]
We show that RL-friendly models exhibit intra-class compactness and inter-class separation in their probability assignments to correct vs. incorrect responses. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains up to 5.9 points on AIME24.
arXiv Detail & Related papers (2026-01-11T13:34:44Z) - Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making [1.2691047660244335]
Large language models (LLMs) are increasingly deployed as autonomous decision agents in settings with asymmetric error costs. We propose a Bayesian, cost-aware multi-LLM orchestration framework that treats LLMs as approximate likelihood models.
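The core idea of treating LLM verdicts as approximate likelihoods can be sketched with a binary hypothesis. This is an illustrative reading of the abstract, not the paper's method: parameterizing each model by a sensitivity and specificity, and the cost thresholding rule, are assumptions of the sketch.

```python
from typing import Iterable

def posterior_after_votes(prior: float, votes: Iterable[int],
                          sens: float, spec: float) -> float:
    """Bayesian update of P(event) given binary LLM votes.

    Each LLM is treated as a noisy test with sensitivity `sens`
    (P(vote=1 | event)) and specificity `spec` (P(vote=0 | no event)).
    """
    p = prior
    for v in votes:
        like_event = sens if v else (1 - sens)
        like_none = (1 - spec) if v else spec
        num = p * like_event
        p = num / (num + (1 - p) * like_none)
    return p

def cost_aware_decision(p_event: float, cost_fn: float, cost_fp: float) -> str:
    """Act iff the expected cost of missing the event exceeds that of a false alarm."""
    return "act" if p_event * cost_fn > (1 - p_event) * cost_fp else "pass"
```

With asymmetric costs (say a missed event is 10x worse than a false alarm), the decision rule acts even at modest posteriors, which is the point of cost-aware orchestration.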
arXiv Detail & Related papers (2026-01-04T13:19:27Z) - EdgeJury: Cross-Reviewed Small-Model Ensembles for Truthful Question Answering on Serverless Edge Inference [0.0]
We present EdgeJury, a lightweight ensemble framework that improves truthfulness and robustness. On TruthfulQA (MC1), EdgeJury achieves 76.2% accuracy. On a 200-question adversarial EdgeCases set, EdgeJury yields +48.2% relative gains.
arXiv Detail & Related papers (2025-12-29T14:48:40Z) - RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement. In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
arXiv Detail & Related papers (2025-11-27T07:20:52Z) - Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People [81.63702981397408]
Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling).
arXiv Detail & Related papers (2025-10-23T17:57:28Z) - ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks [12.396822247035578]
We present ObjexMT, a benchmark for objective extraction and metacognition. Given a multi-turn transcript, a model must output a one-sentence base objective and a self-reported confidence. Accuracy is scored by similarity to gold objectives, then thresholded once on 300 calibration items.
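The scoring scheme above (similarity to a gold objective, thresholded once on calibration items) can be sketched as follows. Token-level Jaccard similarity is a stand-in for whatever similarity measure the paper actually uses; the grid search is likewise an illustrative assumption.

```python
from typing import List, Tuple

def jaccard_similarity(a: str, b: str) -> float:
    """Token-overlap similarity, used here as a placeholder metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def calibrate_threshold(pairs: List[Tuple[str, str]],
                        labels: List[bool]) -> float:
    """Pick the similarity cutoff that best matches human correct/incorrect
    judgments on a held-out calibration set (the paper uses 300 items)."""
    grid = [i / 20 for i in range(21)]
    def acc(t: float) -> float:
        return sum((jaccard_similarity(p, g) >= t) == y
                   for (p, g), y in zip(pairs, labels)) / len(labels)
    return max(grid, key=acc)
```

Once the threshold is fixed on the calibration items, every test prediction is scored as correct iff its similarity to the gold objective clears the cutoff.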
arXiv Detail & Related papers (2025-08-23T03:32:04Z) - Adversarial Preference Learning for Robust LLM Alignment [24.217309343426297]
Adversarial Preference Learning (APL) is an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model's intrinsic preference probabilities. Second, a conditional generative attacker that synthesizes input-specific adversarial variations.
arXiv Detail & Related papers (2025-05-30T09:02:07Z) - Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z) - Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios [49.53589774730807]
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. We reveal a response uncertainty phenomenon: twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue.
arXiv Detail & Related papers (2024-11-05T01:11:28Z) - Batch-in-Batch: a new adversarial training framework for initial perturbation and sample selection [9.241737058291823]
Adversarial training methods generate independent initial perturbations for adversarial samples from a simple uniform distribution.
We propose a simple yet effective training framework called Batch-in-Batch to enhance models.
We show that models trained within the BB framework consistently have higher adversarial accuracy across various adversarial settings.
arXiv Detail & Related papers (2024-06-06T13:34:43Z) - How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4V, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z) - G$^2$uardFL: Safeguarding Federated Learning Against Backdoor Attacks through Attributed Client Graph Clustering [116.4277292854053]
Federated Learning (FL) offers collaborative model training without data sharing.
FL is vulnerable to backdoor attacks, where poisoned model weights lead to compromised system integrity.
We present G$^2$uardFL, a protective framework that reinterprets the identification of malicious clients as an attributed graph clustering problem.
arXiv Detail & Related papers (2023-06-08T07:15:04Z) - Latent Imitator: Generating Natural Individual Discriminatory Instances for Black-Box Fairness Testing [45.183849487268496]
This paper proposes a framework named Latent Imitator (LIMI) to generate more natural individual discriminatory instances.
We first derive a surrogate linear boundary to approximate the decision boundary of the target model.
We then manipulate random latent vectors to the surrogate boundary with a one-step movement, and further conduct vector calculation to probe two potential discriminatory candidates.
arXiv Detail & Related papers (2023-05-19T11:29:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.