EdgeJury: Cross-Reviewed Small-Model Ensembles for Truthful Question Answering on Serverless Edge Inference
- URL: http://arxiv.org/abs/2601.00850v1
- Date: Mon, 29 Dec 2025 14:48:40 GMT
- Title: EdgeJury: Cross-Reviewed Small-Model Ensembles for Truthful Question Answering on Serverless Edge Inference
- Authors: Aayush Kumar,
- Abstract summary: We present EdgeJury, a lightweight ensemble framework that improves truthfulness and robustness. On TruthfulQA (MC1), EdgeJury achieves 76.2% accuracy. On a 200-question adversarial EdgeCases set, EdgeJury yields +48.2% relative gains.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hallucinations hinder reliable question answering, especially in resource-constrained deployments where frontier-scale models or retrieval pipelines may be impractical. We present EdgeJury, a lightweight ensemble framework that improves truthfulness and robustness using only small instruction-tuned language models (3B-8B) suitable for serverless edge inference. EdgeJury orchestrates four stages: (1) parallel role-specialized generation, (2) anonymized cross-review with structured critiques and rankings, (3) chairman synthesis that integrates the strongest content while addressing flagged issues, and (4) claim-level consistency labeling based on inter-model agreement. On TruthfulQA (MC1), EdgeJury achieves 76.2% accuracy (95% CI: 72.8-79.6%), a +21.4% relative improvement over a single 8B baseline (62.8%), and outperforms standard baselines including self-consistency and majority voting under transparent compute accounting (total tokens and platform cost reported). On a 200-question adversarial EdgeCases set, EdgeJury yields +48.2% relative gains (95% CI: 44.0-52.4%). Manual analysis on 100 incorrect answers shows an approximately 55% reduction in factual hallucination errors versus the single-model baseline. Deployed on Cloudflare Workers AI, EdgeJury achieves 8.4 s median end-to-end latency, demonstrating that coordinated small-model ensembles can improve truthfulness on misconception-heavy QA benchmarks without external retrieval or proprietary large-model APIs.
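The abstract describes the pipeline only at the level of its four stages. The snippet below is a minimal sketch of that control flow, assuming a generic `call_model` stub in place of the Cloudflare Workers AI bindings; the role prompts, the choice of chairman, and the two-vote agreement threshold are illustrative assumptions, not the authors' implementation.

```python
"""Minimal sketch of an EdgeJury-style four-stage ensemble.
Assumptions: the role prompts, chairman choice, and agreement threshold
below are illustrative; `call_model` stands in for a serverless inference
call (e.g. an HTTP request to a hosted 3B-8B model)."""

ROLES = ["fact-checker", "domain explainer", "skeptic"]      # assumed roles
JURORS = ["model-a-3b", "model-b-7b", "model-c-8b"]          # assumed models


def call_model(model: str, prompt: str) -> str:
    """Placeholder for a small instruction-tuned model served at the edge."""
    return f"[{model}] answer to: {prompt[:40]}..."


def generate_drafts(question: str) -> dict:
    """Stage 1: parallel role-specialized generation."""
    return {m: call_model(m, f"As a {role}, answer truthfully: {question}")
            for m, role in zip(JURORS, ROLES)}


def cross_review(question: str, drafts: dict) -> dict:
    """Stage 2: anonymized cross-review with critiques and rankings."""
    anon = "\n".join(f"Answer {i + 1}: {d}" for i, d in enumerate(drafts.values()))
    prompt = (f"Question: {question}\n{anon}\n"
              "Critique each answer and rank them from best to worst.")
    return {m: call_model(m, prompt) for m in JURORS}


def chairman_synthesis(question: str, drafts: dict, reviews: dict) -> str:
    """Stage 3: one model synthesizes a final answer from drafts and critiques."""
    prompt = (f"Question: {question}\nDrafts: {list(drafts.values())}\n"
              f"Critiques: {list(reviews.values())}\n"
              "Write one final answer that keeps the strongest content and "
              "fixes every flagged issue.")
    return call_model(JURORS[-1], prompt)   # assumption: largest model chairs


def label_claims(final_answer: str, drafts: dict) -> list:
    """Stage 4: claim-level consistency labels from inter-model agreement."""
    labels = []
    for claim in (c.strip() for c in final_answer.split(".") if c.strip()):
        votes = sum(claim.lower() in d.lower() for d in drafts.values())
        labels.append((claim, "consistent" if votes >= 2 else "uncertain"))
    return labels


if __name__ == "__main__":
    q = "Can you see the Great Wall of China from space with the naked eye?"
    drafts = generate_drafts(q)
    reviews = cross_review(q, drafts)
    final = chairman_synthesis(q, drafts, reviews)
    print(final)
    print(label_claims(final, drafts))
```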
Related papers
- When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning [16.505918019260964]
We demonstrate that state-of-the-art models (Qwen2.5-Math-7B) achieve 61% accuracy through a mixture of reliable and unreliable predictions. We show that 18.4% of correct predictions employ stable, faithful reasoning while 81.6% emerge through computationally inconsistent pathways.
arXiv Detail & Related papers (2026-03-03T19:43:36Z)
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters [169.7981969517903]
Step 3.5 Flash bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution.
arXiv Detail & Related papers (2026-02-11T07:53:51Z)
ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces [3.151184728006369]
We present ACAR, a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance (sigma) computed from N=3 probe samples to route tasks across single-model, two-model, and three-model execution modes; a minimal sketch of this routing rule appears after this list. We evaluate ACAR on 1,510 tasks spanning four benchmarks, producing more than 7,550 auditable runs.
arXiv Detail & Related papers (2026-02-06T23:27:17Z)
Assessing LLM Reliability on Temporally Recent Open-Domain Questions [15.456770184839726]
Large Language Models (LLMs) are increasingly deployed for open-domain question answering. We investigate how four open-source LLMs respond to 15,000 recent Reddit questions. All models achieve over 99% cosine similarity with references despite less than 8% BLEU-1 overlap.
arXiv Detail & Related papers (2026-01-17T21:33:27Z)
Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements [78.87065404966002]
Existing benchmarks predominantly curate questions at the question level. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up.
arXiv Detail & Related papers (2025-12-31T13:55:54Z)
Fortytwo: Swarm Inference with Peer-Ranked Consensus [36.94429692322632]
We present Fortytwo, a novel protocol that leverages swarm intelligence principles and distributed pairwise ranking consensus to achieve superior performance in AI inference. Using pairwise ranking with a custom Bradley-Terry-style aggregation model, we demonstrate that swarm inference substantially outperforms majority voting; a minimal sketch of Bradley-Terry aggregation appears after this list.
arXiv Detail & Related papers (2025-10-27T23:19:48Z)
A Multimodal Approach to Heritage Preservation in the Context of Climate Change [0.0]
We propose a lightweight multimodal architecture that fuses sensor data (temperature, humidity) with visual imagery to predict severity at heritage sites. On data from Strasbourg Cathedral, our model achieves 76.9% accuracy, a 43% improvement over standard multimodal architectures.
arXiv Detail & Related papers (2025-10-15T22:07:57Z)
CLUE: Non-parametric Verification from Experience via Hidden-State Clustering [64.50919789875233]
We show that correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates.
arXiv Detail & Related papers (2025-10-02T02:14:33Z)
Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression [68.69801176669843]
We propose an online post-training RL method that prunes redundant steps and estimates difficulty. TRAAC (Think Right with Adaptive, Attentive Compression) achieves an average absolute accuracy gain of 8.4%. Although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets.
arXiv Detail & Related papers (2025-10-02T02:00:20Z)
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks [12.396822247035578]
We present ObjexMT, a benchmark for objective extraction and metacognition. Given a multi-turn transcript, a model must output a one-sentence base objective and a self-reported confidence. Accuracy is scored by similarity to gold objectives, then thresholded once on 300 calibration items.
arXiv Detail & Related papers (2025-08-23T03:32:04Z)
A Confidence-Diversity Framework for Calibrating AI Judgement in Accessible Qualitative Coding Tasks [0.0]
Confidence-diversity calibration is a quality assessment framework for accessible coding tasks. Analysing 5,680 coding decisions from eight state-of-the-art LLMs, we find that mean self-confidence tracks inter-model agreement closely.
arXiv Detail & Related papers (2025-08-04T03:47:10Z)
Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight the reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
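For the ACAR entry above: the summary describes routing on self-consistency variance over N=3 probe samples. The sketch below is a generic rendering of that idea under assumed thresholds and a placeholder probe scorer; it is not the paper's implementation.

```python
"""Sketch of ACAR-style complexity routing (assumed thresholds; the probe
scorer is a placeholder, not the paper's scoring function)."""
from statistics import pvariance


def probe_scores(task: str, n: int = 3) -> list[float]:
    """Placeholder: agreement scores from n cheap self-consistency probes."""
    return [0.80, 0.60, 0.70][:n]        # stand-in values


def route(task: str, low: float = 0.005, high: float = 0.02) -> str:
    """Map probe-sample variance to single-, two-, or three-model execution."""
    sigma2 = pvariance(probe_scores(task))
    if sigma2 < low:
        return "single-model"            # probes agree: one model suffices
    if sigma2 < high:
        return "two-model"               # mild disagreement: add a reviewer
    return "three-model"                 # high variance: full ensemble


print(route("Which planet has the longest day?"))
```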
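For the Fortytwo entry above: a Bradley-Terry model turns pairwise preferences into per-candidate strength scores. The following is a generic minorization-maximization estimator, shown only to illustrate the aggregation step; the protocol's custom model and its distributed consensus layer are not reproduced here.

```python
"""Generic Bradley-Terry aggregation of pairwise preferences (illustrative;
not Fortytwo's exact aggregation model)."""
import numpy as np


def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of times candidate i was preferred over j."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        new_p = np.empty(n)
        for i in range(n):
            num = wins[i].sum()                          # total wins of i
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)       # weighted comparisons
            new_p[i] = num / den if den > 0 else p[i]
        p = new_p / new_p.sum()                          # normalize for stability
    return p


# Three candidate answers; entry [i, j] counts peer votes preferring i over j.
wins = np.array([[0, 4, 3],
                 [1, 0, 2],
                 [2, 3, 0]])
strengths = bradley_terry(wins)
print("consensus pick:", int(strengths.argmax()), strengths.round(3))
```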
This list is automatically generated from the titles and abstracts of the papers on this site.