Fortytwo: Swarm Inference with Peer-Ranked Consensus
- URL: http://arxiv.org/abs/2510.24801v1
- Date: Mon, 27 Oct 2025 23:19:48 GMT
- Title: Fortytwo: Swarm Inference with Peer-Ranked Consensus
- Authors: Vladyslav Larin, Ihor Naumenko, Aleksei Ivashov, Ivan Nikitin, Alexander Firsov,
- Abstract summary: We present Fortytwo, a novel protocol that leverages swarm intelligence principles and distributed pairwise ranking consensus to achieve superior performance in AI inference. Using pairwise ranking with a custom Bradley-Terry-style aggregation model, we demonstrate that swarm inference substantially outperforms majority voting.
- Score: 36.94429692322632
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As centralized AI hits compute ceilings and diminishing returns from ever-larger training runs, meeting demand requires an inference layer that scales horizontally in both capacity and capability. We present Fortytwo, a novel protocol that leverages swarm intelligence principles and distributed pairwise ranking consensus to achieve superior performance in AI inference. Our approach reimagines collaboration among AI nodes using swarm inference: a peer-ranked, reputation-weighted consensus across heterogeneous models that surfaces the highest-quality responses. Using pairwise ranking with a custom Bradley-Terry-style aggregation model, we demonstrate that swarm inference substantially outperforms majority voting, achieving 85.90% on GPQA Diamond versus 68.69% for majority voting with the same model set - an improvement of +17.21 percentage points (approximately +25.1% relative). The protocol incorporates on-chain reputation so node influence adapts to demonstrated accuracy over time, yielding a meritocratic consensus that filters low-quality or malicious participants. To resist Sybil attacks, Fortytwo employs proof-of-capability in its consensus: nodes must successfully complete calibration/test requests and stake reputation to enter ranking rounds, making multi-identity attacks economically unattractive while preserving openness. Across six challenging benchmarks, including GPQA Diamond, LiveCodeBench, and AIME, our evaluation indicates higher accuracy and strong resilience to adversarial and noisy free-form prompting (e.g., prompt-injection degradation of only 0.12% versus 6.20% for a monolithic single-model baseline), while retaining practical deployability. Together, these results establish a foundation for decentralized AI systems - democratizing access to high-quality inference through collective intelligence without sacrificing reliability or security.
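The abstract describes a custom Bradley-Terry-style aggregation over peer-ranked pairwise comparisons, with node reputation weighting each judge's influence. The paper's exact aggregation model is not given here; the following is a minimal sketch assuming standard Bradley-Terry MM updates over reputation-weighted win counts (the function names, the judgment format, and the weighting scheme are illustrative assumptions, not the protocol's actual implementation).

```python
def weighted_wins(judgments, n_responses):
    """Build a reputation-weighted pairwise win matrix from peer judgments.
    Each judgment is a tuple (judge_reputation, winner_idx, loser_idx)."""
    wins = [[0.0] * n_responses for _ in range(n_responses)]
    for rep, winner, loser in judgments:
        wins[winner][loser] += rep  # higher-reputation judges count more
    return wins

def bradley_terry(wins, iters=500, tol=1e-10):
    """Estimate Bradley-Terry quality scores via the standard MM algorithm:
    p_i is updated from i's total wins and the comparison counts against
    each opponent j, scaled by 1 / (p_i + p_j)."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            num = sum(wins[i][j] for j in range(n) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / max(p[i] + p[j], 1e-12)
                      for j in range(n) if j != i)
            new_p.append(num / den if den > 0 else p[i])
        s = sum(new_p) or 1.0
        new_p = [x * n / s for x in new_p]  # normalize to fix scale ambiguity
        if max(abs(a - b) for a, b in zip(p, new_p)) < tol:
            p = new_p
            break
        p = new_p
    return p

# Example: three candidate responses, judges with reputations 2.0 and 1.0.
judgments = [
    (2.0, 0, 1),  # high-reputation judge ranks response 0 above 1
    (2.0, 0, 2),
    (1.0, 1, 2),
    (1.0, 1, 0),  # low-reputation dissent, outweighed by the judge above
]
scores = bradley_terry(weighted_wins(judgments, 3))
best = max(range(3), key=lambda i: scores[i])  # consensus pick: response 0
```

Note how the low-reputation dissent shifts but does not flip the outcome: this is the meritocratic filtering the abstract attributes to reputation weighting.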
Related papers
- PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference [6.568081870814357]
PRISM is an inference algorithm that uses step-level verification to guide both population refinement and solution aggregation. Across mathematics and science benchmarks, PRISM is competitive with or outperforms existing DeepThink methods.
arXiv Detail & Related papers (2026-03-03T00:03:42Z)
- BiRQA: Bidirectional Robust Quality Assessment for Images [49.74447451098852]
Full-reference image quality assessment (FR IQA) is important for image compression, restoration, and generative modeling. We present BiRQA, a compact FR IQA metric model that processes four fast complementary features within a bidirectional multiscale pyramid. On five public FR IQA benchmarks, BiRQA outperforms or matches the previous state of the art (SOTA) while running 3x faster than previous SOTA models.
arXiv Detail & Related papers (2026-02-23T20:52:56Z)
- Adversarial Question Answering Robustness: A Multi-Level Error Analysis and Mitigation Study [0.0]
Question answering (QA) systems achieve impressive performance on standard benchmarks like SQuAD, but remain vulnerable to adversarial examples. This project investigates the adversarial robustness of transformer models on the AddSent adversarial dataset.
arXiv Detail & Related papers (2026-01-06T04:20:33Z)
- EdgeJury: Cross-Reviewed Small-Model Ensembles for Truthful Question Answering on Serverless Edge Inference [0.0]
We present EdgeJury, a lightweight ensemble framework that improves truthfulness and robustness. On TruthfulQA (MC1), EdgeJury achieves 76.2% accuracy. On a 200-question adversarial EdgeCases set, EdgeJury yields +48.2% relative gains.
arXiv Detail & Related papers (2025-12-29T14:48:40Z)
- Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning [12.354777054071379]
Test-time reinforcement learning mitigates reliance on annotated data by using majority-voting results as pseudo-labels. This voting strategy often induces confirmation bias and suffers from sparse rewards, limiting overall performance. We propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE) to address these issues.
arXiv Detail & Related papers (2025-12-17T07:21:54Z)
- DGTEN: A Robust Deep Gaussian based Graph Neural Network for Dynamic Trust Evaluation with Uncertainty-Quantification Support [2.4897847232811716]
DGTEN (Deep Gaussian-based Trust Evaluation Network) introduces a unified graph framework. It combines uncertainty-aware message passing, expressive temporal modeling, and built-in defenses against trust-targeted attacks. On two signed Bitcoin trust networks, DGTEN delivers significant improvements.
arXiv Detail & Related papers (2025-10-08T23:38:55Z)
- RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization [52.01526898310723]
We introduce RESTRAIN, a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data.
arXiv Detail & Related papers (2025-10-02T16:24:01Z)
- Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
- Hybrid Reputation Aggregation: A Robust Defense Mechanism for Adversarial Federated Learning in 5G and Edge Network Environments [0.0]
Federated Learning (FL) in 5G and edge network environments faces severe security threats from adversarial clients. This paper introduces Hybrid Reputation Aggregation (HRA), a novel robust aggregation mechanism designed to defend against adversarial behaviors in FL without prior knowledge of the attack type. HRA combines geometric anomaly detection with momentum-based reputation tracking of clients.
arXiv Detail & Related papers (2025-09-22T17:18:59Z)
- Nearest Neighbor Projection Removal Adversarial Training [5.146355145217634]
We introduce a novel adversarial training framework that actively mitigates inter-class proximity by projecting out inter-class dependencies from adversarial and clean samples. Our approach first identifies the nearest inter-class neighbors for each adversarial sample and subsequently removes projections onto these neighbors to enforce stronger feature separability.
arXiv Detail & Related papers (2025-09-09T12:38:41Z)
- VALID: a Validated Algorithm for Learning in Decentralized Networks with Possible Adversarial Presence [13.612214163974459]
We introduce the paradigm of validated decentralized learning for undirected networks with heterogeneous data.
The VALID protocol is the first to achieve a validated learning guarantee.
Remarkably, VALID retains optimal performance metrics in adversary-free environments.
arXiv Detail & Related papers (2024-05-12T15:55:43Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees [63.62448343531963]
We propose a combination of the existing paradigms: intelligently sampling the responses to be scored by humans.
We observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget.
arXiv Detail & Related papers (2021-11-17T05:00:51Z)
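The sampling-for-scoring entry above reports gains in quadratic weighted kappa (QWK), an agreement metric that penalizes rater disagreements by the squared distance between ratings. As a quick reference, here is a minimal self-contained implementation (the function name and argument conventions are ours, not from that paper):

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_classes=None):
    """Quadratic weighted kappa between two integer rating sequences.
    1.0 is perfect agreement; 0.0 is chance-level; negative values
    indicate systematic disagreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    if n_classes is None:
        n_classes = max(max(rater_a), max(rater_b)) + 1
    n = len(rater_a)
    # Observed co-occurrence matrix of (rating_a, rating_b) pairs.
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    # Expected matrix under independence, from the marginal histograms.
    hist_a = [list(rater_a).count(c) for c in range(n_classes)]
    hist_b = [list(rater_b).count(c) for c in range(n_classes)]
    expected = [[hist_a[i] * hist_b[j] / n for j in range(n_classes)]
                for i in range(n_classes)]
    # Quadratic disagreement weights: 0 on the diagonal, 1 at the extremes.
    weights = [[(i - j) ** 2 / (n_classes - 1) ** 2 for j in range(n_classes)]
               for i in range(n_classes)]
    num = sum(weights[i][j] * observed[i][j]
              for i in range(n_classes) for j in range(n_classes))
    den = sum(weights[i][j] * expected[i][j]
              for i in range(n_classes) for j in range(n_classes))
    return 1.0 - num / den
```

For example, identical rating sequences give 1.0, while two raters who always disagree on a binary scale give -1.0.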
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.