Humains-Junior: A 3.8B Language Model Achieving GPT-4o-Level Factual Accuracy by Directed Exoskeleton Reasoning
- URL: http://arxiv.org/abs/2510.25933v1
- Date: Wed, 29 Oct 2025 20:12:36 GMT
- Title: Humains-Junior: A 3.8B Language Model Achieving GPT-4o-Level Factual Accuracy by Directed Exoskeleton Reasoning
- Authors: Nissan Yaron, Dan Bystritsky, Ben-Etzion Yaron
- Abstract summary: Humans-Junior is a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a $\pm 5$ pp equivalence margin. Our approach combines minimal directed "Exoskeleton Reasoning" scaffolds with behavioral fine-tuning that teaches protocol compliance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Humans-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a $\pm 5$ pp equivalence margin. Results. On Q1--Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5--77.2) and Humans-Junior 72.7% (95% CI 68.7--76.5); the paired difference is 0.8 pp (bootstrap 95% CI $-3.1$ to $+4.7$; permutation $p = 0.72$; Cohen's $d = 0.023$). TOST establishes equivalence at $\pm 5$ pp (not at $\pm 3$ pp). When purchased as managed APIs, Humans-Junior's base model (Phi-3.5-mini-instruct) is $\approx 19\times$ less expensive than GPT-4o on Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured vs estimated pricing sources are tabulated in Appendix E. Method. Our approach combines minimal directed "Exoskeleton Reasoning" scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. Fine-tuning alone adds little; combined, they synergize (+17.7 pp, $p < 0.001$) and reduce variance ($\approx 25\%$). In prompt-only settings on frontier models (Q1--Q100; non-comparable), directed reasoning improved GPT-4o by +11.8 pp to 85.3% and Gemini-2.5-Pro by +5.0 pp to 93.3% (baseline 88.3%, $n = 100$); see Section 5. TL;DR. A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within $\pm 5$ pp on Q1--Q500). Cloud pricing shows $\approx 19\times$ lower cost versus GPT-4o, and self-hosted/edge deployments can approach zero marginal cost. Pricing sources are listed in Appendix E. Frontier prompt-only gains (Q1--Q100; non-comparable) and optimized-prompt exploratory results under earlier judges are summarized in Appendix F. Keywords: Small Language Models, Factual Grounding, Directed Reasoning, Fine-Tuning, Model Alignment, Cost-Efficient AI
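The equivalence claim in the abstract can be illustrated with a small sketch. The code below is not the authors' analysis code; it is a minimal illustration assuming a percentile bootstrap over per-question paired differences and a simple interval-inclusion check as a conservative stand-in for TOST. The function names `paired_bootstrap_ci` and `equivalent_within` are hypothetical, and only the reported interval ($-3.1$ to $+4.7$ pp) comes from the abstract.

```python
import random


def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b) over paired items.

    `a` and `b` are per-question scores (in percentage points) for two
    models on the same questions. Hypothetical helper, not from the paper.
    """
    rng = random.Random(seed)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    means = []
    for _ in range(n_boot):
        # Resample questions with replacement, keeping pairs together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def equivalent_within(ci, margin):
    """True if the whole interval lies strictly inside (-margin, +margin).

    Assumption: interval inclusion is used here as a conservative proxy
    for a TOST decision; the paper's exact procedure may differ.
    """
    lo, hi = ci
    return -margin < lo and hi < margin


# Interval reported in the abstract for the paired difference (pp).
reported_ci = (-3.1, 4.7)
print(equivalent_within(reported_ci, 5.0))  # True: inside +/- 5 pp
print(equivalent_within(reported_ci, 3.0))  # False: 4.7 exceeds 3 pp
```

Under this reading, the reported interval fits inside the $\pm 5$ pp margin but not the $\pm 3$ pp margin, which is consistent with the abstract's statement that equivalence holds at $\pm 5$ pp and not at $\pm 3$ pp.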
Related papers
- Adversarial Training for Process Reward Models [47.92183495904245]
We introduce Adversarially Trained PRMs (APRM), where a Generator ($G$) learns to produce reasoning errors to deceive a PRM ($R$). This interaction yields progressively harder negatives for $R$, improving its robustness and generalization to novel errors without requiring manual step-level labels.
arXiv Detail & Related papers (2025-11-28T05:32:01Z)
- A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning [40.6234318894435]
Large language models split into two families: reasoning-centric LLMs and agentic LLMs. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries. We present Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle.
arXiv Detail & Related papers (2025-10-13T17:08:25Z)
- Performance of GPT-5 Frontier Models in Ophthalmology Question Answering [6.225411871775591]
Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on medical question-answering tasks. We evaluated 12 configurations of OpenAI's GPT-5 series alongside o1-high, o3-high, and GPT-4o. GPT-5-high ranked first in both accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high). These results benchmark GPT-5 on a high-quality ophthalmology dataset, demonstrate the influence of reasoning effort on accuracy, and introduce an autograder framework for scalable evaluation of
arXiv Detail & Related papers (2025-08-13T17:17:17Z)
- IF-GUIDE: Influence Function-Guided Detoxification of LLMs [53.051109450536885]
We study how training data contributes to the emergence of toxic behaviors in large language models. We propose a proactive approach that leverages influence functions to identify harmful tokens within any training data and suppress their impact during training. We present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective.
arXiv Detail & Related papers (2025-06-02T15:32:36Z)
- ELECTRA and GPT-4o: Cost-Effective Partners for Sentiment Analysis [0.0]
This paper explores collaborative approaches between ELECTRA and GPT-4o for three-way sentiment classification. We fine-tuned four models using a mix of reviews from Stanford Sentiment Treebank (SST) and DynaSent. Our results show that augmenting prompts with predictions from fine-tuned encoders is an efficient way to boost performance.
arXiv Detail & Related papers (2024-12-29T05:29:52Z)
- Detect Llama -- Finding Vulnerabilities in Smart Contracts using Large Language Models [27.675558033502565]
We fine-tune open-source models to outperform GPT-4 in smart contract vulnerability detection.
For binary classification (i.e., is this smart contract vulnerable?), our two best-performing models, GPT-3.5FT and Detect Llama - Foundation, achieve F1 scores of.
For the evaluation against individual vulnerability identification, our top two models, GPT-3.5FT and Detect Llama - Foundation, both significantly outperformed GPT-4 and GPT-4 Turbo.
arXiv Detail & Related papers (2024-07-12T03:33:13Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
- A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course [0.0]
This study evaluates the performance of ChatGPT variants, GPT-3.5 and GPT-4, against solely student work and a mixed category containing both student and GPT-4 contributions in university-level physics coding assignments using the Python language.
Students averaged 91.9% (SE: 0.4), surpassing the highest performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference ($p = 2.482 \times 10^{-10}$).
The blinded markers were tasked with guessing the authorship of the submissions on a four-point Likert scale from Definitely
arXiv Detail & Related papers (2024-03-25T17:41:02Z)
- InheritSumm: A General, Versatile and Compact Summarizer by Distilling from GPT [75.29359361404073]
InheritSumm is a versatile and compact summarization model derived from GPT-3.5 through distillation.
It achieves similar or superior performance to GPT-3.5 in zeroshot and fewshot settings.
arXiv Detail & Related papers (2023-05-22T14:52:32Z)
- Ensemble of Averages: Improving Model Selection and Boosting Performance in Domain Generalization [63.28279815753543]
In Domain Generalization (DG) settings, models trained on a given set of training domains have notoriously chaotic performance on shifted test domains.
We first show that a simple protocol for averaging model parameters along the optimization path, starting early during training, significantly boosts domain generalization.
We show that an ensemble of independently trained models also has a chaotic behavior in the DG setting.
arXiv Detail & Related papers (2021-10-21T00:08:17Z)
- Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap [84.66885506098724]
This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDP), Adaptive Multi-step Bootstrap (AMB).
We show AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps.
We also show AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$'s satisfying $a$ is a non-unique optimal action for
arXiv Detail & Related papers (2021-02-09T07:46:34Z)
- Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples [47.27255244183513]
We study the effect of different training losses, model sizes, activation functions, the addition of unlabeled data (through pseudo-labeling) and other factors on adversarial robustness.
We discover that it is possible to train robust models that go well beyond state-of-the-art results by combining larger models, Swish/SiLU activations and model weight averaging.
arXiv Detail & Related papers (2020-10-07T18:19:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.