Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models
- URL: http://arxiv.org/abs/2508.12461v2
- Date: Fri, 26 Sep 2025 12:42:33 GMT
- Title: Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models
- Authors: Ziqian Bi, Keyu Chen, Chiung-Yi Tseng, Danyang Zhang, Tianyang Wang, Hongying Luo, Lu Chen, Junming Huang, Jibin Guan, Junfeng Hao, Junhao Song,
- Abstract summary: In August 2025, OpenAI released GPT-OSS models, its first open-weight large language models since GPT-2 in 2019. We evaluated both variants against six contemporary open-source large language models ranging from 14.7B to 235B parameters. Both models demonstrate mid-tier overall performance within the current open-source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks.
- Score: 13.622744836632231
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In August 2025, OpenAI released GPT-OSS models, its first open-weight large language models since GPT-2 in 2019, comprising two mixture-of-experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open-source large language models ranging from 14.7B to 235B parameters, representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemar's test and effect size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall performance within the current open source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further investigation into optimisation strategies and informing more efficient model selection for future open source deployments. More details and evaluation scripts are available at the project webpage: https://ai-agent-lab.github.io/gpt-oss
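The statistical validation mentioned in the abstract (McNemar's test on paired per-benchmark-item outcomes) can be sketched as follows. This is a minimal illustration using the standard continuity-corrected chi-square form, not the paper's actual evaluation scripts; the function name and the sample data are hypothetical.

```python
import math

def mcnemar(results_a, results_b):
    """Continuity-corrected McNemar's test on paired per-item
    correctness for two models evaluated on the same benchmark.
    results_a / results_b are equal-length lists of booleans
    (True = model answered that item correctly)."""
    # b: items model A got right and model B got wrong; c: the reverse.
    # Only these discordant pairs carry information about a difference.
    b = sum(1 for ra, rb in zip(results_a, results_b) if ra and not rb)
    c = sum(1 for ra, rb in zip(results_a, results_b) if not ra and rb)
    if b + c == 0:
        return 0.0, 1.0  # the models disagree on no items
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # For 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)), so no SciPy dependency is needed.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical paired outcomes for two models on a 62-item benchmark:
model_a = [True] * 10 + [False] * 2 + [True] * 50
model_b = [False] * 10 + [True] * 2 + [True] * 50
chi2, p = mcnemar(model_a, model_b)
```

With 10 discordant items favouring model A against 2 favouring model B, the test rejects the null of equal error rates at the conventional 0.05 level, which is the kind of per-benchmark comparison the paper reports between gpt-oss-20B and gpt-oss-120B.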
Related papers
- Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks [4.739000717606982]
llama-embed-nemotron-8b is an open-weights text embedding model. It achieves state-of-the-art performance on the Multilingual Massive Text Embedding Benchmark.
arXiv Detail & Related papers (2025-11-10T12:13:16Z)
- LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence [61.46575527504109]
LimiX-16M and LimiX-2M treat structured data as a joint distribution over variables and missingness. We evaluate LimiX models across 11 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratio.
arXiv Detail & Related papers (2025-09-03T17:39:08Z)
- gpt-oss-120b & gpt-oss-20b Model Card [128.07668730833967]
We present two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-experts transformer architecture and are trained using large-scale distillation and reinforcement learning.
arXiv Detail & Related papers (2025-08-08T19:24:38Z)
- Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks [0.0]
Small or compact models, though more efficient, often lack sufficient support for underrepresented languages. This work explores the potential of parameter-efficient fine-tuning of compact open-weight language models to handle reasoning-intensive tasks. A tuning method with joint task-topic and step-by-step solution generation outperforms standard chain-of-thought tuning on matching tasks.
arXiv Detail & Related papers (2025-03-18T07:44:49Z)
- S*: Test Time Scaling for Code Generation [55.11863577956177]
We propose S*, the first hybrid test-time scaling framework for code generation. S* substantially improves the coverage and selection accuracy of generated code.
arXiv Detail & Related papers (2025-02-20T09:18:53Z)
- Biases in Edge Language Models: Detection, Analysis, and Mitigation [0.2184775414778289]
This paper conducts a comparative analysis of text-based bias across language model deployments in edge, cloud, and desktop environments. We show that Llama-2 running on a Raspberry Pi 4 is 43.23% and 21.89% more prone to showing bias over time than models running in desktop and cloud-based environments.
arXiv Detail & Related papers (2025-02-17T01:57:31Z)
- Training Sparse Mixture Of Experts Text Embedding Models [0.0]
Transformer-based text embedding models have improved their performance on benchmarks like MIRACL and BEIR by increasing their parameter counts. This scaling approach introduces significant deployment challenges, including increased inference latency and memory usage. We introduce Nomic Embed v2, the first general-purpose MoE text embedding model.
arXiv Detail & Related papers (2025-02-11T21:36:31Z)
- Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models [121.0693322732454]
This paper proposes CraFT, an approach for fine-tuning black-box vision-language models on downstream tasks.
CraFT comprises two modules, a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in residual style.
Experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT.
arXiv Detail & Related papers (2024-02-06T14:53:19Z) - Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement? [0.0]
Large language models (LLMs) are advanced artificial intelligence (AI) systems that can perform a variety of tasks commonly found in human intelligence tests.
We investigated whether test scores may also exhibit positive intercorrelations.
We found strong empirical evidence for a positive manifold and a general factor of ability.
arXiv Detail & Related papers (2023-10-17T22:42:12Z) - How Far Can Camels Go? Exploring the State of Instruction Tuning on Open
Resources [117.6496550359768]
This work explores recent advances in instruction-tuning language models on a range of open instruction-following datasets.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets.
We evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities.
arXiv Detail & Related papers (2023-06-07T19:59:23Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate across the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - Bayesian Prompt Learning for Image-Language Model Generalization [64.50204877434878]
We use the regularization ability of Bayesian methods to frame prompt learning as a variational inference problem.
Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts.
We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space.
arXiv Detail & Related papers (2022-10-05T17:05:56Z) - DetIE: Multilingual Open Information Extraction Inspired by Object
Detection [10.269858179091111]
We present a novel single-pass method for OpenIE inspired by object detection algorithms from computer vision.
We show a 15% performance improvement on multilingual Re-OIE2016, reaching 75% F1 for both the Portuguese and Spanish languages.
arXiv Detail & Related papers (2022-06-24T23:47:00Z) - DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language
Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.