Are Smaller Open-Weight LLMs Closing the Gap to Proprietary Models for Biomedical Question Answering?
- URL: http://arxiv.org/abs/2509.18843v1
- Date: Tue, 23 Sep 2025 09:27:57 GMT
- Title: Are Smaller Open-Weight LLMs Closing the Gap to Proprietary Models for Biomedical Question Answering?
- Authors: Damian Stachura, Joanna Konieczna, Artur Nowak
- Abstract summary: Open-weight versions of large language models (LLMs) are rapidly advancing, with state-of-the-art models like DeepSeek-V3 now performing comparably to proprietary LLMs. This progression raises the question of whether small open-weight LLMs are capable of effectively replacing larger closed-source models. In this work, we compare several open-weight models against top-performing systems such as GPT-4o, GPT-4.1, Claude 3.5 Sonnet, and Claude 3.7 Sonnet.
- Score: 0.5692553719616764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-weight versions of large language models (LLMs) are rapidly advancing, with state-of-the-art models like DeepSeek-V3 now performing comparably to proprietary LLMs. This progression raises the question of whether small open-weight LLMs are capable of effectively replacing larger closed-source models. We are particularly interested in the context of biomedical question-answering, a domain we explored by participating in Task 13B Phase B of the BioASQ challenge. In this work, we compare several open-weight models against top-performing systems such as GPT-4o, GPT-4.1, Claude 3.5 Sonnet, and Claude 3.7 Sonnet. To enhance question answering capabilities, we use various techniques including retrieving the most relevant snippets based on embedding distance, in-context learning, and structured outputs. For certain submissions, we utilize ensemble approaches to leverage the diverse outputs generated by different models for exact-answer questions. Our results demonstrate that open-weight LLMs are comparable to proprietary ones. In some instances, open-weight LLMs even surpassed their closed counterparts, particularly when ensembling strategies were applied. All code is publicly available at https://github.com/evidenceprime/BioASQ-13b.
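Two of the techniques named in the abstract, selecting snippets by embedding distance and ensembling exact answers from several models, can be illustrated with a minimal sketch. This is not the authors' implementation (their code is in the repository linked above). The function names, the choice of cosine distance, and the majority-vote rule are all assumptions made for illustration, and the vectors are toy values rather than real embeddings.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k_snippets(question_vec, snippets, k=2):
    """Return the texts of the k snippets closest to the question.

    `snippets` is a list of (text, embedding_vector) pairs; in practice
    the vectors would come from an embedding model, which is out of
    scope for this sketch.
    """
    ranked = sorted(snippets, key=lambda s: cosine_distance(question_vec, s[1]))
    return [text for text, _ in ranked[:k]]


def majority_vote(answers):
    """Pick the most frequent answer after simple normalization.

    A hypothetical ensembling rule for exact-answer questions: each
    model contributes one answer string and the most common one wins.
    """
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]


if __name__ == "__main__":
    toy_snippets = [
        ("snippet A", [1.0, 0.0]),
        ("snippet B", [0.0, 1.0]),
        ("snippet C", [0.9, 0.1]),
    ]
    print(top_k_snippets([1.0, 0.0], toy_snippets, k=2))
    print(majority_vote(["Yes", "yes ", "no"]))
```

The selected snippets would then be placed in the prompt alongside a few in-context examples. A real system would also need a tie-breaking rule for the vote, which this sketch leaves to the insertion order of the answers.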
Related papers
- Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems [55.6590601898194]
Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. We propose a principled, novel and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score.
arXiv Detail & Related papers (2025-09-30T01:25:19Z)
- How Accurate Are LLMs at Multi-Question Answering on Conversational Transcripts? [5.0683148330498335]
Large Language Models (LLMs) can answer multiple questions based on the same conversational context. We conduct extensive experiments and benchmark a range of both proprietary and public models on this challenging task. Our findings highlight that while strong proprietary LLMs like GPT-4o achieve the best overall performance, fine-tuned public LLMs with up to 8 billion parameters can surpass GPT-4o in accuracy.
arXiv Detail & Related papers (2025-09-26T00:58:01Z)
- Can large language models assist choice modelling? Insights into prompting strategies and current models capabilities [0.0]
Large Language Models (LLMs) are widely used to support various disciplines, yet their potential in choice modelling remains relatively unexplored. This work examines the potential of LLMs as assistive agents in the specification and, where technically feasible, estimation of Multinomial Logit models.
arXiv Detail & Related papers (2025-07-29T13:24:44Z)
- Open-Source LLMs Collaboration Beats Closed-Source LLMs: A Scalable Multi-Agent System [51.04535721779685]
This paper aims to demonstrate the potential and strengths of open-source collectives. We propose SMACS, a scalable multi-agent collaboration system (MACS) framework with high performance. Experiments on eight mainstream benchmarks validate the effectiveness of our SMACS.
arXiv Detail & Related papers (2025-07-14T16:17:11Z)
- Learning Together to Perform Better: Teaching Small-Scale LLMs to Collaborate via Preferential Rationale Tuning [20.784944581469205]
COLLATE is a framework that tunes a (small) LLM to generate outputs from a pool of diverse rationales that selectively improves the downstream task. We show the efficacy of COLLATE on LLMs from different model families across varying parameter scales (1B to 8B) and demonstrate the benefit of multiple rationale providers guided by the end task through ablations.
arXiv Detail & Related papers (2025-06-03T06:50:08Z)
- Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
- MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series [86.31735321970481]
We open-source MAP-Neo, a bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens.
MAP-Neo is the first fully open-sourced bilingual LLM with performance comparable to existing state-of-the-art LLMs.
arXiv Detail & Related papers (2024-05-29T17:57:16Z)
- Logits of API-Protected LLMs Leak Proprietary Information [46.014638838911566]
Large language model (LLM) providers often hide the architectural details and parameters of their proprietary models by restricting public access to a limited API.
We show that it is possible to learn a surprisingly large amount of non-public information about an API-protected LLM from a relatively small number of API queries.
arXiv Detail & Related papers (2024-03-14T16:27:49Z)
- Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model.
We employ a proxy model which has far fewer parameters, and take its answers as heuristic answers.
Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
arXiv Detail & Related papers (2024-02-19T11:11:08Z)