WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications
- URL: http://arxiv.org/abs/2505.14354v1
- Date: Tue, 20 May 2025 13:38:10 GMT
- Title: WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications
- Authors: Xin Li, Mengbing Liu, Li Wei, Jiancheng An, Mérouane Debbah, Chau Yuen
- Abstract summary: We introduce WirelessMathBench, a novel benchmark designed to evaluate Large Language Models (LLMs) on mathematical modeling challenges in wireless communications. Our benchmark consists of 587 meticulously curated questions sourced from 40 state-of-the-art research papers. Even DeepSeek-R1, the best performer on our benchmark, achieves an average accuracy of only 38.05%, with a mere 7.83% success rate in full equation completion.
- Score: 39.029769739081495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have achieved impressive results across a broad array of tasks, yet their capacity for complex, domain-specific mathematical reasoning, particularly in wireless communications, remains underexplored. In this work, we introduce WirelessMathBench, a novel benchmark specifically designed to evaluate LLMs on mathematical modeling challenges in wireless communications engineering. Our benchmark consists of 587 meticulously curated questions sourced from 40 state-of-the-art research papers, encompassing a diverse spectrum of tasks ranging from basic multiple-choice questions to complex equation completion, including both partial and full completions, all of which rigorously adhere to physical and dimensional constraints. Through extensive experimentation with leading LLMs, we observe that while many models excel in basic recall tasks, their performance degrades significantly when reconstructing partially or fully obscured equations, exposing fundamental limitations in current LLMs. Even DeepSeek-R1, the best performer on our benchmark, achieves an average accuracy of only 38.05%, with a mere 7.83% success rate in full equation completion. By publicly releasing WirelessMathBench along with its evaluation toolkit, we aim to advance the development of more robust, domain-aware LLMs for wireless system analysis and broader engineering applications.
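The paper's released grader is not reproduced here, but the equation-completion tasks suggest a natural scoring scheme. Below is a minimal, hypothetical Python sketch of how a masked-term completion might be checked: it compares the model's LaTeX fill-in against the reference after crude normalization. The function names and normalization rules are assumptions, and a production grader would add symbolic-equivalence and dimensional-consistency checks.

```python
import re

def normalize_latex(expr: str) -> str:
    r"""Canonicalize a LaTeX snippet for string-level comparison.

    Hypothetical grader sketch, not the paper's released toolkit: it drops
    whitespace and \left/\right sizing and flattens braces. Symbol case is
    preserved because it is meaningful in wireless notation (H vs h).
    """
    expr = expr.replace(r"\left", "").replace(r"\right", "")
    expr = re.sub(r"\s+", "", expr)                 # drop all whitespace
    return expr.replace("{", "").replace("}", "")   # crude brace flattening

def score_completion(reference: str, candidate: str) -> bool:
    """Exact match after normalization; a real grader would also test
    symbolic equivalence and dimensional consistency."""
    return normalize_latex(reference) == normalize_latex(candidate)

# Example: the right-hand side of a MIMO received-signal model y = Hx + n.
reference = r"\mathbf{H}\mathbf{x} + \mathbf{n}"
candidate = r"\mathbf{H} \mathbf{x} + \mathbf{n}"
assert score_completion(reference, candidate)
```

Full-equation completion would be scored the same way over the entire right-hand side, which is one reason it is so much harder than filling a single masked term.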
Related papers
- TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving [8.461584378073637]
We introduce TeleMath, the first benchmark dataset specifically designed to evaluate the performance of Large Language Models (LLMs) in solving mathematical problems. This paper outlines the proposed QnA generation pipeline, starting from a selected seed of problems crafted by Subject Matter Experts. The evaluation reveals that the best performance on TeleMath is achieved by recent models explicitly designed for mathematical or logical reasoning.
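The summary does not spell out the generation pipeline, but a seed-and-expand design can be sketched as below: perturb the numeric parameters of an expert-written template and recompute the ground-truth answer. The template wording, schema, and parameter grids are purely illustrative, not TeleMath's actual format.

```python
import math
import random

def capacity_question(snr_db: float, bandwidth_hz: float) -> dict:
    """Toy seed template with a numeric ground truth (Shannon capacity).
    Template wording and output schema are illustrative, not TeleMath's."""
    snr_linear = 10 ** (snr_db / 10)
    capacity_bps = bandwidth_hz * math.log2(1 + snr_linear)
    return {
        "question": (f"A channel has a bandwidth of {bandwidth_hz / 1e6:.0f} MHz "
                     f"and an SNR of {snr_db:.0f} dB. What is its Shannon "
                     f"capacity in Mbit/s?"),
        "answer_mbps": round(capacity_bps / 1e6, 2),
    }

# Expand one expert-crafted seed into many numeric variants.
random.seed(0)
variants = [capacity_question(random.choice([0.0, 10.0, 20.0, 30.0]),
                              random.choice([1e6, 5e6, 20e6]))
            for _ in range(5)]
print(variants[0])
```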
arXiv Detail & Related papers (2025-06-12T13:04:18Z)
- Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing [19.577278316436807]
Large Language Models (LLMs) are limited by the context window size. We propose a novel method that leverages the LLM's own attention information to enable accurate retrieval. InfiniRetri achieves 100% accuracy in the Needle-In-a-Haystack (NIH) test over 1M tokens using a 0.5B-parameter model.
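InfiniRetri's exact algorithm is not detailed in this summary; the sketch below only illustrates the general idea of attention-guided retrieval on synthetic data: score each context chunk by the attention mass the query places on it and keep the top-k chunks inside the window. All names and the scoring rule are assumptions.

```python
import numpy as np

def select_chunks_by_attention(attn: np.ndarray, chunk_bounds, k: int = 2):
    """Keep the k context chunks receiving the most attention mass.

    attn: (num_query_tokens, num_context_tokens) attention weights, which a
    real system would read from the model's own heads (synthetic here).
    chunk_bounds: [(start, end), ...] token spans for each chunk.
    """
    scores = [attn[:, s:e].sum() for s, e in chunk_bounds]
    top = np.argsort(scores)[::-1][:k]
    return sorted(top.tolist())   # retained chunk indices, in document order

rng = np.random.default_rng(0)
attn = rng.random((4, 12))        # stand-in for real attention maps
chunks = [(0, 4), (4, 8), (8, 12)]
print(select_chunks_by_attention(attn, chunks, k=2))
```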
arXiv Detail & Related papers (2025-02-18T15:45:36Z)
- Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework [81.29965270493238]
We develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) for wireless communication applications. The dataset includes a diverse set of multi-hop questions, in both true/false and multiple-choice formats, spanning difficulty levels from easy to hard. We introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data.
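Pointwise V-information follows Ethayarajh et al. (2022): PVI(x -> y) = -log2 g'(y | null) + log2 g(y | x), where g' is trained without inputs and g with them, so high-PVI examples are those whose input genuinely informs the answer. How the paper folds this into fine-tuning is not stated here; the sketch below merely ranks hypothetical training examples by PVI given per-example log-probabilities from the two models.

```python
import math

def pvi(logp_with_input: float, logp_without_input: float) -> float:
    """Pointwise V-information of one example:
    PVI(x -> y) = -log2 p_g'(y | null) + log2 p_g(y | x).
    Arguments are natural-log probabilities of the gold answer y under a
    model fine-tuned with inputs (g) and one without inputs (g')."""
    return (logp_with_input - logp_without_input) / math.log(2)

# Hypothetical per-example log-probs; high PVI = the input is informative.
examples = [("q1", -1.2, -4.8), ("q2", -3.0, -3.1), ("q3", -0.4, -6.5)]
ranked = sorted(examples, key=lambda e: pvi(e[1], e[2]), reverse=True)
print([(name, round(pvi(lp_x, lp_0), 2)) for name, lp_x, lp_0 in ranked])
```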
arXiv Detail & Related papers (2025-01-16T16:19:53Z)
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains: tool use, Directed Acyclic Graph (DAG) QA, data science and machine learning coding, contest-level programming, and mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains across 1,140 fine-grained tasks. Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores of up to 60%, significantly lower than the human performance of 97%. We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only the essential information.
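Benchmarks of this kind are typically scored by executing the generated code against held-out test cases. The bare-bones harness below illustrates that pattern only; BigCodeBench's actual harness runs candidates in an isolated sandbox with resource limits, which this sketch deliberately omits.

```python
def run_task(candidate_code: str, test_code: str) -> bool:
    """Execute a model's candidate solution, then its unit tests.

    Bare-bones illustration of test-based scoring only: no sandboxing,
    no timeouts, no resource limits, unlike a real evaluation harness.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function(s)
        exec(test_code, namespace)        # asserts raise on failure
        return True
    except Exception:
        return False

candidate = "def mean(xs):\n    return sum(xs) / len(xs)\n"
tests = "assert mean([1, 2, 3]) == 2\nassert mean([4]) == 4\n"
print(run_task(candidate, tests))   # True
```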
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
- MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions [58.57255822646756]
This paper introduces MathChat, a benchmark designed to evaluate large language models (LLMs) across a broader spectrum of mathematical tasks.
We evaluate the performance of various SOTA LLMs on the MathChat benchmark, and we observe that while these models excel in single-turn question answering, they significantly underperform in more complex scenarios.
We develop MathChat sync, a synthetic dialogue-based math dataset for LLM finetuning, focusing on improving models' interaction and instruction-following capabilities in conversations.
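A multi-turn benchmark differs from single-turn QA mainly in that each follow-up question is answered against the accumulated dialogue history. The sketch below shows that evaluation loop with a stub in place of a real model call; the message schema and exact-match grading are assumptions, not MathChat's protocol.

```python
def stub_model(history):
    """Placeholder for an LLM call; answers based on the turn count."""
    return f"answer-{sum(1 for m in history if m['role'] == 'user')}"

def evaluate_dialogue(turns, model=stub_model):
    """Ask each follow-up with the accumulated history, as a multi-turn
    benchmark would; grading here is plain exact match per turn."""
    history, correct = [], 0
    for question, gold in turns:
        history.append({"role": "user", "content": question})
        reply = model(history)
        history.append({"role": "assistant", "content": reply})
        correct += int(reply == gold)
    return correct / len(turns)

turns = [("What is 2 + 2?", "answer-1"), ("Now double it.", "answer-2")]
print(evaluate_dialogue(turns))   # 1.0 with the stub model
```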
arXiv Detail & Related papers (2024-05-29T18:45:55Z)
- WDMoE: Wireless Distributed Large Language Models with Mixture of Experts [65.57581050707738]
We propose a wireless distributed paradigm for Large Language Models (LLMs) based on Mixture of Experts (MoE).
We decompose the MoE layer in LLMs by deploying the gating network and the preceding neural network layer at the base station (BS), while distributing the expert networks across mobile devices.
We design an expert selection policy that takes into account both model performance and end-to-end latency.
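The paper's exact selection policy is not given in this summary; the toy sketch below captures the stated trade-off by ranking experts on a linear utility of gating probability minus a latency penalty. The utility form and the lambda weight are illustrative choices, not WDMoE's actual policy.

```python
import numpy as np

def select_experts(gate_logits, device_latency_ms, k=2, lam=0.05):
    """Pick k experts by a utility trading gating confidence against the
    latency of the device hosting each expert. The linear utility and
    lambda weight are illustrative, not WDMoE's exact policy."""
    gate_prob = np.exp(gate_logits) / np.exp(gate_logits).sum()   # softmax
    utility = gate_prob - lam * np.asarray(device_latency_ms) / 100.0
    return np.argsort(utility)[::-1][:k].tolist()

gate_logits = np.array([2.0, 1.5, 0.3, 1.8])    # router scores per expert
latency_ms = [20.0, 120.0, 15.0, 60.0]          # wireless link latency
print(select_experts(gate_logits, latency_ms))  # favors fast, confident experts
```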
arXiv Detail & Related papers (2024-05-06T02:55:50Z)
- MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting [0.6675160100853794]
We curated a novel dataset, MM-PhyQA, which comprises well-constructed, high school-level multimodal physics problems.
For generating answers to questions with multimodal input, we employed zero-shot prediction using GPT-4 and utilized LLaVA (LLaVA and LLaVA-1.5), the latter fine-tuned on our dataset.
For evaluating performance on text-only input, we tested the base and fine-tuned versions of the Mistral-7B and LLaMA2-7B models.
arXiv Detail & Related papers (2024-04-11T07:11:47Z)