6G-Bench: An Open Benchmark for Semantic Communication and Network-Level Reasoning with Foundation Models in AI-Native 6G Networks
- URL: http://arxiv.org/abs/2602.08675v1
- Date: Mon, 09 Feb 2026 13:57:37 GMT
- Title: 6G-Bench: An Open Benchmark for Semantic Communication and Network-Level Reasoning with Foundation Models in AI-Native 6G Networks
- Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah
- Abstract summary: 6G-Bench is an open benchmark for evaluating semantic communication and network-level reasoning in AI-native 6G networks. We generate a balanced pool of 10,000 very-hard multiple-choice questions using task-conditioned prompts. We evaluate 22 foundation models spanning dense and mixture-of-experts architectures, short-context and long-context designs.
- Score: 3.099103925863002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces 6G-Bench, an open benchmark for evaluating semantic communication and network-level reasoning in AI-native 6G networks. 6G-Bench defines a taxonomy of 30 decision-making tasks (T1--T30) extracted from ongoing 6G and AI-agent standardization activities in 3GPP, IETF, ETSI, ITU-T, and the O-RAN Alliance, and organizes them into five standardization-aligned capability categories. Starting from 113,475 scenarios, we generate a balanced pool of 10,000 very-hard multiple-choice questions using task-conditioned prompts that enforce multi-step quantitative reasoning under uncertainty and worst-case regret minimization over multi-turn horizons. After automated filtering and expert human validation, 3,722 questions are retained as a high-confidence evaluation set, while the full pool is released to support training and fine-tuning of 6G-specialized models. Using 6G-Bench, we evaluate 22 foundation models spanning dense and mixture-of-experts architectures, short- and long-context designs (up to 1M tokens), and both open-weight and proprietary systems. Across models, deterministic single-shot accuracy (pass@1) spans a wide range from 0.22 to 0.82, highlighting substantial variation in semantic reasoning capability. Leading models achieve intent and policy reasoning accuracy in the range 0.87--0.89, while selective robustness analysis on reasoning-intensive tasks shows pass@5 values ranging from 0.20 to 0.91. To support open science and reproducibility, we release the 6G-Bench dataset on GitHub: https://github.com/maferrag/6G-Bench
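The abstract reports deterministic pass@1 alongside pass@5 for robustness analysis. As a point of reference for these metrics, here is a minimal sketch of the standard unbiased pass@k estimator (the definition commonly used in LLM benchmarking); whether 6G-Bench computes pass@5 exactly this way is an assumption, not confirmed by the abstract.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples drawn (without replacement) from n attempts, of
    which c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so every size-k
        # draw must contain at least one correct answer.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = k = 1, pass@1 reduces to plain single-shot accuracy
# averaged over questions, matching the deterministic setting
# described in the abstract.
```

For example, a model that answers 1 of 5 sampled attempts correctly on a question scores pass_at_k(5, 1, 1) = 0.2 but pass_at_k(5, 1, 5) = 1.0, which is why pass@5 on reasoning-intensive tasks can sit well above pass@1.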
Related papers
- How Small Can 6G Reason? Scaling Tiny Language Models for AI-Native Networks [3.099103925863002]
We study the scaling behavior and deployment efficiency of compact language models for network-level semantic reasoning in AI-native 6G systems. We evaluate models ranging from 135M (SmolLM2-135M) to 7B parameters (Qwen2.5-7B), including mid-scale architectures such as Llama-3.2-1B, Granite-1B, and Qwen2.5-3B.
arXiv Detail & Related papers (2026-03-02T18:19:49Z) - UI-Venus-1.5 Technical Report [64.4832043785725]
We present UI-Venus-1.5, a unified, end-to-end GUI Agent. The proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B). In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps.
arXiv Detail & Related papers (2026-02-09T18:43:40Z) - Efficient Multi-Model Orchestration for Self-Hosted Large Language Models [2.3275796286410677]
Pick and Spin is a framework that makes self-hosted orchestration economical. It integrates a unified Helm-based deployment system, adaptive scale-to-zero automation, and a hybrid routing module. It achieves up to 21.6% higher success rates, 30% lower latency, and 33% lower cost per query compared with static deployments of the same models.
arXiv Detail & Related papers (2025-12-26T22:42:40Z) - MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes [60.57770396565211]
We show that strong reasoning abilities can emerge with far less data. MobileLLM-R50M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B.
arXiv Detail & Related papers (2025-09-29T15:43:59Z) - OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z) - Generative AI Enabled Matching for 6G Multiple Access [51.00960374545361]
We propose a GenAI-enabled matching generation framework to support 6G multiple access.
We show that our framework can generate more effective matching strategies based on given conditions and predefined rewards.
arXiv Detail & Related papers (2024-10-29T13:01:26Z) - Uncovering Weaknesses in Neural Code Generation [21.552898575210534]
We assess the quality of generated code using match-based and execution-based metrics, then conduct thematic analysis to develop a taxonomy of nine types of weaknesses.
In the CoNaLa dataset, inaccurate prompts are a notable problem, causing all large models to fail in 26.84% of cases.
Missing pivotal semantics is a pervasive issue across benchmarks, with one or more large models omitting key semantics in 65.78% of CoNaLa tasks.
All models struggle with proper API usage, a challenge amplified by vague or complex prompts.
arXiv Detail & Related papers (2024-07-13T07:31:43Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - Toward 6G Native-AI Network: Foundation Model based Cloud-Edge-End Collaboration Framework [55.73948386625618]
We analyze the challenges of achieving 6G native AI from the perspectives of data, AI models, and operational paradigm. We propose a 6G native AI framework based on foundation models, provide an integration method for expert knowledge, present the customization for two kinds of PFM, and outline a novel operational paradigm for the native AI framework.
arXiv Detail & Related papers (2023-10-26T15:19:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.