SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?
- URL: http://arxiv.org/abs/2503.06029v2
- Date: Tue, 26 Aug 2025 14:34:00 GMT
- Title: SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?
- Authors: Xudong Lu, Haohao Gao, Renshou Wu, Shuai Ren, Xiaoxin Chen, Hongsheng Li, Fangyuan Li
- Abstract summary: We introduce SmartBench, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. For each task, we construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization.
- Score: 34.225988628142225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have become integral to daily life, especially advancing as intelligent assistants through on-device deployment on smartphones. However, existing LLM evaluation benchmarks predominantly focus on objective tasks like mathematics and coding in English, which do not necessarily reflect the practical use cases of on-device LLMs in real-world mobile scenarios, especially for Chinese users. To address these gaps, we introduce SmartBench, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. We analyze functionalities provided by representative smartphone manufacturers and divide them into five categories: text summarization, text Q&A, information extraction, content creation, and notification management, further detailed into 20 specific tasks. For each task, we construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions, and we develop automated evaluation criteria tailored for these tasks. We conduct comprehensive evaluations of on-device LLMs and MLLMs using SmartBench and also assess their performance after quantized deployment on real smartphone NPUs. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization in this critical area. Code and data will be available at https://github.com/vivo-ai-lab/SmartBench.
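As a rough illustration of the evaluation setup the abstract describes (five capability categories, 20 fine-grained tasks, 50 to 200 question-answer pairs per task, automated per-task scoring), the Python sketch below shows one way such a benchmark entry could be represented and scored. The schema, the example task, and the keyword-overlap scorer are illustrative assumptions, not SmartBench's actual format or evaluation criteria.
```python
# Minimal sketch of a SmartBench-style task record and scoring loop.
# Schema, task names, and the keyword-overlap scorer are illustrative
# assumptions, not the paper's actual format or criteria.
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str   # user-style prompt, e.g. a notification to summarize
    reference: str  # gold answer consumed by the automated criterion

@dataclass
class Task:
    category: str   # one of the five capability categories
    name: str       # one of the 20 fine-grained tasks
    pairs: list     # 50-200 QAPair items per task in the paper

def keyword_overlap(prediction: str, reference: str) -> float:
    """Toy automated criterion: fraction of reference tokens that appear
    in the prediction; stands in for the paper's task-tailored metrics."""
    ref_tokens = set(reference.split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & set(prediction.split())) / len(ref_tokens)

def evaluate(task: Task, model) -> float:
    """Average the automated score over all QA pairs of one task."""
    scores = [keyword_overlap(model(p.question), p.reference) for p in task.pairs]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    demo = Task(
        category="text summarization",
        name="notification summary",
        pairs=[QAPair("Meeting moved to 3pm, room 204.", "meeting 3pm room 204")],
    )
    echo_model = lambda q: q.lower()  # placeholder for an on-device LLM call
    print(f"{demo.name}: {evaluate(demo, echo_model):.2f}")
```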
Related papers
- DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering [7.264718073839472]
Large Language Model (LLM) agents have shown great potential for solving real-world problems and promise to be a solution for task automation in industry. We propose DrafterBench for the comprehensive evaluation of LLM agents in the context of technical drawing revision. DrafterBench is an open-source benchmark that rigorously tests AI agents' proficiency in interpreting intricate and long-context instructions.
arXiv Detail & Related papers (2025-07-15T17:56:04Z) - UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs [1.4304078520604593]
Theory of Mind (ToM) remains a challenging area for large language models (LLMs). In this paper, we introduce UniToMBench, a unified benchmark that integrates the strengths of SimToM and TOMBENCH.
arXiv Detail & Related papers (2025-06-11T06:55:40Z) - Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant [16.006675944380078]
Auto-SLURP is a benchmark dataset aimed at evaluating LLM-based multi-agent frameworks in the context of intelligent personal assistants.
Auto-SLURP extends the original SLURP dataset by relabeling the data and integrating simulated servers and external services.
Our experiments demonstrate that Auto-SLURP presents a significant challenge for current state-of-the-art frameworks.
arXiv Detail & Related papers (2025-04-25T14:17:47Z) - Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark [45.28023118459497]
We introduce Mobile-MMLU, a large-scale benchmark dataset tailored for mobile intelligence.
It consists of 16,186 questions across 80 mobile-related fields, designed to evaluate LLM performance in realistic mobile scenarios.
A challenging subset, Mobile-MMLU-Pro, provides advanced evaluation similar in size to MMLU-Pro but significantly more difficult than our standard full set.
arXiv Detail & Related papers (2025-03-26T17:59:56Z) - Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM [58.42678619252968]
Creation-MMBench is a benchmark designed to evaluate the creative capabilities of Multimodal Large Language Models.
The benchmark comprises 765 test cases spanning 51 fine-grained tasks.
Experimental results reveal that open-source MLLMs significantly underperform compared to proprietary models in creative tasks.
arXiv Detail & Related papers (2025-03-18T17:51:34Z) - LLMs in Mobile Apps: Practices, Challenges, and Opportunities [4.104646810514711]
The integration of AI techniques has become increasingly popular in software development. With the rise of large language models (LLMs) and generative AI, developers now have access to a wealth of high-quality open-source models and APIs from closed-source providers.
arXiv Detail & Related papers (2025-02-21T19:53:43Z) - EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [63.43699771428243]
EmbodiedBench is an extensive benchmark designed to evaluate vision-driven embodied agents. We evaluated 19 leading proprietary and open-source MLLMs within EmbodiedBench. MLLMs excel at high-level tasks but struggle with low-level manipulation.
arXiv Detail & Related papers (2025-02-13T18:11:34Z) - SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation [89.24729958546168]
We present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents. SPA-Bench offers three key contributions: (1) a diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines; (2) a plug-and-play framework enabling real-time agent interaction with Android devices; and (3) a novel evaluation pipeline that automatically assesses agent performance across multiple dimensions.
arXiv Detail & Related papers (2024-10-19T17:28:48Z) - Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation [10.817783356090027]
Large language models (LLMs) increasingly integrate into every aspect of our work and daily lives.
There are growing concerns about user privacy, which push the trend toward local deployment of these models.
As locally deployed LLMs are a rapidly emerging application, we are concerned about their performance on commercial off-the-shelf mobile devices.
arXiv Detail & Related papers (2024-10-04T17:14:59Z) - BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks.
Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%.
We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only the essential information.
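A minimal sketch of the execution-based scoring such code-generation benchmarks rely on: run each candidate solution together with its tests in a fresh subprocess and report the fraction of tasks whose tests pass. The toy tasks and harness details below are assumptions for illustration, not BigCodeBench's actual evaluation harness.
```python
# Toy execution-based scoring: a task passes if solution + tests exit cleanly.
# Task contents and harness details are illustrative, not BigCodeBench's own.
import subprocess
import sys
import tempfile

def passes(solution: str, test: str, timeout: int = 30) -> bool:
    """Run candidate solution plus tests in a fresh interpreter; pass == exit 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + test + "\n")
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                timeout=timeout, capture_output=True)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Stand-ins for model-generated solutions paired with benchmark tests.
tasks = [
    {"solution": "def add(a, b):\n    return a + b",
     "test": "assert add(2, 3) == 5"},
    {"solution": "def add(a, b):\n    return a - b",  # a failing candidate
     "test": "assert add(2, 3) == 5"},
]
rate = sum(passes(t["solution"], t["test"]) for t in tasks) / len(tasks)
print(f"pass rate: {rate:.0%}")  # -> pass rate: 50%
```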
arXiv Detail & Related papers (2024-06-22T15:52:04Z) - MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases [81.70591346986582]
We introduce MobileAIBench, a benchmarking framework for evaluating Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices.
MobileAIBench assesses models across different sizes, quantization levels, and tasks, measuring latency and resource consumption on real devices.
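As a rough sketch of the latency side of such on-device measurement: time a generation call and derive tokens per second. The `generate` callable and the whitespace token count below are placeholders, not MobileAIBench's actual instrumentation, which runs on real devices and also tracks resource consumption.
```python
# Crude per-token latency measurement; `generate` stands in for a model call.
import time

def measure_latency(generate, prompt: str) -> dict:
    """Time one generation call and derive a rough tokens-per-second figure."""
    start = time.perf_counter()
    output = generate(prompt)
    elapsed = time.perf_counter() - start
    n_tokens = len(output.split())  # whitespace tokens as a rough proxy
    return {
        "total_s": elapsed,
        "tokens": n_tokens,
        "tokens_per_s": n_tokens / elapsed if elapsed > 0 else float("inf"),
    }

if __name__ == "__main__":
    fake_generate = lambda p: "a short simulated model response"
    print(measure_latency(fake_generate, "Summarize my notifications."))
```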
arXiv Detail & Related papers (2024-06-12T22:58:12Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language
Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking them at the sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models [26.590755599827993]
We present OpsEval, a comprehensive task-oriented Ops benchmark designed for large language models (LLMs).
The benchmark includes 7184 multi-choice questions and 1736 question-answering (QA) formats in English and Chinese.
To ensure the credibility of our evaluation, we invite dozens of domain experts to manually review our questions.
arXiv Detail & Related papers (2023-10-11T16:33:29Z) - AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)