Related papers: DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain

DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain

URL: http://arxiv.org/abs/2504.16116v2
Date: Fri, 16 May 2025 12:00:59 GMT
Title: DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain
Authors: Enhao Huang, Pengyu Sun, Zixin Lin, Alex Chen, Joey Ouyang, Hobert Wang, Dong Dong, Gang Zhao, James Yi, Frank Li, Ziang Ling, Lowes Yang,
Abstract summary: Large Language Models (LLMs) have achieved impressive performance in diverse natural language processing tasks.<n>We introduce the DMind Benchmark, a holistic Web3-oriented evaluation suite covering nine critical subfields.<n>We evaluate 26 models, including ChatGPT, Claude, DeepSeek, Gemini, Grok, and Qwen.
Score: 6.275468311396066
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large Language Models (LLMs) have achieved impressive performance in diverse natural language processing tasks, but specialized domains such as Web3 present new challenges and require more tailored evaluation. Despite the significant user base and capital flows in Web3, encompassing smart contracts, decentralized finance (DeFi), non-fungible tokens (NFTs), decentralized autonomous organizations (DAOs), on-chain governance, and novel token-economics, no comprehensive benchmark has systematically assessed LLM performance in this domain. To address this gap, we introduce the DMind Benchmark, a holistic Web3-oriented evaluation suite covering nine critical subfields: fundamental blockchain concepts, blockchain infrastructure, smart contract, DeFi mechanisms, DAOs, NFTs, token economics, meme concept, and security vulnerabilities. Beyond multiple-choice questions, DMind Benchmark features domain-specific tasks such as contract debugging and on-chain numeric reasoning, mirroring real-world scenarios. We evaluated 26 models, including ChatGPT, Claude, DeepSeek, Gemini, Grok, and Qwen, uncovering notable performance gaps in specialized areas like token economics and security-critical contract analysis. While some models excel in blockchain infrastructure tasks, advanced subfields remain challenging. Our benchmark dataset and evaluation pipeline are open-sourced on https://huggingface.co/datasets/DMindAI/DMind_Benchmark, reaching number one in Hugging Face's trending dataset charts within a week of release.

Related papers

FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning [43.74670894224625]
FinChain is the first symbolic benchmark for verifiable Chain-of- Thought (CoT) financial reasoning.<n>FinChain offers five parameterized templates per topic, each varying in reasoning complexity and domain expertise required.<n> Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement.
arXiv Detail & Related papers (2025-06-03T06:44:42Z)
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation [13.897645524385274]
BenchHub is a dynamic benchmark repository that empowers researchers and developers to evaluate large language models (LLMs) more effectively.<n>It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases.
arXiv Detail & Related papers (2025-05-31T09:24:32Z)
FinS-Pilot: A Benchmark for Online Financial System [17.65500174763836]
FinS-Pilot is a novel benchmark for evaluating large language models (RAGs) in online financial applications.<n>Our benchmark incorporates both real-time API data and structured text sources, organized through an intent classification framework.<n>Our work contributes both a practical evaluation framework and a curated dataset to advance research in financial NLP systems.
arXiv Detail & Related papers (2025-05-31T03:50:19Z)
A Survey on Post-training of Large Language Models [185.51013463503946]
Large Language Models (LLMs) have fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration.<n>These challenges necessitate advanced post-training language models (PoLMs) to address shortcomings, such as restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance.<n>This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms.
arXiv Detail & Related papers (2025-03-08T05:41:42Z)
AttackSeqBench: Benchmarking Large Language Models' Understanding of Sequential Patterns in Cyber Attacks [13.082370325093242]
We introduce AttackSeqBench, a benchmark to evaluate Large Language Models' (LLMs) capability to understand and reason attack sequences in Cyber Threat Intelligence (CTI) reports. Our benchmark encompasses three distinct Question Answering (QA) tasks, each task focuses on the varying granularity in adversarial behavior. We conduct extensive experiments and analysis with both fast-thinking and slow-thinking LLMs, while highlighting their strengths and limitations in analyzing the sequential patterns in cyber attacks.
arXiv Detail & Related papers (2025-03-05T04:25:21Z)
FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making.<n>FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z)
LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights [12.424610893030353]
Large Language Models (LLMs) are emerging as transformative tools for software vulnerability detection.<n>This paper provides a detailed survey of LLMs in vulnerability detection.<n>We address challenges such as cross-language vulnerability detection, multimodal data integration, and repository-level analysis.
arXiv Detail & Related papers (2025-02-10T21:33:38Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.<n>LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.<n>Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility.<n>This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.<n>We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z)
LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management [9.9661459222949]
We propose an explainable, multi-modal, multi-agent framework for cryptocurrency investment.<n>Our framework uses specialized agents that collaborate within and across teams to handle subtasks such as data analysis, literature integration, and investment decision-making.
arXiv Detail & Related papers (2025-01-01T13:08:17Z)
SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity [23.32017147274093]
SecBench is a benchmarking dataset designed to evaluate Large Language Models (LLMs) in the cybersecurity domain.<n>The dataset was constructed by collecting high-quality data from open sources and organizing a Cybersecurity Question Design Contest.<n> Benchmarking results on 16 SOTA LLMs demonstrate the usability of SecBench.
arXiv Detail & Related papers (2024-12-30T08:11:54Z)
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs) MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains. BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution. Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models [51.468732121824125]
Large language models have achieved remarkable success on general NLP tasks, but they may fall short for domain-specific problems. Existing evaluation tools only provide a few baselines and evaluate them on various domains without mining the depth of domain knowledge. In this paper, we address the challenges of evaluating RALLMs by introducing the R-Eval toolkit, a Python toolkit designed to streamline the evaluation of different RAGs.
arXiv Detail & Related papers (2024-06-17T15:59:49Z)
OpenPerf: A Benchmarking Framework for the Sustainable Development of the Open-Source Ecosystem [6.188178422139467]
OpenPerf is a benchmarking framework designed for the sustainable development of the open-source ecosystem. We implement 3 data science task benchmarks, 2 index-based benchmarks, and 1 standard benchmark. We have developed a comprehensive toolkit for OpenPerf, which offers robust data management, tool integration, and user interface capabilities.
arXiv Detail & Related papers (2023-11-26T07:01:36Z)
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark. Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs. We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We provide an open, online platform with multiple rounds of challenges to support this iterative development. The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
Benchmarking Graph Neural Networks [75.42159546060509]
Graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. For any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress. GitHub repository has reached 1,800 stars and 339 forks, which demonstrates the utility of the proposed open-source framework.
arXiv Detail & Related papers (2020-03-02T15:58:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.