CryptoX : Compositional Reasoning Evaluation of Large Language Models
- URL: http://arxiv.org/abs/2502.07813v2
- Date: Wed, 12 Mar 2025 13:17:27 GMT
- Title: CryptoX : Compositional Reasoning Evaluation of Large Language Models
- Authors: Jiajun Shi, Chaoren Wei, Liqun Yang, Zekun Moore Wang, Chenghao Yang, Ge Zhang, Stephen Huang, Tao Peng, Jian Yang, Zhoufutu Wen,
- Abstract summary: We introduce CryptoX, an evaluation framework that combines existing benchmarks with cryptographic principles. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench. We highlight the value of independently studying compositional reasoning and emphasize the need to enhance the compositional reasoning capabilities of LLMs.
- Score: 18.927129952741904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compositional reasoning capacity has long been regarded as critical to the generalization and emergence of intelligence in large language models (LLMs). However, despite numerous reasoning-related benchmarks, the compositional reasoning capacity of LLMs is rarely studied or quantified in existing benchmarks. In this paper, we introduce CryptoX, an evaluation framework that, for the first time, combines existing benchmarks with cryptographic principles to quantify the compositional reasoning capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a substantial gap between open-source and closed-source LLMs. We further conduct thorough mechanistic interpretability experiments to reveal the inner mechanism of LLMs' compositional reasoning, involving subproblem decomposition, subproblem inference, and the summarization of subproblem conclusions. Through analysis based on CryptoBench, we highlight the value of independently studying compositional reasoning and emphasize the need to enhance the compositional reasoning capabilities of LLMs.
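The core composition the abstract describes, encrypting an existing benchmark item so a model must first decrypt it and then solve it, can be made concrete with a minimal sketch. This is not the authors' code; the Caesar cipher, shift value, and prompt wording are illustrative assumptions.

```python
def caesar_encrypt(text: str, shift: int = 3) -> str:
    """Shift each letter by `shift` positions; leave other characters alone."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def make_composite_item(question: str, shift: int = 3) -> str:
    """Wrap a benchmark question so answering it requires two composed
    subproblems: decryption first, then the original task."""
    cipher = caesar_encrypt(question, shift)
    return (
        f"The following question is encrypted with a Caesar cipher "
        f"(shift {shift}). Decrypt it, then answer it.\n{cipher}"
    )

print(make_composite_item("What is 17 + 25?"))
```

Scoring the response against the original item's answer then measures whether the model can chain the decryption step with the underlying task, which is the compositional gap the benchmark targets.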
Related papers
- Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs). We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
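As context for the VQ variants the survey categorizes, a generic vector-quantization step (a sketch under standard VQ-VAE-style assumptions, not code from the survey) assigns each encoder output to its nearest codebook entry; tracking assignment counts makes codebook collapse visible:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 32))   # 256 discrete codes, 32-dim each
features = rng.normal(size=(500, 32))   # encoder outputs to tokenize

# Nearest-neighbour assignment by squared Euclidean distance.
dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
codes = dists.argmin(axis=1)            # the discrete token ids

# Codebook collapse shows up as most entries receiving no assignments.
usage = np.bincount(codes, minlength=len(codebook))
print("codebook entries in use:", int((usage > 0).sum()), "of", len(codebook))
```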
arXiv Detail & Related papers (2025-07-21T10:52:14Z) - KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation [78.96590724864606]
We introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium.<n>KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios.
arXiv Detail & Related papers (2025-05-20T16:06:32Z) - Turing Machine Evaluation for Large Language Model [23.17949876392197]
We develop TMBench, a benchmark for systematically studying the computational reasoning capabilities of Large Language Models (LLMs).
TMBench provides several key advantages, including knowledge-agnostic evaluation, adjustable difficulty, and unlimited capacity for instance generation.
We find that model performance on TMBench correlates strongly with performance on other recognized reasoning benchmarks.
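The claims of knowledge-agnostic evaluation and unlimited instance generation follow from the task shape: any transition table yields a fresh instance. A toy interpreter (an assumption about the task style, not TMBench's implementation) makes this concrete:

```python
def run_tm(transitions, tape, state="q0", head=0, max_steps=100):
    """Run a Turing machine given as {(state, symbol): (write, move, next)}
    and return the final tape contents; blank cells are "_"."""
    cells = dict(enumerate(tape))
    for _ in range(max_steps):
        symbol = cells.get(head, "_")
        if (state, symbol) not in transitions:
            break  # halt when no transition applies
        write, move, state = transitions[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    lo, hi = min(cells), max(cells)
    return "".join(cells.get(i, "_") for i in range(lo, hi + 1))

# A machine that flips 0s and 1s until it reaches a blank cell.
flip = {("q0", "0"): ("1", "R", "q0"), ("q0", "1"): ("0", "R", "q0")}
print(run_tm(flip, "0110"))  # -> 1001
```

Because correctness is fixed by the transition table alone, no world knowledge helps, and difficulty can be tuned by machine size and step budget.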
arXiv Detail & Related papers (2025-04-29T13:52:47Z) - InductionBench: LLMs Fail in the Simplest Complexity Class [53.70978746199222]
Large language models (LLMs) have shown remarkable improvements in reasoning.
Inductive reasoning, where one infers the underlying rules from observed data, remains less explored.
We introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs.
arXiv Detail & Related papers (2025-02-20T03:48:00Z) - Investigating the (De)Composition Capabilities of Large Language Models in Natural-to-Formal Language Conversion [21.68354181391989]
Large language models (LLMs) need strong capabilities of decomposition and composition for generalized and robust natural-to-formal language conversion (N2F). We propose the DEDC framework, which performs sample and task construction, allowing evaluation of the decomposition and composition capabilities of LLMs in N2F. Our work provides a new perspective for investigating the basic capabilities of decomposition and composition of LLMs in N2F.
arXiv Detail & Related papers (2025-01-24T17:15:09Z) - Argumentation Computation with Large Language Models : A Benchmark Study [6.0682923348298194]
Large language models (LLMs) have made significant advancements in neuro-symbolic computing. We aim to investigate the capability of LLMs in determining the extensions of various abstract argumentation semantics.
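One semantics such a benchmark can ask about is the grounded extension, computable as the least fixed point of Dung's characteristic function. The sketch below uses a made-up three-argument framework; it illustrates the target computation, not the paper's benchmark code:

```python
def grounded_extension(args, attacks):
    """attacks is a set of (attacker, target) pairs."""
    def defended(a, S):
        # a is acceptable w.r.t. S if every attacker of a is itself
        # attacked by some member of S.
        return all(
            any((s, b) in attacks for s in S)
            for (b, t) in attacks if t == a
        )
    S = set()
    while True:
        S_next = {a for a in args if defended(a, S)}
        if S_next == S:
            return S
        S = S_next

# a attacks b, b attacks c: a is unattacked, so a is in; a defends c.
print(sorted(grounded_extension({"a", "b", "c"}, {("a", "b"), ("b", "c")})))
# -> ['a', 'c']
```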
arXiv Detail & Related papers (2024-12-21T18:23:06Z) - On the Design and Analysis of LLM-Based Algorithms [74.7126776018275]
Large language models (LLMs) are increasingly used as sub-routines within larger algorithms.
Such LLM-based algorithms have achieved remarkable empirical success.
Our proposed framework holds promise for advancing their design and analysis.
arXiv Detail & Related papers (2024-07-20T07:39:07Z) - ICLEval: Evaluating In-Context Learning Ability of Large Language Models [68.7494310749199]
In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs), as it empowers them to comprehend and reason across interconnected inputs. Existing evaluation frameworks primarily focus on language abilities and knowledge, often overlooking the assessment of ICL ability. We introduce the ICLEval benchmark to evaluate the ICL abilities of LLMs, which encompasses two key sub-abilities: exact copying and rule learning.
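A hedged sketch of what an exact-copying probe could look like (the benchmark's actual prompts and scoring are not reproduced here): the model sees key-value pairs in context and is scored on reproducing the queried value verbatim.

```python
def build_prompt(pairs, query_key):
    """Lay out key-value pairs in context, then query one key."""
    lines = [f"{k} -> {v}" for k, v in pairs]
    lines.append(f"{query_key} ->")
    return "\n".join(lines)

def score_exact_copy(model_output: str, expected: str) -> bool:
    """Exact-copy scoring: the value must be reproduced verbatim."""
    return model_output.strip() == expected

pairs = [("alpha", "x91"), ("beta", "k24"), ("gamma", "p07")]
print(build_prompt(pairs, "beta"))
print(score_exact_copy(" k24 ", "k24"))  # -> True
```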
arXiv Detail & Related papers (2024-06-21T08:06:10Z) - CausalBench: A Comprehensive Benchmark for Causal Learning Capability of LLMs [27.362012903540492]
The ability to understand causality significantly impacts the competence of large language models (LLMs) in output explanation and counterfactual reasoning.
arXiv Detail & Related papers (2024-04-09T14:40:08Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models [59.84769254832941]
We propose a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp.
Specifically, the cunning texts that FLUB focuses on consist mainly of tricky, humorous, and misleading texts collected from the real internet environment.
Based on FLUB, we investigate the performance of multiple representative and advanced LLMs.
arXiv Detail & Related papers (2024-02-16T22:12:53Z) - Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation.
We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets of perturbed questions.
We show a significant performance drop across all the models against perturbed questions.
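One simple instance of such a perturbation (a numeric surface change, given as an illustrative assumption rather than the paper's ontology) rewrites the numbers in a question so a memorized answer no longer applies:

```python
import random
import re

def perturb_numbers(question: str, rng: random.Random) -> str:
    """Replace each integer with a nearby value, forcing the answer to
    be re-derived rather than recalled."""
    return re.sub(
        r"\d+",
        lambda m: str(int(m.group()) + rng.randint(1, 9)),
        question,
    )

rng = random.Random(0)
print(perturb_numbers("A train travels 120 km in 3 hours. What is its speed?", rng))
```

Comparing accuracy on the original and perturbed variants yields the performance drop the abstract reports.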
arXiv Detail & Related papers (2024-01-17T18:13:07Z) - A Principled Framework for Knowledge-enhanced Large Language Model [58.1536118111993]
Large Language Models (LLMs) are versatile, yet they often falter in tasks requiring deep and reliable reasoning.
This paper introduces a rigorously designed framework for creating LLMs that effectively anchor knowledge and employ a closed-loop reasoning process.
arXiv Detail & Related papers (2023-11-18T18:10:02Z)