CABENCH: Benchmarking Composable AI for Solving Complex Tasks through Composing Ready-to-Use Models
- URL: http://arxiv.org/abs/2508.02427v1
- Date: Mon, 04 Aug 2025 13:48:32 GMT
- Title: CABENCH: Benchmarking Composable AI for Solving Complex Tasks through Composing Ready-to-Use Models
- Authors: Tung-Thuy Pham, Duy-Quan Luong, Minh-Quan Duong, Trung-Hieu Nguyen, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo
- Abstract summary: Composable AI offers a scalable and effective paradigm for tackling complex AI tasks. We introduce CABENCH, the first public benchmark comprising 70 realistic composable AI tasks. We also propose an evaluation framework to enable end-to-end assessment of composable AI solutions.
- Score: 5.372827470241613
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Composable AI offers a scalable and effective paradigm for tackling complex AI tasks by decomposing them into sub-tasks and solving each sub-task using ready-to-use well-trained models. However, systematically evaluating methods under this setting remains largely unexplored. In this paper, we introduce CABENCH, the first public benchmark comprising 70 realistic composable AI tasks, along with a curated pool of 700 models across multiple modalities and domains. We also propose an evaluation framework to enable end-to-end assessment of composable AI solutions. To establish initial baselines, we provide human-designed reference solutions and compare their performance with two LLM-based approaches. Our results illustrate the promise of composable AI in addressing complex real-world problems while highlighting the need for methods that can fully unlock its potential by automatically generating effective execution pipelines.
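The composable-AI pattern the abstract describes — decomposing a complex task into sub-tasks, matching each sub-task to a ready-to-use model from a pool, and chaining the selected models into an execution pipeline — can be sketched as follows. This is a minimal illustration only; the model names, pool structure, and `compose` helper are hypothetical and do not reflect CABENCH's actual API or model pool.

```python
# Illustrative sketch of composable AI: decompose a task into sub-tasks,
# pick one ready-to-use model per sub-task, chain them into a pipeline.
# All names here are hypothetical, not CABENCH's actual interface.
from typing import Any, Callable

ModelFn = Callable[[Any], Any]

# Hypothetical pool of ready-to-use models, keyed by the sub-task solved.
# Stub lambdas stand in for real pretrained models.
MODEL_POOL: dict[str, ModelFn] = {
    "speech_to_text": lambda audio: f"transcript({audio})",
    "translate_en_fr": lambda text: f"fr({text})",
    "summarize": lambda text: f"summary({text})",
}

def compose(subtasks: list[str]) -> ModelFn:
    """Chain one model per sub-task into a single execution pipeline."""
    models = [MODEL_POOL[name] for name in subtasks]

    def pipeline(x: Any) -> Any:
        # Feed each model's output into the next model in sequence.
        for model in models:
            x = model(x)
        return x

    return pipeline

# Example: "summarize a French translation of an audio clip" decomposes
# into three sub-tasks, each solved by a ready-to-use model.
pipeline = compose(["speech_to_text", "translate_en_fr", "summarize"])
print(pipeline("clip.wav"))  # summary(fr(transcript(clip.wav)))
```

The LLM-based baselines in the paper would, in effect, automate the two manual steps here: choosing the sub-task decomposition and selecting which pooled model solves each sub-task.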
Related papers
- Let the Barbarians In: How AI Can Accelerate Systems Performance Research [80.43506848683633]
We term this iterative cycle of generation, evaluation, and refinement AI-Driven Research for Systems. We demonstrate that ADRS-generated solutions can match or even outperform human state-of-the-art designs.
arXiv Detail & Related papers (2025-12-16T18:51:23Z)
- Agentic AI Sustainability Assessment for Supply Chain Document Insights [0.0]
This paper presents a comprehensive sustainability assessment framework for document intelligence within supply chain operations centered on agentic artificial intelligence (AI). We address the dual objective of improving automation efficiency while providing measurable environmental performance in document-intensive paper extraction. We show that AI-assisted HITL and agentic AI scenarios achieve reductions of up to 7 in energy consumption, 90-97% in carbon dioxide emissions, and 89-98% in water usage compared to manual processes.
arXiv Detail & Related papers (2025-11-10T13:38:08Z)
- An Agentic Framework with LLMs for Solving Complex Vehicle Routing Problems [66.60904891478687]
We propose an Agentic Framework with LLMs (AFL) for solving complex vehicle routing problems. AFL directly extracts knowledge from raw inputs and enables self-contained code generation. We show that AFL substantially outperforms existing LLM-based baselines in both code reliability and solution feasibility.
arXiv Detail & Related papers (2025-10-19T03:59:25Z)
- Barbarians at the Gate: How AI is Upending Systems Research [58.95406995634148]
We argue that systems research, long focused on designing and evaluating new performance-oriented algorithms, is particularly well-suited for AI-driven solution discovery. We term this approach AI-Driven Research for Systems (ADRS), which iteratively generates, evaluates, and refines solutions. Our results highlight both the disruptive potential and the urgent need to adapt systems research practices in the age of AI.
arXiv Detail & Related papers (2025-10-07T17:49:24Z)
- Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework. GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution. We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z)
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries [38.56775962026289]
We present LiveMCP-101, a benchmark of 101 carefully curated real-world queries. Experiments show that even frontier LLMs achieve a success rate below 60%. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities.
arXiv Detail & Related papers (2025-08-21T17:55:54Z)
- OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
- Scalable, Symbiotic, AI and Non-AI Agent Based Parallel Discrete Event Simulations [0.0]
This paper introduces a novel parallel discrete event simulation (PDES) based methodology to combine multiple AI and non-AI agents. We evaluate our approach by solving four problems from four different domains and comparing the results with those from AI models alone. Results show that the overall accuracy of our approach is 68%, whereas the accuracy of vanilla models is less than 23%.
arXiv Detail & Related papers (2025-05-28T17:50:01Z)
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
- AI Benchmarks and Datasets for LLM Evaluation [0.46960837342692324]
The EU AI Act, adopted by the European Parliament on March 13, 2024, establishes the first comprehensive EU-wide requirements for the development, deployment, and use of AI systems. It highlights the need to enrich this methodology with practical benchmarks to effectively address the technical challenges posed by AI systems. We have launched a project, part of the AI Safety Bulgaria initiatives, aimed at collecting and categorizing AI benchmarks.
arXiv Detail & Related papers (2024-12-02T00:38:57Z)
- The Foundations of Computational Management: A Systematic Approach to Task Automation for the Integration of Artificial Intelligence into Existing Workflows [55.2480439325792]
This article introduces Computational Management, a systematic approach to task automation.
The article offers three easy step-by-step procedures to begin the process of implementing AI within a workflow.
arXiv Detail & Related papers (2024-02-07T01:45:14Z)
- Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror the multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- Scalable AI Safety via Doubly-Efficient Debate [37.25328923531058]
The emergence of pre-trained AI systems with powerful capabilities has raised a critical challenge for AI safety.
The original framework was based on the assumption that the honest strategy is able to simulate AI systems for an exponential number of steps.
We show how to address these challenges by designing a new set of protocols.
arXiv Detail & Related papers (2023-11-23T17:46:30Z)
- SEGO: Sequential Subgoal Optimization for Mathematical Problem-Solving [64.38649623473626]
Large Language Models (LLMs) have driven substantial progress in artificial intelligence.
We propose a novel framework called SEquential subGoal Optimization (SEGO) to enhance LLMs' ability to solve mathematical problems.
arXiv Detail & Related papers (2023-10-19T17:56:40Z)
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
- Exploring Viable Algorithmic Options for Learning from Demonstration (LfD): A Parameterized Complexity Approach [0.0]
In this paper, we show how such a systematic exploration of algorithmic options can be done using parameterized complexity analysis.
We show that none of our problems can be solved efficiently either in general or relative to a number of (often simultaneous) restrictions on environments, demonstrations, and policies.
arXiv Detail & Related papers (2022-05-10T15:54:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.