CABENCH: Benchmarking Composable AI for Solving Complex Tasks through Composing Ready-to-Use Models
- URL: http://arxiv.org/abs/2508.02427v1
- Date: Mon, 04 Aug 2025 13:48:32 GMT
- Title: CABENCH: Benchmarking Composable AI for Solving Complex Tasks through Composing Ready-to-Use Models
- Authors: Tung-Thuy Pham, Duy-Quan Luong, Minh-Quan Duong, Trung-Hieu Nguyen, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo
- Abstract summary: Composable AI offers a scalable and effective paradigm for tackling complex AI tasks. We introduce CABENCH, the first public benchmark comprising 70 realistic composable AI tasks. We also propose an evaluation framework to enable end-to-end assessment of composable AI solutions.
- Score: 5.372827470241613
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Composable AI offers a scalable and effective paradigm for tackling complex AI tasks by decomposing them into sub-tasks and solving each sub-task using ready-to-use well-trained models. However, systematically evaluating methods under this setting remains largely unexplored. In this paper, we introduce CABENCH, the first public benchmark comprising 70 realistic composable AI tasks, along with a curated pool of 700 models across multiple modalities and domains. We also propose an evaluation framework to enable end-to-end assessment of composable AI solutions. To establish initial baselines, we provide human-designed reference solutions and compare their performance with two LLM-based approaches. Our results illustrate the promise of composable AI in addressing complex real-world problems while highlighting the need for methods that can fully unlock its potential by automatically generating effective execution pipelines.
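The composable-AI pattern the abstract describes — decomposing a complex task into sub-tasks, matching each sub-task to a ready-to-use model from a pool, and chaining the selected models into an execution pipeline — can be sketched as follows. This is a minimal illustration only; the model names, pool structure, and `compose` helper are hypothetical and do not reflect CABENCH's actual API or model pool.

```python
# Illustrative sketch of composable AI: decompose a task into sub-tasks,
# pick one ready-to-use model per sub-task, chain them into a pipeline.
# All names here are hypothetical, not CABENCH's actual interface.
from typing import Any, Callable

ModelFn = Callable[[Any], Any]

# Hypothetical pool of ready-to-use models, keyed by the sub-task solved.
# Stub lambdas stand in for real pretrained models.
MODEL_POOL: dict[str, ModelFn] = {
    "speech_to_text": lambda audio: f"transcript({audio})",
    "translate_en_fr": lambda text: f"fr({text})",
    "summarize": lambda text: f"summary({text})",
}

def compose(subtasks: list[str]) -> ModelFn:
    """Chain one model per sub-task into a single execution pipeline."""
    models = [MODEL_POOL[name] for name in subtasks]

    def pipeline(x: Any) -> Any:
        # Feed each model's output into the next model in sequence.
        for model in models:
            x = model(x)
        return x

    return pipeline

# Example: "summarize a French translation of an audio clip" decomposes
# into three sub-tasks, each solved by a ready-to-use model.
pipeline = compose(["speech_to_text", "translate_en_fr", "summarize"])
print(pipeline("clip.wav"))  # summary(fr(transcript(clip.wav)))
```

The LLM-based baselines in the paper would, in effect, automate the two manual steps here: choosing the sub-task decomposition and selecting which pooled model solves each sub-task.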
Related papers
- Let the Barbarians In: How AI Can Accelerate Systems Performance Research [80.43506848683633]
We term this iterative cycle of generation, evaluation, and refinement AI-Driven Research for Systems. We demonstrate that ADRS-generated solutions can match or even outperform human state-of-the-art designs.
arXiv Detail & Related papers (2025-12-16T18:51:23Z)
- Agentic AI Sustainability Assessment for Supply Chain Document Insights [0.0]
This paper presents a comprehensive sustainability assessment framework for document intelligence within supply chain operations centered on agentic artificial intelligence (AI). We address the dual objective of improving automation efficiency while providing measurable environmental performance in document-intensive paper extraction. We show that AI-assisted HITL and agentic AI scenarios achieve reductions of up to 7 in energy consumption, 90-97% in carbon dioxide emissions, and 89-98% in water usage compared to manual processes.
arXiv Detail & Related papers (2025-11-10T13:38:08Z)
- An Agentic Framework with LLMs for Solving Complex Vehicle Routing Problems [66.60904891478687]
We propose an Agentic Framework with LLMs (AFL) for solving complex vehicle routing problems. AFL directly extracts knowledge from raw inputs and enables self-contained code generation. We show that AFL substantially outperforms existing LLM-based baselines in both code reliability and solution feasibility.
arXiv Detail & Related papers (2025-10-19T03:59:25Z)
- Barbarians at the Gate: How AI is Upending Systems Research [58.95406995634148]
We argue that systems research, long focused on designing and evaluating new performance-oriented algorithms, is particularly well-suited for AI-driven solution discovery. We term this approach AI-Driven Research for Systems (ADRS), which iteratively generates, evaluates, and refines solutions. Our results highlight both the disruptive potential and the urgent need to adapt systems research practices in the age of AI.
arXiv Detail & Related papers (2025-10-07T17:49:24Z)
- Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework. GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution. We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z)
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries [38.56775962026289]
We present LiveMCP-101, a benchmark of 101 carefully curated real-world queries. Experiments show that even frontier LLMs achieve a success rate below 60%. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities.
arXiv Detail & Related papers (2025-08-21T17:55:54Z)
- OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
- Scalable, Symbiotic, AI and Non-AI Agent Based Parallel Discrete Event Simulations [0.0]
This paper introduces a novel parallel discrete event simulation (PDES) based methodology to combine multiple AI and non-AI agents. We evaluate our approach by solving four problems from four different domains and comparing the results with those from AI models alone. Results show that the overall accuracy of our approach is 68%, whereas the accuracy of vanilla models is less than 23%.
arXiv Detail & Related papers (2025-05-28T17:50:01Z)
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
- AI Benchmarks and Datasets for LLM Evaluation [0.46960837342692324]
The EU AI Act, adopted by the European Parliament on March 13, 2024, establishes the first comprehensive EU-wide requirements for the development, deployment, and use of AI systems. It highlights the need to enrich this methodology with practical benchmarks to effectively address the technical challenges posed by AI systems. We have launched a project, part of the AI Safety Bulgaria initiatives, aimed at collecting and categorizing AI benchmarks.
arXiv Detail & Related papers (2024-12-02T00:38:57Z)
- The Foundations of Computational Management: A Systematic Approach to Task Automation for the Integration of Artificial Intelligence into Existing Workflows [55.2480439325792]
This article introduces Computational Management, a systematic approach to task automation.
The article offers three easy step-by-step procedures to begin the process of implementing AI within a workflow.
arXiv Detail & Related papers (2024-02-07T01:45:14Z)
- Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror the multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- Scalable AI Safety via Doubly-Efficient Debate [37.25328923531058]
The emergence of pre-trained AI systems with powerful capabilities has raised a critical challenge for AI safety.
The original framework was based on the assumption that the honest strategy is able to simulate AI systems for an exponential number of steps.
We show how to address these challenges by designing a new set of protocols.
arXiv Detail & Related papers (2023-11-23T17:46:30Z)
- SEGO: Sequential Subgoal Optimization for Mathematical Problem-Solving [64.38649623473626]
Large Language Models (LLMs) have driven substantial progress in artificial intelligence.
We propose a novel framework called SEquential subGoal Optimization (SEGO) to enhance LLMs' ability to solve mathematical problems.
arXiv Detail & Related papers (2023-10-19T17:56:40Z)
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
- Exploring Viable Algorithmic Options for Learning from Demonstration (LfD): A Parameterized Complexity Approach [0.0]
In this paper, we show how such a systematic exploration of algorithmic options can be done using parameterized complexity analysis.
We show that none of our problems can be solved efficiently either in general or relative to a number of (often simultaneous) restrictions on environments, demonstrations, and policies.
arXiv Detail & Related papers (2022-05-10T15:54:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.