OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks
- URL: http://arxiv.org/abs/2508.05614v1
- Date: Thu, 07 Aug 2025 17:54:15 GMT
- Title: OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks
- Authors: Zixuan Wang, Dingming Li, Hongxing Li, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
- Abstract summary: We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks.
We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains.
Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
- Score: 52.87238755666243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses fundamentally different challenges than current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.
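To ground what a text-based scenario with continuous physical properties might look like, here is a minimal sketch in Python; the schema, field names (`mass_kg`, `strength_kg`), and the collaboration check are illustrative assumptions, not OmniEAR's actual format.

```python
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    mass_kg: float      # continuous physical property, not a discrete tag
    location: str

@dataclass
class Agent:
    name: str
    strength_kg: float  # maximum mass this agent can lift alone

def needs_collaboration(obj: Obj, agents: list[Agent]) -> bool:
    """A task implicitly requires coordination when no single agent's
    capability satisfies the physical constraint."""
    return all(a.strength_kg < obj.mass_kg for a in agents)

# A heavy table that no agent can lift alone forces implicit collaboration.
table = Obj("table", mass_kg=40.0, location="living_room")
crew = [Agent("alice", strength_kg=25.0), Agent("bob", strength_kg=25.0)]
print(needs_collaboration(table, crew))  # True -> agents must coordinate
```

The point of such a representation is that the need to collaborate is never stated in the instruction; it must be derived from the physical constraints, which is exactly where the abstract reports models degrading.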
Related papers
- AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts [35.52607495764441]
Large Language Model (LLM)-based autonomous agents demonstrate multifaceted capabilities that contribute substantially to economic production.
We introduce AgencyBench, a benchmark derived from daily AI usage that evaluates 6 core agentic capabilities across 32 real-world scenarios.
These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve.
arXiv Detail & Related papers (2026-01-16T07:22:20Z)
- The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments [0.11586753333439907]
We present an empirical study evaluating frontier AI models on 150 workplace tasks within a realistic e-commerce RL environment from Surge.
Our analysis reveals an empirically derived hierarchy of agentic capabilities that models must master for real-world deployment.
Weaker models struggle with fundamental tool use and planning, whereas stronger models primarily fail on tasks requiring contextual inference beyond explicit instructions.
arXiv Detail & Related papers (2026-01-13T23:49:06Z)
- Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties.
We derive a predictive model using coordination metrics that generalizes to unseen task domains under cross-validation.
Among the identified effects are (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
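A minimal sketch of how such a predictive model could be derived: regress task success on coordination metrics and extrapolate to an unseen configuration. The feature set and numbers below are invented for illustration; they are not the paper's data or model.

```python
import numpy as np

# Hypothetical coordination metrics per configuration:
# [n_agents, tool_calls_per_task, single_agent_baseline]
X = np.array([
    [1, 3, 0.30], [2, 3, 0.30], [4, 3, 0.30],
    [1, 9, 0.30], [2, 9, 0.55], [4, 9, 0.55],
])
y = np.array([0.32, 0.41, 0.38, 0.25, 0.52, 0.47])  # observed success

# Ordinary least squares with an intercept column.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(n_agents: int, tool_calls: int, baseline: float) -> float:
    return float(np.array([n_agents, tool_calls, baseline, 1.0]) @ coef)

print(predict(3, 6, 0.40))  # extrapolate to an unseen configuration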
arXiv Detail & Related papers (2025-12-09T06:52:21Z)
- LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language model (LLM) agents in realistic, long-context software engineering.
Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations.
The framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
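To illustrate the kind of tool interface such a harness exposes, here is a minimal registry sketch; the decorator pattern and the two tool implementations are assumptions for illustration, not LoCoBench-Agent's actual API.

```python
import os
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Register a function so the agent loop can dispatch it by name."""
    def deco(fn):
        TOOLS[name] = fn
        return fn
    return deco

@tool("read_file")
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

@tool("search")
def search(root: str, needle: str) -> str:
    hits = []
    for dirpath, _, files in os.walk(root):
        for fn in files:
            path = os.path.join(dirpath, fn)
            try:
                with open(path, encoding="utf-8") as f:
                    if needle in f.read():
                        hits.append(path)
            except (UnicodeDecodeError, OSError):
                continue
    return "\n".join(hits)

# The agent loop dispatches model-emitted calls through the registry.
print(sorted(TOOLS))  # ['read_file', 'search']
```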
arXiv Detail & Related papers (2025-11-17T23:57:24Z)
- MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling [115.74855199827596]
MiroThinker is an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities.
Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level.
arXiv Detail & Related papers (2025-11-14T18:52:07Z)
- Knowledge Grafting: A Mechanism for Optimizing AI Model Deployment in Resource-Constrained Environments [0.0]
We introduce knowledge grafting to optimize AI models for resource-constrained environments.
The approach achieves an 88.54% reduction in model size.
It can be extended across various edge computing scenarios.
arXiv Detail & Related papers (2025-07-25T13:37:45Z)
- SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments [2.184775414778289]
We introduce SetupBench, a benchmark that isolates the environment-bootstrap skill.
Our tasks span seven language ecosystems, five database engines, and multi-service orchestration scenarios.
We find low success rates across task categories, with particular challenges in repository setup (38.9-57.4%) and local database configuration (20.0-53.3%).
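A plausible shape for an environment-bootstrap task is a natural-language instruction paired with an executable success check; the schema and command below are illustrative assumptions, not SetupBench's actual format.

```python
import subprocess

task = {
    "instruction": "Create a virtualenv in .venv and install requests.",
    # Success is judged solely by running a validation command in the
    # prepared environment, not by inspecting the agent's transcript.
    "validation_cmd": [".venv/bin/python", "-c", "import requests"],
}

def passed(task: dict) -> bool:
    try:
        proc = subprocess.run(task["validation_cmd"], capture_output=True)
        return proc.returncode == 0
    except FileNotFoundError:  # the environment was never bootstrapped
        return False

print(passed(task))  # False until an agent actually sets up .venv
```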
arXiv Detail & Related papers (2025-07-11T22:45:07Z)
- Towards Building General Purpose Embedding Models for Industry 4.0 Agents [5.212780106286918]
We focus on improving language models' understanding of asset maintenance to guide engineers' decisions and minimize asset downtime.
Given a set of tasks expressed in natural language for the Industry 4.0 domain, each associated with queries related to a specific asset, we want to recommend relevant items and generalize queries of similar assets.
Our approach begins with gathering a qualitative, expert-vetted knowledge base to construct nine asset-specific task datasets.
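To make the retrieval mechanic concrete, here is a toy embed-and-rank sketch; the hashing `embed` function is a stand-in for a trained encoder, and the maintenance items are invented.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a learned embedding model: hash tokens into a
    fixed-size vector. A real system would use a trained encoder."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

items = ["replace pump bearing", "lubricate conveyor motor",
         "inspect boiler pressure valve"]
query = "pump bearing replacement procedure"

# Rank items by cosine similarity (vectors are already unit-normalized).
scores = [float(embed(query) @ embed(it)) for it in items]
print(items[int(np.argmax(scores))])  # most relevant maintenance item
```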
arXiv Detail & Related papers (2025-06-14T19:02:07Z)
- Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks [94.19506319646376]
We introduce Agent-X, a benchmark for evaluating vision-centric agents in real-world, multimodal settings.
Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text.
Our results reveal that even the best-performing models, including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks.
arXiv Detail & Related papers (2025-05-30T17:59:53Z)
- EmbodiedAgent: A Scalable Hierarchical Approach to Overcome Practical Challenge in Multi-Robot Control [4.163413782205929]
EmbodiedAgent is a hierarchical framework for heterogeneous multi-robot control.
Our approach integrates a next-action prediction paradigm with a structured memory system to decompose tasks into executable robot skills.
We present MultiPlan+, a dataset of more than 18,000 annotated planning instances spanning 100 scenarios.
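A toy sketch of the next-action-prediction-plus-memory loop; the `StructuredMemory` class and the rule-based planner are stand-ins for the paper's learned components.

```python
from collections import deque

class StructuredMemory:
    """Toy rolling memory of (observation, action) steps."""
    def __init__(self, maxlen: int = 8):
        self.steps: deque = deque(maxlen=maxlen)
    def add(self, obs: str, action: str) -> None:
        self.steps.append((obs, action))
    def context(self) -> str:
        return "; ".join(f"{o}->{a}" for o, a in self.steps)

def predict_next_action(goal: str, memory: StructuredMemory) -> str:
    """Stand-in for the planner: map goal + memory to one robot skill."""
    if "picked" not in memory.context():
        return "pick(object)"
    return "place(object, target)"

mem = StructuredMemory()
print(predict_next_action("move box to shelf", mem))  # pick(object)
mem.add("box visible", "picked box")
print(predict_next_action("move box to shelf", mem))  # place(object, target)
```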
arXiv Detail & Related papers (2025-04-14T09:33:42Z)
- SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions.
We evaluate 18 leading models, and the results show the task is challenging even for top-tier models.
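The SOP-to-graph idea can be illustrated with Python's standard-library topological sorter; the service steps below are invented, and the real benchmark derives the graph from SOP code programs rather than hand-written functions.

```python
# SOP steps as executable functions, prerequisites as directed edges,
# executed in dependency order.
from graphlib import TopologicalSorter

def verify_identity(ctx): ctx["verified"] = True
def check_balance(ctx):   ctx["balance_ok"] = ctx.get("verified", False)
def issue_refund(ctx):    ctx["refunded"] = ctx["balance_ok"]

# Each step maps to the set of steps that must run before it.
graph = {
    check_balance: {verify_identity},
    issue_refund:  {check_balance},
}

ctx: dict = {}
for step in TopologicalSorter(graph).static_order():
    step(ctx)
print(ctx)  # {'verified': True, 'balance_ok': True, 'refunded': True}
```

An agent is then judged on whether it invokes these functions in an order the graph permits, given only the natural-language SOP.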
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
- PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks [57.89516354418451]
We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR).
We employ a semi-automated task generation pipeline using Large Language Models (LLMs).
We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution.
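A minimal sketch of a semi-automated generation loop of this kind: an LLM proposes candidate tasks and an automatic grounding filter prunes them before human review. The `fake_llm` stand-in and the filter rule are assumptions for illustration, not PARTNR's actual pipeline.

```python
import random
from typing import Callable

def propose_task(llm: Callable[[str], str], scene_objects: list[str]) -> str:
    """Ask an LLM to draft a collaboration task for a scene.
    `llm` is any prompt -> text callable; swap in a real client."""
    prompt = f"Write a two-agent household task using: {scene_objects}"
    return llm(prompt)

def is_grounded(task: str, scene_objects: list[str]) -> bool:
    """Automatic filter: keep tasks referencing objects that actually
    exist in the scene. A human pass would follow in practice."""
    return any(obj in task for obj in scene_objects)

fake_llm = lambda p: f"Agents carry the {random.choice(['sofa', 'piano'])} upstairs."
scene = ["sofa", "table", "lamp"]
candidates = [propose_task(fake_llm, scene) for _ in range(5)]
print([t for t in candidates if is_grounded(t, scene)])  # ungrounded drafts dropped
```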
arXiv Detail & Related papers (2024-10-31T17:53:12Z)
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [44.34340798542]
Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning.
Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities.
We propose a framework that combines guided Monte Carlo Tree Search (MCTS) with a self-critique mechanism and iterative fine-tuning on agent interactions.
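A toy illustration of guided MCTS in which a critique score stands in for the rollout; the search problem (reach 10 via +1/+3 moves) and the scoring are invented, whereas Agent Q applies this to web-agent trajectories with an LLM critic.

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def critique(state) -> float:
    """Stand-in for the self-critique model: score a partial trajectory."""
    return 1.0 - abs(10 - state) / 10.0

def mcts(root_state=0, iters=200):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        while node.children:                     # selection
            node = max(node.children, key=ucb)
        if node.visits and node.state < 10:      # expansion
            node.children = [Node(node.state + d, node) for d in (1, 3)]
            node = random.choice(node.children)
        reward = critique(node.state)            # critique replaces rollout
        while node:                              # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state

print(mcts())  # first action chosen by the guided search
```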
arXiv Detail & Related papers (2024-08-13T20:52:13Z)
- Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web [69.6913064185993]
Language model agents (LMAs) have emerged as a promising paradigm for multi-step decision-making tasks.
Despite this promise, their performance on real-world applications remains underexplored.
We show that while existing LMAs achieve a 94.0% average success rate on base tasks, their performance degrades to a 24.9% success rate on compositional tasks.
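As an arithmetic reference point: under an independence assumption, success compounds as p^k, so 94%-reliable steps would need to be chained roughly 23 deep before success fell to ~25%; compositions of only a few base tasks failing at that rate indicate compounding errors well beyond independent per-step noise. A quick check (step counts are illustrative):

```python
# Independence baseline: p_comp = p_base ** k for k chained subtasks.
p_base = 0.940
for k in (2, 5, 10, 23):
    print(f"{k:2d} subtasks -> {p_base ** k:.3f}")
# 2 -> 0.884, 5 -> 0.734, 10 -> 0.539, 23 -> 0.241
```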
arXiv Detail & Related papers (2023-11-30T17:50:47Z)
- Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address the limitations of conventional reward models (RMs) by empowering them with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
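A toy sketch of the tool-augmented reward-modeling idea: the scorer consults an external tool before judging an answer, rather than relying on parametric knowledge alone. The calculator tool and scoring rule are invented stand-ins, not Themis's architecture.

```python
def calculator(expr: str) -> str:
    """External tool the reward model may consult (toy: arithmetic)."""
    try:
        return str(eval(expr, {"__builtins__": {}}, {}))
    except Exception:
        return "error"

def score_response(question: str, answer: str) -> float:
    """Stand-in reward model: verify the answer with a tool call."""
    if question.startswith("compute:"):
        truth = calculator(question.removeprefix("compute:"))
        return 1.0 if answer.strip() == truth else 0.0
    return 0.5  # neutral prior when no tool evidence is available

print(score_response("compute: 17 * 23", "391"))  # 1.0
```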
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.