EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?
- URL: http://arxiv.org/abs/2506.05287v1
- Date: Thu, 05 Jun 2025 17:44:12 GMT
- Title: EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?
- Authors: Yuqian Yuan, Ronghao Dang, Long Li, Wentong Li, Dian Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang
- Abstract summary: Multimodal large language models (MLLMs) have driven breakthroughs in egocentric vision applications. EOC-Bench is an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. We conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs based on EOC-Bench.
- Score: 52.99661576320663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing objects' appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users' interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation framework with four types of questions and design a novel multi-scale temporal accuracy metric for open-ended temporal evaluation. Based on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognitive capabilities of MLLMs, establishing a robust foundation for developing reliable core models for embodied systems.
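The abstract does not spell out how the multi-scale temporal accuracy metric is computed. The sketch below illustrates one plausible reading: an open-ended temporal answer is scored against the ground-truth time at several tolerance scales, and the per-scale hits are averaged so that coarser answers still earn partial credit. The tolerance values, units, and averaging scheme are assumptions made for illustration, not EOC-Bench's published definition.

```python
from typing import Sequence


def multi_scale_temporal_accuracy(
    predicted_seconds: float,
    ground_truth_seconds: float,
    tolerances: Sequence[float] = (1.0, 5.0, 15.0, 30.0),
) -> float:
    """Score an open-ended temporal answer at several tolerance scales.

    A prediction counts as correct at a given scale if its absolute error
    is within that tolerance; the final score averages the per-scale hits.
    Illustrative only -- the tolerance set and averaging are assumptions,
    not the metric defined by the EOC-Bench paper.
    """
    error = abs(predicted_seconds - ground_truth_seconds)
    hits = [error <= t for t in tolerances]
    return sum(hits) / len(tolerances)


if __name__ == "__main__":
    # A prediction 7 s off the ground truth passes the 15 s and 30 s scales
    # but misses the 1 s and 5 s scales, giving a score of 0.5.
    print(multi_scale_temporal_accuracy(27.0, 20.0))
```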
Related papers
- A Multi-Level Similarity Approach for Single-View Object Grasping: Matching, Planning, and Fine-Tuning [17.162675084829242]
We propose a method that robustly achieves unknown-object grasping from a single viewpoint through three key steps, built around a multi-level similarity matching framework that integrates semantic, geometric, and dimensional features for comprehensive evaluation. In addition, we incorporate the use of large language models, introduce the semi-oriented bounding box, and develop a novel point cloud registration approach based on plane detection to enhance matching accuracy under single-view conditions.
arXiv Detail & Related papers (2025-07-16T06:07:57Z)
- ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [48.24550684610705]
ArtifactsBench is a framework for automated visual code generation evaluation. Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z)
- Adapting Vision-Language Models for Evaluating World Models [24.813041196394582]
We present UNIVERSE, a method for adapting a Vision-language Evaluator for Rollouts in Simulated Environments under data and compute constraints. We conduct a large-scale study comparing full, partial, and parameter-efficient finetuning across task formats, context lengths, sampling strategies, and data compositions. The resulting unified evaluator matches the performance of task-specific baselines using a single checkpoint.
arXiv Detail & Related papers (2025-06-22T09:53:28Z)
- ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [47.237216851265316]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z)
- SITE: towards Spatial Intelligence Thorough Evaluation [121.1493852562597]
Spatial intelligence (SI) represents a cognitive ability encompassing the visualization and manipulation of, and reasoning about, spatial relationships. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation. Our approach to curating the benchmark combines a bottom-up survey of 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science.
arXiv Detail & Related papers (2025-05-08T17:45:44Z)
- V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. We show V-MAGE provides actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.
arXiv Detail & Related papers (2025-04-08T15:43:01Z)
- EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting [108.15136508964011]
EgoSplat is a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding. EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets.
arXiv Detail & Related papers (2025-03-14T12:21:26Z)
- ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark [44.64084739916821]
ECBench is a benchmark designed to systematically evaluate the embodied cognitive abilities of large vision-language models (LVLMs). ECBench features a diverse range of scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. We conduct extensive evaluations of proprietary, open-source, and task-specific LVLMs.
arXiv Detail & Related papers (2025-01-09T07:43:49Z)
- TestAgent: A Framework for Domain-Adaptive Evaluation of LLMs via Dynamic Benchmark Construction and Exploratory Interaction [29.72874725703848]
Large language models (LLMs) are increasingly deployed to various vertical domains. Current evaluation methods rely on static and resource-intensive datasets that are not aligned with real-world requirements. We introduce two key concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format, and TestAgent, an agent-based evaluation framework that implements these concepts using retrieval-augmented generation and reinforcement learning.
arXiv Detail & Related papers (2024-10-15T11:20:42Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Monocular Per-Object Distance Estimation with Masked Object Modeling [33.59920084936913]
Our paper draws inspiration from Masked Image Modeling (MiM) and extends it to multi-object tasks. Our strategy, termed Masked Object Modeling (MoM), enables a novel application of masking techniques. We evaluate the effectiveness of MoM on a novel reference architecture (DistFormer) on the standard KITTI, NuScenes, and MOTSynth datasets.
arXiv Detail & Related papers (2024-01-06T10:56:36Z)