Break Out the Silverware -- Semantic Understanding of Stored Household Items
- URL: http://arxiv.org/abs/2512.23739v1
- Date: Thu, 25 Dec 2025 15:21:49 GMT
- Title: Break Out the Silverware -- Semantic Understanding of Stored Household Items
- Authors: Michaela Levi-Richter, Reuth Mirsky, Oren Glickman
- Abstract summary: Stored Household Item Challenge is a benchmark task for evaluating service robots' cognitive capabilities. We introduce NOAM, a hybrid agent pipeline that combines structured scene understanding with large language model inference.
- Score: 5.413873477820601
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: "Bring me a plate." For domestic service robots, this simple command reveals a complex challenge: inferring where everyday items are stored, often out of sight in drawers, cabinets, or closets. Despite advances in vision and manipulation, robots still lack the commonsense reasoning needed to complete this task. We introduce the Stored Household Item Challenge, a benchmark task for evaluating service robots' cognitive capabilities: given a household scene and a queried item, predict its most likely storage location. Our benchmark includes two datasets: (1) a real-world evaluation set of 100 item-image pairs with human-annotated ground truth from participants' kitchens, and (2) a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. These datasets support realistic modeling of household organization and enable comparative evaluation across agent architectures. To begin tackling this challenge, we introduce NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer the most likely hidden storage location. This integrated vision-language agent exhibits emergent commonsense reasoning and is designed for modular deployment within broader robotic systems. We evaluate NOAM against baselines including random selection, vision-language pipelines (Grounding-DINO + SAM), leading multimodal models (e.g., Gemini, GPT-4o, Kosmos-2, LLaMA, Qwen), and human performance. NOAM significantly improves prediction accuracy and approaches human-level results, highlighting best practices for deploying cognitively capable agents in domestic environments.
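The two-stage idea described in the abstract (render the visible containers as text, then ask a language model for the most likely storage location) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Container` fields, the prompt wording, the fallback rule, and the `ask_llm` callable (a stand-in for whatever model client is used) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Container:
    """A visible storage container. All fields are hypothetical."""
    ident: str     # short id, e.g. "C1"
    kind: str      # e.g. "drawer", "upper cabinet"
    position: str  # coarse spatial context, e.g. "left of the sink"


def describe_scene(containers: List[Container]) -> str:
    """Render the structured scene as one plain-language line per container."""
    return "\n".join(f"{c.ident}: a {c.kind}, {c.position}" for c in containers)


def build_prompt(item: str, containers: List[Container]) -> str:
    """Combine the scene description and the queried item into one prompt."""
    return (
        "The kitchen contains these storage containers:\n"
        f"{describe_scene(containers)}\n"
        f"Where is the {item} most likely stored? "
        "Answer with one container id."
    )


def predict_storage(
    item: str,
    containers: List[Container],
    ask_llm: Callable[[str], str],
) -> str:
    """Query the model and return the first valid container id in its reply.

    Falls back to the first container when the reply names no known id
    (an arbitrary choice made for this sketch).
    """
    reply = ask_llm(build_prompt(item, containers))
    valid = {c.ident for c in containers}
    for token in reply.replace(".", " ").replace(",", " ").split():
        if token in valid:
            return token
    return containers[0].ident
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client and makes it easy to substitute a stub when testing.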
Related papers
- PersONAL: Towards a Comprehensive Benchmark for Personalized Embodied Agents [47.44972258523047]
PersONAL is a benchmark to study personalization in Embodied AI. It comprises over 2,000 high-quality episodes across 30+ photorealistic homes from the HM3D dataset. The benchmark supports two evaluation modes: (1) active navigation in unseen environments, and (2) object grounding in previously mapped scenes.
arXiv Detail & Related papers (2025-09-24T07:39:16Z)
- LIAM: Multimodal Transformer for Language Instructions, Images, Actions and Semantic Maps [24.822276221016832]
We propose LIAM, an end-to-end model that predicts action transcripts based on language, image, action, and map inputs. We evaluate our method on the ALFRED dataset, a simulator-generated benchmark for domestic tasks.
arXiv Detail & Related papers (2025-03-15T18:54:06Z)
- Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment [18.256529559741075]
Large Language Models (LLMs) serve as key components in agent systems, where their common-sense knowledge significantly impacts performance as language-based planners for situated or embodied action. We assess LLMs' incremental learning (based on feedback from the environment) and controlled in-context learning abilities using a text-based environment. Results show that larger commercial models have a substantial gap in performance compared to open-weight models, but almost all models struggle with the synthetic words experiments.
arXiv Detail & Related papers (2025-02-17T12:20:39Z)
- Structured Spatial Reasoning with Open Vocabulary Object Detectors [2.089191490381739]
Reasoning about spatial relationships between objects is essential for many real-world robotic tasks.
We introduce a structured probabilistic approach that integrates rich 3D geometric features with state-of-the-art open-vocabulary object detectors.
The approach is evaluated and compared against zero-shot performance of the state-of-the-art Vision and Language Models (VLMs) on spatial reasoning tasks.
arXiv Detail & Related papers (2024-10-09T19:37:01Z)
- Space-LLaVA: a Vision-Language Model Adapted to Extraterrestrial Applications [14.89043819048682]
We see three core challenges in the future of space robotics that motivate building FM for space robotics. As a first step towards a space foundation model, we augment three extraterrestrial databases with fine-grained annotations. We fine-tune a Vision-Language Model to adapt to the semantic features in an extraterrestrial environment.
arXiv Detail & Related papers (2024-08-12T05:07:24Z)
- Improving Zero-Shot ObjectNav with Generative Communication [60.84730028539513]
We propose a new method for improving zero-shot ObjectNav.
Our approach takes into account that the ground agent may have a limited and sometimes obstructed view.
arXiv Detail & Related papers (2024-08-03T22:55:26Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- Kosmos-2: Grounding Multimodal Large Language Models to the World [107.27280175398089]
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM).
It enables new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.
Code and pretrained models are available at https://aka.ms/kosmos-2.
arXiv Detail & Related papers (2023-06-26T16:32:47Z)
- HomeRobot: Open-Vocabulary Mobile Manipulation [107.05702777141178]
Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location.
HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments; and a real-world component, providing a software stack for the low-cost Hello Robot Stretch.
arXiv Detail & Related papers (2023-06-20T14:30:32Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)