FindingDory: A Benchmark to Evaluate Memory in Embodied Agents
- URL: http://arxiv.org/abs/2506.15635v1
- Date: Wed, 18 Jun 2025 17:06:28 GMT
- Title: FindingDory: A Benchmark to Evaluate Memory in Embodied Agents
- Authors: Karmesh Yadav, Yusuf Ali, Gunshi Gupta, Yarin Gal, Zsolt Kira
- Abstract summary: We introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness.
- Score: 49.89792845476579
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is limited by their ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning. We also present baselines that integrate state-of-the-art VLMs with low-level navigation policies, assessing their performance on these memory-intensive tasks and highlighting areas for improvement.
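The baseline recipe the abstract describes (a VLM reasoning over an image memory, handing goals to a low-level navigation policy) can be sketched roughly as below. This is a hedged illustration, not the paper's implementation: `vlm.select_goal`, `nav_policy.act`, and the pose-tagged memory are all assumed stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicMemory:
    frames: list = field(default_factory=list)  # (image, agent_pose) tuples
    budget: int = 200  # VLMs typically handle only a few hundred images

    def add(self, image, pose):
        self.frames.append((image, pose))

    def subsample(self):
        # Uniform temporal subsampling to fit the VLM's context budget.
        if len(self.frames) <= self.budget:
            return self.frames
        stride = len(self.frames) / self.budget
        return [self.frames[int(i * stride)] for i in range(self.budget)]

def step(vlm, nav_policy, memory, instruction, observation):
    # 1) Recall: the VLM picks the memory frame relevant to the instruction.
    context = memory.subsample()
    goal_idx = vlm.select_goal([img for img, _ in context], instruction)
    _, goal_pose = context[goal_idx]
    # 2) Act: the low-level policy drives toward the recalled goal pose.
    return nav_policy.act(observation, goal_pose)
```

The subsampling step reflects the abstract's observation that current VLMs process at most a few hundred images at once; a real system would likely replace the uniform stride with learned retrieval or summarization.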
Related papers
- HQ-SMem: Video Segmentation and Tracking Using Memory Efficient Object Embedding With Selective Update and Self-Supervised Distillation Feedback [0.0]
We introduce HQ-SMem, for High Quality video segmentation and tracking using Smart Memory. Our approach incorporates three key innovations: (i) leveraging SAM with High-Quality masks (SAM-HQ) alongside appearance-based candidate-selection to refine coarse segmentation masks, resulting in improved object boundaries; (ii) implementing a dynamic smart memory mechanism that selectively stores relevant key frames while discarding redundant ones; and (iii) dynamically updating the appearance model to effectively handle complex topological object variations and reduce drift throughout the video.
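Innovation (ii) suggests a simple pattern worth sketching: store a key frame only when its appearance embedding is sufficiently dissimilar from what memory already holds. The sketch below is one plausible reading under that assumption, not HQ-SMem's actual mechanism.

```python
import numpy as np

class SmartMemory:
    """Keep a frame only if it adds new appearance information."""

    def __init__(self, sim_threshold=0.9, capacity=64):
        self.sim_threshold = sim_threshold
        self.capacity = capacity
        self.keys = []    # unit-norm embeddings of stored key frames
        self.frames = []  # the stored key frames

    def maybe_store(self, frame, embedding):
        e = np.asarray(embedding, dtype=float)
        e = e / np.linalg.norm(e)
        # Discard frames too similar to an existing key (redundant).
        if any(float(e @ k) > self.sim_threshold for k in self.keys):
            return False
        if len(self.frames) >= self.capacity:
            # Evict the oldest key frame to bound memory size.
            self.keys.pop(0)
            self.frames.pop(0)
        self.keys.append(e)
        self.frames.append(frame)
        return True
```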
arXiv Detail & Related papers (2025-07-25T03:28:05Z)
- MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents [84.62985963113245]
We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. We show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task.
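The constant-memory loop the summary describes can be approximated as follows. The `llm` callable and the prompt format are illustrative assumptions; MEM1 itself learns this consolidation behavior with reinforcement learning rather than relying on prompting.

```python
def run_agent(llm, env, max_turns=50):
    state = ""  # compact internal state; its size stays bounded across turns
    obs = env.reset()
    for _ in range(max_turns):
        # A single call both consolidates memory and selects the next action,
        # so the prompt length is constant regardless of the horizon.
        reply = llm(
            f"Internal state:\n{state}\n\nNew observation:\n{obs}\n\n"
            "Rewrite the internal state to keep only what matters, "
            "then finish with ACTION: <action>."
        )
        # Assumes the model follows the format; everything before the
        # ACTION: marker becomes the new internal state.
        state, _, action = reply.rpartition("ACTION:")
        obs, done = env.step(action.strip())
        if done:
            break
```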
arXiv Detail & Related papers (2025-06-18T19:44:46Z)
- RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation [80.20970723577818]
We introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations.
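A top-down pipeline of this kind reduces to two LLM calls, sketched below with a generic `chat` stand-in; the prompts are hypothetical, not RoboCerebra's.

```python
def generate_task(chat, scene_description):
    # Step 1: the LLM writes a long-horizon task instruction for the scene.
    instruction = chat(
        f"Scene: {scene_description}\n"
        "Write one long-horizon manipulation task for a household robot."
    )
    # Step 2: the same LLM decomposes it into an ordered subtask sequence.
    subtasks = chat(
        f"Task: {instruction}\n"
        "Decompose this task into an ordered list of short subtasks, "
        "one per line."
    ).splitlines()
    return instruction, [s.strip() for s in subtasks if s.strip()]
```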
arXiv Detail & Related papers (2025-06-07T06:15:49Z)
- 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model [83.70640091897947]
Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. Current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions.
arXiv Detail & Related papers (2025-05-28T17:59:13Z)
- Long-Context State-Space Video World Models [66.28743632951218]
We propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory.
arXiv Detail & Related papers (2025-05-26T16:12:41Z)
- Minerva: A Programmable Memory Test Benchmark for Language Models [18.474144165594225]
We present a framework for automatically generating a comprehensive set of tests to evaluate models' abilities to use their memory effectively. We evaluate models on atomic tasks such as searching, recalling, editing, matching, and comparing information in context memory. We also design composite tests to investigate the models' ability to perform more complex, integrated tasks.
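Programmatic test generation of this sort is easy to illustrate: the toy generator below builds an atomic recall test in the spirit of Minerva's tasks, though the format is a hypothetical example rather than the benchmark's own.

```python
import random

def make_recall_test(num_facts=100, seed=0):
    rng = random.Random(seed)
    # Synthesize a context of key-value facts the model must hold in memory.
    facts = {f"item_{i}": rng.randint(0, 9999) for i in range(num_facts)}
    context = "\n".join(f"{k} = {v}" for k, v in facts.items())
    # Ask the model to recall one fact from that context.
    key = rng.choice(sorted(facts))
    prompt = f"{context}\n\nWhat is the value of {key}?"
    return prompt, str(facts[key])  # (model input, expected answer)

prompt, answer = make_recall_test(num_facts=500)
```

Because the generator is seeded and parameterized, the same template scales to arbitrarily long contexts, which is what makes such a benchmark "programmable".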
arXiv Detail & Related papers (2025-02-05T16:53:45Z)
- Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning [64.93848182403116]
Current deep-learning memory models struggle in reinforcement learning environments that are partially observable and long-horizon.
We introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents.
Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks.
arXiv Detail & Related papers (2024-10-14T03:50:17Z)
- Self-Updatable Large Language Models by Integrating Context into Model Parameters [21.742149718161716]
Small-scale experiences, such as interactions with surrounding objects, require frequent integration into large language models. Current methods embed experiences within model parameters using continual learning, model editing, or knowledge distillation techniques. We propose SELF-PARAM, which embeds experiences directly into model parameters and ensures near-optimal efficacy and long-term retention.
arXiv Detail & Related papers (2024-10-01T08:18:17Z) - A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented,
Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic Segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion of the main methods, advantages, limitations, results, and challenges from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- Learning to Actively Reduce Memory Requirements for Robot Control Tasks [4.302265156822829]
State-of-the-art approaches for controlling robots often use memory representations that are excessively rich for the task or rely on hand-crafted tricks for memory efficiency.
This work provides a general approach for jointly synthesizing memory representations and policies.
arXiv Detail & Related papers (2020-08-17T16:20:13Z)