Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning
- URL: http://arxiv.org/abs/2505.16928v1
- Date: Thu, 22 May 2025 17:20:38 GMT
- Title: Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning
- Authors: Bosung Kim, Prithviraj Ammanabrolu
- Abstract summary: $\infty$-THOR is a new framework for long-horizon embodied tasks that advances long-context understanding in embodied AI. $\infty$-THOR provides: (1) a generation framework for scalable, reproducible, and unlimited long-horizon trajectories; (2) a novel embodied QA task, Needle(s) in the Embodied Haystack; and (3) a long-horizon dataset and benchmark suite.
- Score: 17.46846684309542
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce $\infty$-THOR, a new framework for long-horizon embodied tasks that advances long-context understanding in embodied AI. $\infty$-THOR provides: (1) a generation framework for synthesizing scalable, reproducible, and unlimited long-horizon trajectories; (2) a novel embodied QA task, Needle(s) in the Embodied Haystack, where multiple scattered clues across extended trajectories test agents' long-context reasoning ability; and (3) a long-horizon dataset and benchmark suite featuring complex tasks that span hundreds of environment steps, each paired with ground-truth action sequences. To enable this capability, we explore architectural adaptations, including interleaved Goal-State-Action modeling, context extension techniques, and Context Parallelism, to equip LLM-based agents for extreme long-context reasoning and interaction. Experimental results and analyses highlight the challenges posed by our benchmark and provide insights into training strategies and model behaviors under long-horizon conditions. Our work provides a foundation for the next generation of embodied AI systems capable of robust, long-term reasoning and planning.
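The abstract's mention of interleaved Goal-State-Action modeling suggests that trajectories are serialized so the goal, per-step states, and actions share a single long context. The sketch below is a hedged guess at what such a serialization could look like; the tag format, the `Step` record, and the name `format_gsa_sequence` are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of interleaved Goal-State-Action (G-S-A) sequence
# construction for an LLM-based embodied agent. Tag names and the
# trajectory schema are illustrative assumptions only.
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    observation: str   # textual summary of the environment at this step
    action: str        # ground-truth or predicted action, e.g. "PickUp(Apple)"


def format_gsa_sequence(goal: str, trajectory: List[Step]) -> str:
    """Interleave the goal with per-step states and actions into one
    long context string that an LLM agent can condition on."""
    parts = [f"<goal> {goal} </goal>"]
    for t, step in enumerate(trajectory):
        parts.append(f"<state t={t}> {step.observation} </state>")
        parts.append(f"<action t={t}> {step.action} </action>")
    return "\n".join(parts)


if __name__ == "__main__":
    demo = [
        Step("You are in the kitchen. An apple is on the counter.", "PickUp(Apple)"),
        Step("You are holding the apple. The fridge is closed.", "Open(Fridge)"),
    ]
    prompt = format_gsa_sequence("Put the apple in the fridge.", demo)
    print(prompt)  # hundreds of such steps would form the "embodied haystack"
```

Under this framing, the clues for a Needle(s) in the Embodied Haystack query would be scattered across many `<state>` segments of the resulting context, which is what motivates the context extension and Context Parallelism techniques the abstract lists.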
Related papers
- A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs [38.304628241767055]
We introduce STReason, a framework that integrates large language models with analytical capabilities for multi-task inference and execution. We show that STReason significantly outperforms LLM baselines across all metrics, particularly excelling in complex, reasoning-intensive spatio-temporal scenarios. Human evaluations validate STReason's credibility and practical utility, demonstrating its potential to reduce expert workload and broaden applicability to real-world, multi-faceted decision scenarios.
arXiv Detail & Related papers (2025-06-25T00:55:34Z) - UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce UniSTD, a unified Transformer-based framework for spatio-temporal modeling. Our work demonstrates that a task-specific vision-text model can build a generalizable model for spatio-temporal learning. We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z) - Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision [40.63870977649693]
Chain-of-Thought prompting has shown promise for multi-step reasoning, but its effectiveness for long-context scenarios remains underexplored. We propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios.
arXiv Detail & Related papers (2025-02-28T07:15:12Z) - Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method [94.74003109176581]
Long-Horizon Vision-Language Navigation (LH-VLN) is a novel VLN task that emphasizes long-term planning and decision consistency across consecutive subtasks. Our platform, benchmark and method supply LH-VLN with a robust data generation pipeline, a comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model.
arXiv Detail & Related papers (2024-12-12T09:08:13Z) - ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models [38.89166693142495]
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs). It features a controllable and diverse set of embodied tasks spanning different levels of difficulty and complexity. Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z) - Spatial Reasoning and Planning for Deep Embodied Agents [2.7195102129095003]
This thesis explores the development of data-driven techniques for spatial reasoning and planning tasks.
It focuses on enhancing learning efficiency, interpretability, and transferability across novel scenarios.
arXiv Detail & Related papers (2024-09-28T23:05:56Z) - Long-horizon Embodied Planning with Implicit Logical Inference and Hallucination Mitigation [7.668848364013772]
We present ReLEP, a novel framework for Real-time Long-horizon Embodied Planning. ReLEP can complete a wide range of long-horizon tasks without in-context examples by learning implicit logical inference through fine-tuning.
arXiv Detail & Related papers (2024-09-24T01:47:23Z) - LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z) - Generalizable Long-Horizon Manipulations with Large Language Models [91.740084601715]
This work introduces a framework harnessing the capabilities of Large Language Models (LLMs) to generate primitive task conditions for generalizable long-horizon manipulations.
We create a challenging robotic manipulation task suite based on PyBullet for long-horizon task evaluation.
arXiv Detail & Related papers (2023-10-03T17:59:46Z) - Efficient Learning of High Level Plans from Play [57.29562823883257]
We present Efficient Learning of High-Level Plans from Play (ELF-P), a framework for robotic learning that bridges motion planning and deep RL.
We demonstrate that ELF-P has significantly better sample efficiency than relevant baselines over multiple realistic manipulation tasks.
arXiv Detail & Related papers (2023-03-16T20:09:47Z) - Long-HOT: A Modular Hierarchical Approach for Long-Horizon Object Transport [83.06265788137443]
We address key challenges in long-horizon embodied exploration and navigation by proposing a new object transport task and a novel modular framework for temporally extended navigation.
Our first contribution is the design of a novel Long-HOT environment focused on deep exploration and long-horizon planning.
We propose a modular hierarchical transport policy (HTP) that builds a topological graph of the scene to perform exploration with the help of weighted frontiers (see the sketch after this entry).
arXiv Detail & Related papers (2022-10-28T05:30:49Z)
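As a rough illustration of the weighted-frontier idea mentioned in the Long-HOT summary (not the paper's actual policy), the snippet below scores frontier nodes of a topological scene graph by exploration gain discounted by travel cost; `select_frontier`, the weighting formula, and the toy scene are all assumptions for illustration.

```python
# Minimal, self-contained sketch of weighted-frontier selection over a
# topological scene graph. Graph layout, weighting formula, and names
# (e.g. `select_frontier`) are illustrative assumptions only.
from collections import deque
from typing import Dict, List


def shortest_hops(graph: Dict[str, List[str]], start: str) -> Dict[str, int]:
    """Breadth-first hop counts from `start` to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, []):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def select_frontier(graph: Dict[str, List[str]],
                    agent_node: str,
                    frontier_gain: Dict[str, float]) -> str:
    """Pick the frontier with the best gain-vs-travel-cost trade-off."""
    dist = shortest_hops(graph, agent_node)
    # Weight = expected exploration gain discounted by travel cost (hops).
    scores = {f: g / (1.0 + dist.get(f, float("inf")))
              for f, g in frontier_gain.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    scene = {"hall": ["kitchen", "garage"], "kitchen": ["pantry"],
             "garage": [], "pantry": []}
    frontiers = {"pantry": 0.9, "garage": 0.5}   # unexplored boundary nodes
    print(select_frontier(scene, "hall", frontiers))  # -> "pantry"
```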