Related papers: BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues

BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues

URL: http://arxiv.org/abs/2501.10836v1
Date: Sat, 18 Jan 2025 18:06:03 GMT
Title: BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues
Authors: Prashant Jayannavar, Liliang Ren, Marisa Hudspeth, Charlotte Lambert, Ariel Cordes, Elizabeth Kaplan, Anjali Narayan-Chen, Julia Hockenmaier,
Abstract summary: The Minecraft Collaborative Building Task (MCBT) provides one such setting to work towards this goal.<n>We focus on the challenging Builder Action Prediction (BAP) subtask of predicting correct action sequences in a multimodal game context.<n>We take a closer look at evaluation and data for the BAP task, discovering key challenges and making significant improvements on both fronts to propose BAP v2, an upgraded version of the task.
Score: 7.377606500245465
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Interactive agents capable of understanding and executing instructions in the physical world have long been a central goal in AI research. The Minecraft Collaborative Building Task (MCBT) provides one such setting to work towards this goal (Narayan-Chen, Jayannavar, and Hockenmaier 2019). It is a two-player game in which an Architect (A) instructs a Builder (B) to construct a target structure in a simulated Blocks World Environment. We focus on the challenging Builder Action Prediction (BAP) subtask of predicting correct action sequences in a given multimodal game context with limited training data (Jayannavar, Narayan-Chen, and Hockenmaier 2020). We take a closer look at evaluation and data for the BAP task, discovering key challenges and making significant improvements on both fronts to propose BAP v2, an upgraded version of the task. This will allow future work to make more efficient and meaningful progress on it. It comprises of: (1) an enhanced evaluation benchmark that includes a cleaner test set and fairer, more insightful metrics, and (2) additional synthetic training data generated from novel Minecraft dialogue and target structure simulators emulating the MCBT. We show that the synthetic data can be used to train more performant and robust neural models even with relatively simple training methods. Looking ahead, such data could also be crucial for training more sophisticated, data-hungry deep transformer models and training/fine-tuning increasingly large LLMs. Although modeling is not the primary focus of this work, we also illustrate the impact of our data and training methodologies on a simple LLM- and transformer-based model, thus validating the robustness of our approach, and setting the stage for more advanced architectures and LLMs going forward.

Related papers

LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback [121.78866929908871]
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data.<n>We present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback.<n>Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback.
arXiv Detail & Related papers (2025-06-02T22:36:02Z)
MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models [34.138699712315]
This paper introduces a novel vision--action (VLA) model, mixture of robotic experts (MoRE) for quadruped robots. MoRE integrates multiple low-rank adaptation modules as distinct experts within a dense multi-modal large language model. Experiments demonstrate that MoRE outperforms all baselines across six different skills and exhibits superior generalization capabilities in out-of-distribution scenarios.
arXiv Detail & Related papers (2025-03-11T03:13:45Z)
YuLan-Mini: An Open Data-efficient Language Model [111.02822724500552]
YuLan-Mini, a highly capable base model with 2.42B parameters, achieves top-tier performance among models of similar parameter scale. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data.
arXiv Detail & Related papers (2024-12-23T17:47:53Z)
Synthetic Vision: Training Vision-Language Models to Understand Physics [9.474337395173388]
We propose two methods to enhance Vision-Language Models' physical reasoning capabilities using simulated data.<n>First, we fine-tune a pre-trained VLM using question-answer pairs generated from simulations relevant to physical reasoning tasks.<n>Second, we introduce Physics Context Builders (PCBs) to create scene descriptions enriched with physical properties and processes.
arXiv Detail & Related papers (2024-12-11T18:40:16Z)
Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation [12.736045604858738]
Recent advances in large language model (LLM) training have highlighted the need for diverse, high-quality instruction data.<n>We propose a paradigm shift named textbfNOMAD by investigating how to specifically train models for data generation.
arXiv Detail & Related papers (2024-10-27T07:38:39Z)
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate [118.37653302885607]
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs) MIR is indicative about training data selection, training strategy schedule, and model architecture design to get better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
A Backbone for Long-Horizon Robot Task Understanding [8.889888977376886]
Therblig-Based Backbone Framework (TBBF) is a structure to enhance interpretability, data efficiency, and generalization in robotic systems. TBBF utilizes expert demonstrations to enable therblig-level task decomposition. During the offline training stage, we developed the Meta-RGate SynerFusion network for accurate therblig segmentation. In the online testing stage, after a one-shot demonstration of a new task is collected, our MGSF network extracts high-level knowledge.
arXiv Detail & Related papers (2024-08-02T15:32:42Z)
From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs [62.496139001509114]
We introduce a novel tool invocation pipeline designed to control massive real-world APIs. This pipeline mirrors the human task-solving process, addressing complicated real-life user queries. Empirical evaluations of our Sum2Act pipeline on the ToolBench benchmark show significant performance improvements.
arXiv Detail & Related papers (2024-02-28T08:42:23Z)
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [95.37585041654535]
Embodied AI is capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI. Experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering.
arXiv Detail & Related papers (2023-05-24T11:04:30Z)
INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models [40.54353850357839]
We show how we can employ submodular optimization to select highly representative subsets of the training corpora. We show that the resulting models achieve up to $sim99%$ of the performance of the fully-trained models.
arXiv Detail & Related papers (2023-05-11T09:24:41Z)
Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning [119.70303730341938]
We propose ePisode cUrriculum inveRsion (ECI) during data-free meta training and invErsion calibRation following inner loop (ICFIL) during meta testing. ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model. We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner.
arXiv Detail & Related papers (2023-03-20T15:10:41Z)
Reinforcement Learning Friendly Vision-Language Model for Minecraft [31.863271032186038]
We propose a novel cross-modal contrastive learning framework architecture, CLIP4MC. We aim to learn a reinforcement learning (RL) friendly vision-language model (VLM) that serves as an intrinsic reward function for open-ended tasks. We demonstrate that the proposed method achieves better performance on RL tasks compared with baselines.
arXiv Detail & Related papers (2023-03-19T05:20:52Z)
Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups. We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. Our model also achieve strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.