JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for
Conversational Embodied Agents
- URL: http://arxiv.org/abs/2208.13266v2
- Date: Tue, 30 Aug 2022 02:10:50 GMT
- Authors: Kaizhi Zheng, Kaiwen Zhou, Jing Gu, Yue Fan, Jialu Wang, Zonglin Di,
Xuehai He, Xin Eric Wang
- Abstract summary: We propose a Neuro-Symbolic Commonsense Reasoning framework for modular, generalizable, and interpretable conversational embodied agents.
Our framework achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks, including Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC).
Our model ranks first in the Alexa Prize SimBot Public Benchmark Challenge.
- Score: 14.70666899147632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building a conversational embodied agent to execute real-life tasks has been
a long-standing yet quite challenging research goal, as it requires effective
human-agent communication, multi-modal understanding, long-range sequential
decision making, etc. Traditional symbolic methods have scaling and
generalization issues, while end-to-end deep learning models suffer from data
scarcity and high task complexity, and are often hard to explain. To benefit
from both worlds, we propose a Neuro-Symbolic Commonsense Reasoning (JARVIS)
framework for modular, generalizable, and interpretable conversational embodied
agents. First, it acquires symbolic representations by prompting large language
models (LLMs) for language understanding and sub-goal planning, and by
constructing semantic maps from visual observations. Then the symbolic module
reasons for sub-goal planning and action generation based on task- and
action-level common sense. Extensive experiments on the TEACh dataset validate
the efficacy and efficiency of our JARVIS framework, which achieves
state-of-the-art (SOTA) results on all three dialog-based embodied tasks,
including Execution from Dialog History (EDH), Trajectory from Dialog (TfD),
and Two-Agent Task Completion (TATC) (e.g., our method boosts the unseen
Success Rate on EDH from 6.1% to 15.8%). Moreover, we systematically analyze
the essential factors that affect the task performance and also demonstrate the
superiority of our method in few-shot settings. Our JARVIS model ranks first in
the Alexa Prize SimBot Public Benchmark Challenge.
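The two-stage pipeline the abstract describes (prompting an LLM to turn dialog into symbolic sub-goals, then a symbolic module expanding each sub-goal into primitive actions using action-level commonsense) can be sketched as follows. This is a minimal illustrative sketch only, not the authors' implementation: the stubbed planner, the sub-goal vocabulary, and the precondition table are all assumptions made for the example.

```python
# Hypothetical sketch of the neuro-symbolic loop from the abstract.
# A real system would prompt an LLM; here a stub stands in for it.

def llm_subgoal_planner(dialog: str) -> list[str]:
    """Stand-in for prompting an LLM with a few-shot template;
    returns symbolic sub-goals parsed from its completion."""
    if "coffee" in dialog.lower():
        return ["Pickup(Mug)", "Place(Mug, CoffeeMachine)", "ToggleOn(CoffeeMachine)"]
    return []

# Action-level commonsense: steps that must precede each operator.
# (Illustrative rule table, not the paper's actual rules.)
PRECONDITIONS = {
    "Pickup": ["Navigate"],   # the agent must be near an object to pick it up
    "Place": [],
    "ToggleOn": [],
}

def symbolic_action_generator(subgoals: list[str]) -> list[str]:
    """Expand sub-goals into primitive actions, inserting commonsense steps."""
    actions = []
    for sg in subgoals:
        op = sg.split("(")[0]
        for pre in PRECONDITIONS.get(op, []):
            # Apply the precondition to the sub-goal's first argument.
            first_arg = sg[sg.index("(") + 1:-1].split(",")[0]
            actions.append(f"{pre}({first_arg})")
        actions.append(sg)
    return actions

dialog = "Commander: Please make a cup of coffee."
plan = symbolic_action_generator(llm_subgoal_planner(dialog))
print(plan)
```

The split keeps the interpretable part (rules, preconditions) symbolic while delegating open-ended language understanding to the neural component, which is the modularity/interpretability trade-off the abstract argues for.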
Related papers
- Multitask Multimodal Prompted Training for Interactive Embodied Task
Completion [48.69347134411864]
Embodied MultiModal Agent (EMMA) is a unified encoder-decoder model that reasons over images and trajectories.
By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks.
arXiv Detail & Related papers (2023-11-07T15:27:52Z) - InstructERC: Reforming Emotion Recognition in Conversation with Multi-task Retrieval-Augmented Large Language Models [9.611864685207056]
We propose a novel approach, InstructERC, to reformulate the emotion recognition task from a discriminative framework to a generative framework based on Large Language Models (LLMs).
InstructERC makes three significant contributions: (1) it introduces a simple yet effective retrieval template module that helps the model explicitly integrate multi-granularity dialogue supervision information; (2) it adds two auxiliary emotion alignment tasks, speaker identification and emotion prediction, to implicitly model dialogue role relationships and future emotional tendencies in conversations; and (3) it unifies emotion labels across benchmarks through the feeling wheel to fit real application scenarios.
arXiv Detail & Related papers (2023-09-21T09:22:07Z) - From Chatter to Matter: Addressing Critical Steps of Emotion Recognition
Learning in Task-oriented Dialogue [6.918298428336528]
We propose a framework that turns a chit-chat ERC model into a task-oriented one.
We use dialogue states as auxiliary features to incorporate key information from the user's goal.
Our framework yields significant improvements for a range of chit-chat ERC models on EmoWOZ.
arXiv Detail & Related papers (2023-08-24T08:46:30Z) - DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning [89.92601337474954]
Pragmatic reasoning plays a pivotal role in deciphering implicit meanings that frequently arise in real-life conversations.
We introduce a novel challenge, DiPlomat, aiming at benchmarking machines' capabilities on pragmatic reasoning and situated conversational understanding.
arXiv Detail & Related papers (2023-06-15T10:41:23Z) - Learning Action-Effect Dynamics for Hypothetical Vision-Language
Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z) - A Multi-Task BERT Model for Schema-Guided Dialogue State Tracking [78.2700757742992]
Task-oriented dialogue systems often employ a Dialogue State Tracker (DST) to successfully complete conversations.
Recent state-of-the-art DST implementations rely on schemata of diverse services to improve model robustness.
We propose a single multi-task BERT-based model that jointly solves the three DST tasks of intent prediction, requested slot prediction and slot filling.
arXiv Detail & Related papers (2022-07-02T13:27:59Z) - Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System [26.837972034630003]
PPTOD is a unified plug-and-play model for task-oriented dialogue.
We extensively test our model on three benchmark TOD tasks, including end-to-end dialogue modelling, dialogue state tracking, and intent classification.
arXiv Detail & Related papers (2021-09-29T22:02:18Z) - CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented
Dialog Systems [56.302581679816775]
This paper proposes Comprehensive Instruction (CINS) that exploits PLMs with task-specific instructions.
We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD.
Experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data.
arXiv Detail & Related papers (2021-09-10T03:23:06Z) - A Simple Language Model for Task-Oriented Dialogue [61.84084939472287]
SimpleTOD is a simple approach to task-oriented dialogue that uses a single, causal language model trained on all sub-tasks recast as a single sequence prediction problem.
This allows SimpleTOD to fully leverage transfer learning from pre-trained, open domain, causal language models such as GPT-2.
arXiv Detail & Related papers (2020-05-02T11:09:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.