Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue
- URL: http://arxiv.org/abs/2210.04443v2
- Date: Tue, 11 Oct 2022 20:16:41 GMT
- Title: Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue
- Authors: So Yeon Min, Hao Zhu, Ruslan Salakhutdinov and Yonatan Bisk
- Abstract summary: Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange.
This paper argues that imitation learning (IL) and related low-level metrics are actually misleading and do not align with the goals of embodied dialogue research.
- Score: 92.01165203498299
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied dialogue instruction following requires an agent to complete a
complex sequence of tasks from a natural language exchange. The recent
introduction of benchmarks (Padmakumar et al., 2022) raises the question of how
best to train and evaluate models for this multi-turn, multi-agent,
long-horizon task. This paper contributes to that conversation by arguing that
imitation learning (IL) and related low-level metrics are actually misleading,
do not align with the goals of embodied dialogue research, and may hinder
progress. We provide empirical comparisons of metrics, analysis of three
models, and make suggestions for how the field might best progress. First, we
observe that models trained with IL take spurious actions during evaluation.
Second, we find that existing models fail to ground query utterances, which are
essential for task completion. Third, we argue evaluation should focus on
higher-level semantic goals.
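To make the paper's argument concrete, here is a minimal, hypothetical Python sketch (not taken from the paper's code) contrasting a low-level action-matching score, which rewards copying the demonstrator's exact trajectory, with a higher-level goal-condition success score that only checks whether the task's semantic goals hold in the final state. All function and variable names are invented for illustration.

```python
# Hypothetical illustration: low-level imitation metric vs. high-level goal success.
# Names and data are invented; nothing here comes from the paper's codebase.

from typing import Callable, Dict, List


def action_match_score(predicted: List[str], demonstration: List[str]) -> float:
    """Low-level metric: fraction of steps where the agent copies the human
    demonstrator's exact action (the kind of IL-style metric the paper
    argues is misleading)."""
    if not demonstration:
        return 0.0
    return sum(p == d for p, d in zip(predicted, demonstration)) / len(demonstration)


def goal_condition_success(final_state: Dict[str, str],
                           goal_conditions: List[Callable[[Dict[str, str]], bool]]) -> float:
    """Higher-level metric: fraction of semantic goal conditions satisfied in
    the final environment state, regardless of the path taken."""
    if not goal_conditions:
        return 0.0
    return sum(cond(final_state) for cond in goal_conditions) / len(goal_conditions)


if __name__ == "__main__":
    # The agent reaches the goal by a different route than the demonstrator.
    demo = ["turn_left", "forward", "forward", "pickup(mug)", "place(mug, sink)"]
    pred = ["forward", "turn_right", "forward", "pickup(mug)", "place(mug, sink)"]
    final_state = {"mug": "sink"}
    goals = [lambda s: s.get("mug") == "sink"]

    print("action match:", action_match_score(pred, demo))               # low, despite success
    print("goal success:", goal_condition_success(final_state, goals))   # 1.0
```

In this toy case the agent completes the task while matching few of the demonstrator's steps, exactly the situation in which low-level metrics can mislead.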
Related papers
- Ranking Large Language Models without Ground Truth [24.751931637152524]
Evaluation and ranking of large language models (LLMs) has become an important problem with the proliferation of these models.
We provide a novel perspective where, given a dataset of prompts, we rank a set of LLMs without access to any ground truth or reference responses.
Applying this idea repeatedly, we propose two methods to rank LLMs.
arXiv Detail & Related papers (2024-02-21T00:49:43Z)
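The summary above does not describe the ranking procedure itself, so the following is only a generic, hedged sketch of reference-free ranking: candidate models are compared pairwise by a judge on each prompt and ranked by accumulated wins. The judge interface, model names, and aggregation are assumptions, not the paper's actual methods.

```python
# Hypothetical sketch of reference-free ranking: models are judged head-to-head
# and ranked by total wins. Not the method from the cited paper.

from collections import Counter
from itertools import permutations
from typing import Callable, Dict, List

Model = Callable[[str], str]             # prompt -> response
Judge = Callable[[str, str, str], int]   # (prompt, resp_a, resp_b) -> 0 if a wins, else 1


def rank_without_references(models: Dict[str, Model],
                            prompts: List[str],
                            judge_with: Judge) -> List[str]:
    wins = Counter({name: 0 for name in models})
    for prompt in prompts:
        responses = {name: m(prompt) for name, m in models.items()}
        # Every ordered pair of distinct models is compared by the judge.
        for a, b in permutations(models, 2):
            winner = a if judge_with(prompt, responses[a], responses[b]) == 0 else b
            wins[winner] += 1
    return [name for name, _ in wins.most_common()]
```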
- Investigating the Efficacy of Large Language Models in Reflective Assessment Methods through Chain of Thoughts Prompting [0.2552922646705803]
The Chain of Thought (CoT) prompting method has been proposed as a means to enhance LLMs' proficiency in complex reasoning tasks.
The primary aim of this research is to assess how well four language models can grade reflective essays of third-year medical students.
arXiv Detail & Related papers (2023-09-30T06:25:27Z)
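As a purely illustrative sketch of chain-of-thought prompting for essay grading: the study's actual rubric and prompt wording are not given in the summary, so every detail below is invented.

```python
# Hypothetical chain-of-thought grading prompt; the rubric and wording are
# invented for illustration and are not taken from the cited study.

COT_GRADING_PROMPT = """You are grading a reflective essay written by a medical student.

Rubric (0-5 each): depth of reflection, link to clinical experience, clarity.

Essay:
{essay}

Think step by step:
1. Summarise the main reflective points of the essay.
2. Assess each rubric criterion, citing evidence from the essay.
3. Only then output a line of the form "FINAL SCORE: <0-15>".
"""


def build_grading_prompt(essay: str) -> str:
    # Fill the single template slot with the essay text.
    return COT_GRADING_PROMPT.format(essay=essay)
```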
- Stabilized In-Context Learning with Pre-trained Language Models for Few Shot Dialogue State Tracking [57.92608483099916]
Large pre-trained language models (PLMs) have shown impressive unaided performance across many NLP tasks.
For more complex tasks such as dialogue state tracking (DST), designing prompts that reliably convey the desired intent is nontrivial.
We introduce a saliency model to limit dialogue text length, allowing us to include more exemplars per query.
arXiv Detail & Related papers (2023-02-12T15:05:10Z)
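The saliency model itself is not specified in the summary; the sketch below only illustrates the general pattern of scoring dialogue turns, dropping the least salient ones, and spending the freed prompt budget on extra exemplars. The scorer, token budget, and packing scheme are assumptions, not the paper's implementation.

```python
# Hypothetical sketch: keep only the most salient dialogue turns so that more
# in-context exemplars fit within a fixed prompt budget. The saliency scorer
# here is a stand-in, not the model from the cited paper.

from typing import Callable, List


def compress_dialogue(turns: List[str],
                      saliency: Callable[[str], float],
                      max_tokens: int) -> List[str]:
    """Drop the least salient turns until the (rough, whitespace-token)
    length fits under max_tokens, preserving the original turn order."""
    kept = list(turns)
    while kept and sum(len(t.split()) for t in kept) > max_tokens:
        kept.remove(min(kept, key=saliency))
    return kept


def build_prompt(exemplars: List[str],
                 dialogue: List[str],
                 saliency: Callable[[str], float],
                 budget: int = 2048) -> str:
    # Half the (assumed) budget goes to the compressed dialogue; the rest
    # is left free for additional exemplars.
    compressed = compress_dialogue(dialogue, saliency, budget // 2)
    return "\n\n".join(exemplars + ["\n".join(compressed)])
```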
- DialogZoo: Large-Scale Dialog-Oriented Task Learning [52.18193690394549]
We aim to build a unified foundation model which can solve massive diverse dialogue tasks.
To achieve this goal, we first collect a large-scale well-labeled dialogue dataset from 73 publicly available datasets.
arXiv Detail & Related papers (2022-05-25T11:17:16Z)
- WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue [17.663449579168297]
We simulate a dialogue in which an agent and a user (modelled, like the agent, with a supervised learning objective) interact with each other.
The agent uses dynamic blocking to generate ranked diverse responses and exploration-exploitation to select among the Top-K responses.
Empirical studies on two benchmarks indicate that our model significantly improves response quality and leads to more successful conversations.
arXiv Detail & Related papers (2021-08-01T08:00:45Z)
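One common instantiation of exploration-exploitation over Top-K candidates is epsilon-greedy selection against an estimated reward; the sketch below shows that generic pattern with hypothetical names and is not necessarily WeaSuL's exact mechanism.

```python
# Hypothetical epsilon-greedy selection over Top-K candidate responses.
# The reward estimator and epsilon value are stand-ins, not WeaSuL's.

import random
from typing import Callable, List


def select_response(candidates: List[str],
                    estimate_reward: Callable[[str], float],
                    epsilon: float = 0.1) -> str:
    """With probability epsilon, explore a random candidate; otherwise
    exploit the candidate with the highest estimated reward."""
    if not candidates:
        raise ValueError("no candidate responses to select from")
    if random.random() < epsilon:
        return random.choice(candidates)
    return max(candidates, key=estimate_reward)
```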
- The Interplay of Task Success and Dialogue Quality: An in-depth Evaluation in Task-Oriented Visual Dialogues [6.02280861819024]
We show that in the popular end-to-end approach, focusing only on task success prevents the model from learning to generate linguistically richer dialogues.
We show that in GuessWhat, models could increase their accuracy if they also learned to ground, encode, and decode words that occur infrequently in the training set.
arXiv Detail & Related papers (2021-03-20T10:13:30Z)
- Probing Task-Oriented Dialogue Representation from Language Models [106.02947285212132]
This paper investigates pre-trained language models to find out which model intrinsically carries the most informative representation for task-oriented dialogue tasks.
We fine-tune a feed-forward layer as the classifier probe on top of a fixed pre-trained language model with annotated labels in a supervised way.
arXiv Detail & Related papers (2020-10-26T21:34:39Z)
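A probing classifier of this kind is typically a small trainable feed-forward layer on top of a frozen pre-trained encoder. The sketch below assumes a HuggingFace-style encoder whose output exposes last_hidden_state; the pooling, label set, and encoder choice are illustrative assumptions rather than the paper's exact setup.

```python
# Generic probing setup: a trainable feed-forward classifier on top of a
# frozen pre-trained encoder. Encoder interface, pooling, and label count
# are illustrative assumptions, not the cited paper's configuration.

import torch
import torch.nn as nn


class ProbeClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_size: int, num_labels: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # freeze the pre-trained model
            p.requires_grad = False
        self.probe = nn.Linear(hidden_size, num_labels)  # only this layer trains

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            hidden = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, 0]        # first-token ([CLS]-style) pooling
        return self.probe(pooled)    # logits for the probing task
```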
- Ranking Enhanced Dialogue Generation [77.8321855074999]
How to effectively utilize the dialogue history is a crucial problem in multi-turn dialogue generation.
Previous works usually employ various neural network architectures to model the history.
This paper proposes a Ranking Enhanced Dialogue generation framework.
arXiv Detail & Related papers (2020-08-13T01:49:56Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
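A reference-free metric along these lines typically scores a candidate response by comparing its latent representation to that of the dialogue context. The sketch below uses a generic embedding function and cosine similarity as stand-ins for the learned components described in the paper.

```python
# Hypothetical reference-free response scoring: embed the dialogue context and
# the candidate response, then score their similarity. The embedding function
# is a placeholder for the learned components in the cited paper.

import math
from typing import Callable, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def unreferenced_score(context: List[str], response: str,
                       embed: Callable[[str], Sequence[float]]) -> float:
    """Score a response against the dialogue context alone; no gold
    reference response is required at inference time."""
    ctx_vec = embed(" ".join(context))
    return cosine(ctx_vec, embed(response))
```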
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.