On the Limits of Evaluating Embodied Agent Model Generalization Using
Validation Sets
- URL: http://arxiv.org/abs/2205.09249v1
- Date: Wed, 18 May 2022 23:52:21 GMT
- Title: On the Limits of Evaluating Embodied Agent Model Generalization Using
Validation Sets
- Authors: Hyounghun Kim, Aishwarya Padmakumar, Di Jin, Mohit Bansal, Dilek
Hakkani-Tur
- Abstract summary: This paper experiments with augmenting a transformer model with modules that effectively utilize a wider field of view and learn to choose whether the next step requires a navigation or manipulation action.
We observe that the proposed modules result in improved, and in fact state-of-the-art, performance on an unseen validation set of a popular benchmark dataset, ALFRED.
We highlight this result as we believe it may be a wider phenomenon in machine learning tasks but primarily noticeable only in benchmarks that limit evaluations on test splits.
- Score: 101.28658250723804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language guided embodied task completion is a challenging problem
since it requires understanding natural language instructions, aligning them
with egocentric visual observations, and choosing appropriate actions to
execute in the environment to produce desired changes. We experiment with
augmenting a transformer model for this task with modules that effectively
utilize a wider field of view and learn to choose whether the next step
requires a navigation or manipulation action. We observed that the proposed
modules resulted in improved, and in fact state-of-the-art, performance on an
unseen validation set of a popular benchmark dataset, ALFRED. However, our best
model selected using the unseen validation set underperforms on the unseen test
split of ALFRED, indicating that performance on the unseen validation set may
not in itself be a sufficient indicator of whether model improvements
generalize to unseen test sets. We highlight this result as we believe it may
be a wider phenomenon in machine learning tasks, primarily noticeable only in
benchmarks that limit evaluations on test splits, and it highlights the need to
modify benchmark design to better account for variance in model performance.
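The abstract describes a module that decides whether the agent's next step should be a navigation or a manipulation action. A minimal sketch of such an action-type selector head is shown below; the fused feature vector, weight shapes, and parameter values are all illustrative assumptions, not details from the paper.

```python
import math

# Hypothetical action-type selector head: given a fused instruction+vision
# feature vector, predict whether the next step is a navigation or a
# manipulation action. All names and sizes here are illustrative.

ACTION_TYPES = ["navigation", "manipulation"]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_action_type(features, weights, bias):
    # One logit per action type: logit_k = w_k . features + b_k
    logits = [
        sum(w * f for w, f in zip(weights[k], features)) + bias[k]
        for k in range(len(ACTION_TYPES))
    ]
    probs = softmax(logits)
    return ACTION_TYPES[probs.index(max(probs))], probs

# Toy parameters: 3-dim fused feature, 2 action types.
w = [[1.0, 0.0, -1.0], [-1.0, 0.5, 1.0]]
b = [0.0, 0.0]
choice, probs = select_action_type([0.2, 0.1, 0.9], w, b)
```

In the full model this choice would gate which decoder (navigation or manipulation) produces the concrete action; the sketch only shows the binary selection step.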
Related papers
- Explanatory Model Monitoring to Understand the Effects of Feature Shifts on Performance [61.06245197347139]
We propose a novel approach to explain the behavior of a black-box model under feature shifts.
We refer to our method that combines concepts from Optimal Transport and Shapley Values as Explanatory Performance Estimation.
arXiv Detail & Related papers (2024-08-24T18:28:19Z) - Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
SIMA is a framework that enhances visual and language modality alignment through self-improvement.
It employs an in-context self-critic mechanism to select response pairs for preference tuning.
We demonstrate that SIMA achieves superior modality alignment, outperforming previous approaches.
arXiv Detail & Related papers (2024-05-24T23:09:27Z) - DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning [75.68193159293425]
In-context learning (ICL) allows transformer-based language models to learn a specific task with a few "task demonstrations" without updating their parameters.
We propose an influence function-based attribution technique, DETAIL, that addresses the specific characteristics of ICL.
We experimentally prove the wide applicability of DETAIL by showing our attribution scores obtained on white-box models are transferable to black-box models in improving model performance.
arXiv Detail & Related papers (2024-05-22T15:52:52Z) - Increasing Performance And Sample Efficiency With Model-agnostic
Interactive Feature Attributions [3.0655581300025996]
We provide model-agnostic implementations for two popular explanation methods (Occlusion and Shapley values) to enforce entirely different attributions in the complex model.
We show how our proposed approach can significantly improve the model's performance only by augmenting its training dataset based on corrected explanations.
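One of the two explanation methods this entry mentions, Occlusion, is easy to sketch model-agnostically: the attribution of a feature is the drop in the model's score when that feature is replaced by a baseline value. The `model` function and baseline below are illustrative stand-ins for any black-box scorer, not the paper's implementation.

```python
# Minimal sketch of occlusion-based feature attribution. `model` is any
# black-box scoring function over a feature vector; the attribution of
# feature i is the score drop when feature i is set to a baseline value.

def occlusion_attributions(model, x, baseline=0.0):
    base_score = model(x)
    attributions = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline  # hide feature i
        attributions.append(base_score - model(occluded))
    return attributions

# Toy black-box model: a weighted sum, so attributions are recoverable by hand.
model = lambda v: 2.0 * v[0] + 0.5 * v[1] - 1.0 * v[2]
attrs = occlusion_attributions(model, [1.0, 2.0, 3.0])
```

A positive attribution means the feature raised the score; the paper's idea is to correct such attributions interactively and augment the training data accordingly.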
arXiv Detail & Related papers (2023-06-28T15:23:28Z) - Evaluating Representations with Readout Model Switching [19.907607374144167]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
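The online MDL computation this entry refers to can be illustrated with prequential coding: the description length of the labels is the summed negative log-probability each readout model assigns to the next label before being updated on it. The sketch below omits the switching over readout models and uses an assumed, illustrative probability stream.

```python
import math

# Minimal sketch of a prequential (online) MDL code length: sum the
# negative log2-probabilities the readout assigns to each observed label
# before training on it. Lower total bits = better representation under MDL.

def prequential_code_length(prob_stream):
    # prob_stream yields p(y_t | y_<t, x_<=t) for each observed label y_t.
    return sum(-math.log2(p) for p in prob_stream)

# Toy run: a readout that grows more confident as it sees more data.
probs = [0.5, 0.6, 0.8, 0.9]
bits = prequential_code_length(probs)
```

The full metric additionally switches among a family of readout models and charges the code length of the best switching strategy; this sketch only shows the per-prediction accounting.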
arXiv Detail & Related papers (2023-02-19T14:08:01Z) - Exploring validation metrics for offline model-based optimisation with
diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle.
While an approximation to the ground truth oracle can be trained and used in place of it during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples.
This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z) - ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented
Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z) - Are you doing what I say? On modalities alignment in ALFRED [6.46147328920679]
ALFRED requires a model to complete tasks in simulated house environments specified by instructions in natural language.
The key to success is accurately aligning the text with visual inputs.
We introduce approaches aimed at improving model alignment and demonstrate how improved alignment improves end-task performance.
arXiv Detail & Related papers (2021-10-12T01:05:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.