Visual Goal-Step Inference using wikiHow
- URL: http://arxiv.org/abs/2104.05845v1
- Date: Mon, 12 Apr 2021 22:20:09 GMT
- Title: Visual Goal-Step Inference using wikiHow
- Authors: Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar,
Chris Callison-Burch
- Abstract summary: Inferring the sub-sequence of steps of a goal can help artificial intelligence systems reason about human activities.
We propose the Visual Goal-Step Inference (VGSI) task where a model is given a textual goal and must choose a plausible step towards that goal from among four candidate images.
We show that the knowledge learned from our data can effectively transfer to other datasets like HowTo100M, increasing the multiple-choice accuracy by 15% to 20%.
- Score: 29.901908251322684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Procedural events can often be thought of as a high level goal composed of a
sequence of steps. Inferring the sub-sequence of steps of a goal can help
artificial intelligence systems reason about human activities. Past work in NLP
has examined the task of goal-step inference for text. We introduce the visual
analogue. We propose the Visual Goal-Step Inference (VGSI) task where a model
is given a textual goal and must choose a plausible step towards that goal from
among four candidate images. Our task is challenging for state-of-the-art
multimodal models. We introduce a novel dataset harvested from wikiHow that
consists of 772,294 images representing human actions. We show that the
knowledge learned from our data can effectively transfer to other datasets like
HowTo100M, increasing the multiple-choice accuracy by 15% to 20%. Our task will
facilitate multi-modal reasoning about procedural events.
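To make the task format concrete, the sketch below scores the four candidate step images against the textual goal with a generic image-text dual encoder and picks the highest-scoring one. It uses an off-the-shelf CLIP model from Hugging Face transformers purely as an illustrative stand-in, not the paper's own wikiHow-trained models; the goal string and image paths are hypothetical.

```python
# Minimal sketch of the VGSI multiple-choice format: given a textual goal,
# rank four candidate step images and pick the most plausible one.
# Assumes an off-the-shelf CLIP dual encoder as a stand-in scorer; the paper's
# own models and the 772,294-image wikiHow dataset are not reproduced here.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

goal = "How to make French toast"        # textual goal (hypothetical example)
candidate_paths = [                      # hypothetical candidate step images
    "step_a.jpg", "step_b.jpg", "step_c.jpg", "step_d.jpg",
]
images = [Image.open(p).convert("RGB") for p in candidate_paths]

inputs = processor(text=[goal], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (1, 4): similarity of the goal to each candidate image.
scores = outputs.logits_per_text[0]
prediction = int(scores.argmax())
print(f"Predicted step image: {candidate_paths[prediction]}")
```

Under this formulation, multiple-choice accuracy on a transfer dataset such as HowTo100M-derived examples is simply the fraction of goals for which the top-scoring candidate is the annotated correct step.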
Related papers
- ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions [66.20773952864802]
We develop a dataset consisting of 8.5k images and 59.3k inferences about actions grounded in those images.
We propose ActionCOMET, a framework to discern knowledge present in language models specific to the provided visual input.
arXiv Detail & Related papers (2024-10-17T15:22:57Z)
- SAGA: A Participant-specific Examination of Story Alternatives and Goal Applicability for a Deeper Understanding of Complex Events [13.894639630989563]
We argue that such knowledge can be elicited through a participant achievement lens.
We analyze a complex event in a narrative according to the intended achievements of the participants.
We show that smaller models fine-tuned on our dataset can achieve performance surpassing larger models.
arXiv Detail & Related papers (2024-08-11T14:52:40Z)
- Pretrained Language Models as Visual Planners for Human Assistance [12.8775186900555]
Visual Planning for Assistance (VPA) is a tool for guiding users to achieve complex multi-step goals.
We decompose VPA into video action segmentation and forecasting.
This novel approach, which we call Visual Language Model based Planner (VLaMP), outperforms baselines across a suite of metrics.
arXiv Detail & Related papers (2023-04-17T18:07:36Z)
- Are All Steps Equally Important? Benchmarking Essentiality Detection of Events [92.92425231146433]
This paper examines the extent to which current models comprehend the essentiality of step events in relation to a goal event.
We contribute a high-quality corpus of (goal, step) pairs gathered from the community guideline website WikiHow.
The high inter-annotator agreement demonstrates that humans possess a consistent understanding of event essentiality.
arXiv Detail & Related papers (2022-10-08T18:00:22Z)
- Multimedia Generative Script Learning for Task Planning [58.73725388387305]
We propose a new task, Multimedia Generative Script Learning, to generate subsequent steps by tracking historical states in both text and vision modalities.
This task is challenging in three aspects: the multimedia challenge of capturing the visual states in images, the induction challenge of performing unseen tasks, and the diversity challenge of covering different information in individual steps.
Experiment results demonstrate that our approach significantly outperforms strong baselines.
arXiv Detail & Related papers (2022-08-25T19:04:28Z)
- Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation? [54.442692221567796]
Task specification is critical for engagement of non-expert end-users and adoption of personalized robots.
A widely studied approach to task specification is through goals, using either compact state vectors or goal images from the same robot scene.
In this work, we explore alternate and more general forms of goal specification that are expected to be easier for humans to specify and use.
arXiv Detail & Related papers (2022-04-23T19:39:49Z)
- Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z)
- Automatic Curriculum Learning through Value Disagreement [95.19299356298876]
Continually solving new, unsolved tasks is the key to learning diverse behaviors.
In the multi-task domain, where an agent needs to reach multiple goals, the choice of training goals can largely affect sample efficiency.
We propose setting up an automatic curriculum for goals that the agent needs to solve.
We evaluate our method across 13 multi-goal robotic tasks and 5 navigation tasks, and demonstrate performance gains over current state-of-the-art methods.
arXiv Detail & Related papers (2020-06-17T03:58:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.