Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task
Feasibility in Interactive Visual Environments
- URL: http://arxiv.org/abs/2104.08560v1
- Date: Sat, 17 Apr 2021 14:48:02 GMT
- Title: Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task
Feasibility in Interactive Visual Environments
- Authors: Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate
Saenko, Bryan A. Plummer
- Abstract summary: We introduce Mobile app Tasks with Iterative Feedback (MoTIF), a dataset with natural language commands for the greatest number of interactive environments to date.
MoTIF is the first to contain natural language requests for interactive environments that are not satisfiable.
We perform initial feasibility classification experiments and only reach an F1 score of 37.3, verifying the need for richer vision-language representations.
- Score: 54.405920619915655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, vision-language research has shifted to study tasks which
require more complex reasoning, such as interactive question answering, visual
common sense reasoning, and question-answer plausibility prediction. However,
the datasets used for these problems fail to capture the complexity of real
inputs and multimodal environments, such as ambiguous natural language requests
and diverse digital domains. We introduce Mobile app Tasks with Iterative
Feedback (MoTIF), a dataset with natural language commands for the greatest
number of interactive environments to date. MoTIF is the first to contain
natural language requests for interactive environments that are not
satisfiable, and we obtain follow-up questions on this subset to enable
research on task uncertainty resolution. We perform initial feasibility
classification experiments and only reach an F1 score of 37.3, verifying the
need for richer vision-language representations and improved architectures to
reason about task feasibility.
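To make the feasibility task above concrete, here is a minimal sketch of the classification-plus-F1 setup, with placeholder features standing in for the vision-language representations the paper argues are needed. The embedding size, synthetic labels, and logistic-regression probe are illustrative assumptions, not MoTIF's actual pipeline.

```python
# Hedged sketch of binary task-feasibility classification scored with F1.
# The features below are random stand-ins for (command, app screen) embeddings;
# the paper's real experiments use learned vision-language representations and
# reach only F1 = 37.3, motivating richer features and architectures.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))    # hypothetical joint embeddings
y = rng.integers(0, 2, size=1000)   # 1 = request is satisfiable, 0 = not

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# F1 = 2 * precision * recall / (precision + recall)
print(f"feasibility F1: {f1_score(y_test, clf.predict(X_test)):.3f}")
```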
Related papers
- INQUIRE: A Natural World Text-to-Image Retrieval Benchmark [51.823709631153946]
We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries.
INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries.
Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full-dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining the top-100 retrievals; a minimal sketch of both settings follows this entry.
arXiv Detail & Related papers (2024-11-04T19:16:53Z)
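As referenced in the entry above, here is a minimal, assumption-heavy sketch of the two INQUIRE settings: full-dataset ranking scores every image against the query by cosine similarity of embeddings, and reranking re-scores only the top-100 of that ranking. The CLIP-style embeddings, their dimensions, and `expensive_scorer` are hypothetical stand-ins, not the benchmark's actual models.

```python
# Hedged sketch of embedding-based text-to-image retrieval: (1) full ranking,
# then (2) reranking the top-100 with a slower scorer. All tensors are synthetic.
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

image_embs = l2_normalize(rng.normal(size=(5000, 256)))  # stand-in image embeddings
query_emb = l2_normalize(rng.normal(size=(256,)))        # stand-in query embedding

# (1) Fullrank: cosine similarity of the query to every image, sorted descending.
fullrank = np.argsort(-(image_embs @ query_emb))

# (2) Rerank: re-score only the top-100 with a more expensive scorer
# (a placeholder here; in practice e.g. a cross-encoder).
def expensive_scorer(query, images):
    return images @ query + 0.01 * rng.normal(size=len(images))

top100 = fullrank[:100]
reranked = top100[np.argsort(-expensive_scorer(query_emb, image_embs[top100]))]
print("top-5 image indices after rerank:", reranked[:5])
```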
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on the questions posed.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- Interactive Natural Language Processing [67.87925315773924]
Interactive Natural Language Processing (iNLP) has emerged as a novel paradigm within the field of NLP.
This paper offers a comprehensive survey of iNLP, starting by proposing a unified definition and framework of the concept.
arXiv Detail & Related papers (2023-05-22T17:18:29Z)
- PRESTO: A Multilingual Dataset for Parsing Realistic Task-Oriented Dialogs [39.58414649004708]
PRESTO is a dataset of over 550K contextual multilingual conversations between humans and virtual assistants.
It contains challenges that occur in real-world NLU tasks such as disfluencies, code-switching, and revisions.
Our mT5-based baselines demonstrate that the conversational phenomena present in PRESTO are challenging to model.
arXiv Detail & Related papers (2023-03-15T21:51:13Z)
- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering [70.6359636116848]
We propose a new large-scale dataset, ConvFinQA, to study the chain of numerical reasoning in conversational question answering.
Our dataset poses a great challenge in modeling long-range, complex numerical reasoning paths in real-world conversations.
arXiv Detail & Related papers (2022-10-07T23:48:50Z)
- FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue [70.65782786401257]
This work explores conversational task transfer by introducing FETA: a benchmark for few-sample task transfer in open-domain dialogue.
FETA contains two underlying sets of conversations, annotated with 10 and 7 tasks respectively, enabling the study of intra-dataset task transfer.
We utilize three popular language models and three learning algorithms to analyze the transferability between 132 source-target task pairs.
arXiv Detail & Related papers (2022-05-12T17:59:00Z)
- Interactive Mobile App Navigation with Uncertain or Under-specified Natural Language Commands [47.282510186109775]
We introduce Mobile app Tasks with Iterative Feedback (MoTIF), a new dataset where the goal is to complete a natural language query in a mobile app.
Current datasets for related tasks in interactive question answering, visual common sense reasoning, and question-answer plausibility prediction do not support research in resolving ambiguous natural language requests.
MoTIF contains natural language requests that are not satisfiable; it is the first work to investigate this issue for interactive vision-language tasks.
arXiv Detail & Related papers (2022-02-04T18:51:50Z)