MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants
- URL: http://arxiv.org/abs/2110.06416v1
- Date: Wed, 13 Oct 2021 00:57:05 GMT
- Title: MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants
- Authors: Alkesh Patel, Joel Ruben Antony Moniz, Roman Nguyen, Nick Tzou, Hadas
Kotek, Vincent Renkens
- Abstract summary: We provide a new dataset, MMIU (MultiModal Intent Understanding), that contains questions and corresponding intents provided by human annotators while looking at images.
We then use this dataset for the intent classification task in a multimodal digital assistant.
- Score: 4.322454918650574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In a multimodal assistant, where vision is also one of the input modalities,
identifying the user intent becomes a challenging task because the visual input
can influence the outcome. Current digital assistants take spoken input and try
to determine the user intent from conversational or device context. As a result,
a dataset that includes visual input (i.e., images or videos) for the corresponding
questions targeted at multimodal assistant use cases is not readily available. The
research in visual question answering (VQA) and visual question generation (VQG)
is a great step forward. However, those datasets do not capture the questions that a
visually-abled person would ask a multimodal assistant. Moreover, their questions
often do not seek information from external knowledge.
In this paper, we provide a new dataset, MMIU (MultiModal Intent
Understanding), that contains questions and corresponding intents provided by
human annotators while looking at images. We then use this dataset for the intent
classification task in a multimodal digital assistant. We also experiment with
various approaches for combining vision and language features, including the use
of a multimodal transformer, to classify image-question pairs into 14 intents. We
provide benchmark results and discuss the role of visual and text features in the
intent classification task on our dataset.
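The abstract does not detail the classification architectures, so the sketch below is only a rough illustration of a simple late-fusion baseline of the kind commonly used for image-question intent classification: pre-extracted image and text feature vectors are concatenated and passed through a small feed-forward classifier over the 14 intents. The feature dimensions, layer sizes, and random inputs are assumptions for illustration, not the authors' implementation; the multimodal transformer mentioned in the abstract would instead attend jointly over image and question representations before classification.
```python
# Minimal late-fusion intent classifier sketch (illustrative, not the paper's model).
# Assumes image and text features were already extracted by separate encoders.
import torch
import torch.nn as nn

NUM_INTENTS = 14                 # from the MMIU abstract
IMG_DIM, TXT_DIM = 2048, 768     # assumed feature sizes (e.g., CNN pooled features, BERT [CLS])

class LateFusionIntentClassifier(nn.Module):
    def __init__(self, img_dim=IMG_DIM, txt_dim=TXT_DIM, hidden=512, num_intents=NUM_INTENTS):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, num_intents),
        )

    def forward(self, img_feat, txt_feat):
        # Concatenate the two modality features and score the 14 intent classes.
        return self.fuse(torch.cat([img_feat, txt_feat], dim=-1))

# Toy usage with random tensors standing in for real encoder outputs.
model = LateFusionIntentClassifier()
img_feat = torch.randn(4, IMG_DIM)
txt_feat = torch.randn(4, TXT_DIM)
logits = model(img_feat, txt_feat)   # shape: (4, 14)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, NUM_INTENTS, (4,)))
```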
Related papers
- Asking Multimodal Clarifying Questions in Mixed-Initiative Conversational Search [89.1772985740272]
In mixed-initiative conversational search systems, clarifying questions are used to help users who struggle to express their intentions in a single query.
We hypothesize that in scenarios where multimodal information is pertinent, the clarification process can be improved by using non-textual information.
We collect a dataset named Melon that contains over 4k multimodal clarifying questions, enriched with over 14k images.
Several analyses are conducted to understand the importance of multimodal contents during the query clarification phase.
arXiv Detail & Related papers (2024-02-12T16:04:01Z)
- IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks [124.90137528319273]
In this paper, we present IMProv, a generative model that is able to in-context learn visual tasks from multimodal prompts.
We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions.
During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output.
arXiv Detail & Related papers (2023-12-04T09:48:29Z)
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion that identifies objects in images based on questions.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating content from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese [2.7528170226206443]
We introduce the OpenViVQA dataset, the first large-scale dataset for visual question answering in Vietnamese.
The dataset consists of 11,000+ images associated with 37,000+ question-answer pairs (QAs).
Our proposed methods achieve results that are competitive with SOTA models such as SAAA, MCAN, LORA, and M4C.
arXiv Detail & Related papers (2023-05-07T03:59:31Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on the LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering [43.07139534653485]
We present Answer-Me, a task-aware multi-task framework.
We pre-train a vision-language joint model, which is multi-task as well.
Results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results.
arXiv Detail & Related papers (2022-05-02T14:53:13Z)
- Generating Natural Questions from Images for Multimodal Assistants [4.930442416763205]
We present an approach for generating diverse and meaningful questions that consider the image content and image metadata.
We evaluate our approach using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr; a minimal sketch of computing such reference-based metrics appears after this list.
arXiv Detail & Related papers (2020-11-17T19:12:23Z)
- CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP).
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need.
arXiv Detail & Related papers (2020-11-09T09:23:55Z)
- Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
- Multi-View Attention Network for Visual Dialog [5.731758300670842]
It is necessary for an agent to 1) determine the semantic intent of the question and 2) align question-relevant textual and visual content.
We propose Multi-View Attention Network (MVAN), which leverages multiple views about heterogeneous inputs.
MVAN effectively captures the question-relevant information from the dialog history with two complementary modules.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
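As referenced in the question-generation entry above, several of these works report reference-based text metrics (BLEU, METEOR, ROUGE, CIDEr). The snippet below is a minimal sketch of corpus-level BLEU using NLTK; the example sentences are made-up placeholders, and METEOR, ROUGE, and CIDEr require additional packages (e.g., rouge-score, pycocoevalcap) that are not shown here.
```python
# Minimal sketch: corpus-level BLEU for generated questions vs. references (NLTK).
# METEOR, ROUGE, and CIDEr need extra packages and are omitted from this sketch.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Illustrative placeholder data: one list of reference token lists per generated question.
references = [
    [["what", "breed", "is", "this", "dog"]],
    [["who", "painted", "this", "picture"]],
]
hypotheses = [
    ["what", "kind", "of", "dog", "is", "this"],
    ["who", "painted", "this", "picture"],
]

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
bleu = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"BLEU: {bleu:.3f}")
```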