VTQA: Visual Text Question Answering via Entity Alignment and
Cross-Media Reasoning
- URL: http://arxiv.org/abs/2303.02635v1
- Date: Sun, 5 Mar 2023 10:32:26 GMT
- Title: VTQA: Visual Text Question Answering via Entity Alignment and
Cross-Media Reasoning
- Authors: Kang Chen, Xiangqian Wu
- Abstract summary: We present a new challenge with a dataset that contains 23,781 questions based on 10,124 image-text pairs.
The aim of this challenge is to develop and benchmark models that are capable of multimedia entity alignment, multi-step reasoning and open-ended answer generation.
- Score: 21.714382546678053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ideal form of Visual Question Answering requires understanding, grounding
and reasoning in the joint space of vision and language and serves as a proxy
for the AI task of scene understanding. However, most existing VQA benchmarks
are limited to just picking the answer from a pre-defined set of options and
lack attention to text. We present a new challenge with a dataset that contains
23,781 questions based on 10,124 image-text pairs. Specifically, the task
requires the model to align multimedia representations of the same entity,
perform multi-hop reasoning between image and text, and finally answer the
question in natural language. The aim of this challenge is to develop and
benchmark models that are capable of multimedia entity alignment, multi-step
reasoning and open-ended answer generation.
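To make the task setup concrete, below is a minimal Python sketch of how an image-text QA instance and an open-ended answer scorer could be structured. The field names, the predict() interface, and the exact-match metric are illustrative assumptions, not the released VTQA format or its official evaluation.

```python
# Minimal sketch of an image-text QA instance and an exact-match scorer.
# Field names, the predict() interface, and the metric are assumptions for
# illustration only, not the released VTQA format or official evaluation.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ImageTextQAExample:
    image_path: str  # the image half of the image-text pair
    text: str        # the passage that mentions some of the same entities
    question: str    # requires aligning entities across image and text
    answer: str      # free-form natural-language answer


def exact_match_accuracy(
    examples: Iterable[ImageTextQAExample],
    predict: Callable[[str, str, str], str],
) -> float:
    """Score open-ended answers with case-insensitive exact match."""
    examples = list(examples)
    correct = sum(
        predict(ex.image_path, ex.text, ex.question).strip().lower()
        == ex.answer.strip().lower()
        for ex in examples
    )
    return correct / max(len(examples), 1)
```

In practice, open-ended answers are usually scored with softer metrics than exact match (for example, token-level overlap), so the scorer above is only a placeholder for whatever metric the challenge specifies.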
Related papers
- CommVQA: Situating Visual Question Answering in Communicative Contexts [16.180130883242672]
We introduce CommVQA, a dataset consisting of images, image descriptions, and real-world communicative scenarios in which each image might appear.
We show that access to contextual information is essential for solving CommVQA, with context leading to the highest-performing VQA models.
arXiv Detail & Related papers (2024-02-22T22:31:39Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, namely CLIP image representations and the scaling of language models, do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z)
- Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue [0.0]
Building an interactive artificial intelligence that can ask questions about the real world is one of the biggest challenges for vision and language problems.
We propose a novel Questioner architecture called the Unified Questioner Transformer (UniQer).
We build a goal-oriented visual dialogue task called CLEVR Ask. It synthesizes complex scenes that require the Questioner to generate descriptive questions.
arXiv Detail & Related papers (2021-06-29T16:36:34Z)
- MultiModalQA: Complex Question Answering over Text, Tables and Images [52.25399438133274]
We present MultiModalQA: a dataset that requires joint reasoning over text, tables and images.
We create MMQA using a new framework for generating complex multi-modal questions at scale.
We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.
arXiv Detail & Related papers (2021-04-13T09:14:28Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
- Multimodal grid features and cell pointers for Scene Text Visual Question Answering [7.834170106487722]
This paper presents a new model for the task of scene text visual question answering.
It is based on an attention mechanism that attends to multi-modal features conditioned on the question.
Experiments demonstrate competitive performance in two standard datasets.
arXiv Detail & Related papers (2020-06-01T13:17:44Z)
- Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text [93.08109196909763]
We propose a novel VQA approach, the Multi-Modal Graph Neural Network (MM-GNN).
It first represents an image as a graph consisting of three sub-graphs that depict the visual, semantic, and numeric modalities, respectively.
It then introduces three aggregators that guide message passing from one sub-graph to another to exploit the context available in the various modalities (a minimal aggregation sketch follows this list).
arXiv Detail & Related papers (2020-03-31T05:56:59Z)
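As a rough illustration of the cross-graph aggregation idea described in the MM-GNN entry above, the sketch below passes question-guided messages from a source sub-graph to a target sub-graph. The dimensions, the single-head attention form, and the residual update are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the MM-GNN authors' code) of question-guided
# message passing from one sub-graph's nodes to another's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossGraphAggregator(nn.Module):
    """Updates target nodes with attention over source nodes, guided by the question."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim * 2, dim)   # [target node; question] -> query
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.update = nn.Linear(dim * 2, dim)  # [target node; message] -> update

    def forward(self, tgt: torch.Tensor, src: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # tgt: (Nt, d) target-graph nodes, src: (Ns, d) source-graph nodes, q: (d,) question vector
        qry = self.query(torch.cat([tgt, q.expand(tgt.size(0), -1)], dim=-1))   # (Nt, d)
        att = F.softmax(qry @ self.key(src).t() / tgt.size(-1) ** 0.5, dim=-1)  # (Nt, Ns)
        msg = att @ self.value(src)                                             # (Nt, d)
        return tgt + torch.relu(self.update(torch.cat([tgt, msg], dim=-1)))     # residual update


# Toy usage with three hypothetical sub-graphs (visual, semantic, numeric) in a 256-d space.
d = 256
visual, semantic, numeric = torch.randn(36, d), torch.randn(12, d), torch.randn(4, d)
question = torch.randn(d)
visual = CrossGraphAggregator(d)(visual, semantic, question)     # semantic -> visual
semantic = CrossGraphAggregator(d)(semantic, numeric, question)  # numeric -> semantic
```

The choice of a single attention head and a ReLU residual update keeps the sketch short; the actual model may use different aggregators per modality pair and different update rules.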
This list is automatically generated from the titles and abstracts of the papers on this site.