Learning Compositional Representation for Few-shot Visual Question Answering
- URL: http://arxiv.org/abs/2102.10575v1
- Date: Sun, 21 Feb 2021 10:16:24 GMT
- Title: Learning Compositional Representation for Few-shot Visual Question Answering
- Authors: Dalu Guo, Dacheng Tao
- Abstract summary: Current Visual Question Answering methods perform well on answers with abundant training data but have limited accuracy on novel answers with only a few examples.
We propose to extract attributes from answers that have sufficient data and compose them to constrain the learning of few-shot answers.
Experimental results on the VQA v2.0 validation dataset demonstrate the effectiveness of our proposed attribute network.
- Score: 93.4061107793983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current Visual Question Answering methods perform well on answers
with abundant training data but have limited accuracy on novel answers with few
examples. Humans, however, can quickly adapt to these new categories after just
a few glimpses, because they learn to organize concepts they have seen before
to figure out the novel class, an ability that deep learning methods have
hardly explored. Therefore, in this paper, we propose to extract attributes
from answers that have sufficient data and later compose them to constrain the
learning of few-shot answers. We generate a few-shot VQA dataset with a variety
of answers and their attributes without any human effort. With this dataset, we
build an attribute network that disentangles the attributes by learning their
features from parts of the image rather than the whole image. Experimental
results on the VQA v2.0 validation dataset demonstrate the effectiveness of the
proposed attribute network and of the constraint between answers and their
corresponding attributes, as well as the ability of our method to handle
answers with few training examples.
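To make the composition idea concrete, below is a minimal sketch of how attribute prediction from image regions and an attribute-based constraint on answer classification could be wired together. This is an illustrative reconstruction written in a PyTorch style, not the authors' released code: the names (AttributeNetwork, composition_loss, answer_to_attrs) and the specific pooling and loss choices are assumptions.

```python
# Hypothetical sketch of attribute-constrained few-shot answer learning.
# Not the paper's implementation; interface and loss weighting are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttributeNetwork(nn.Module):
    """Predicts per-attribute scores from region-level image features."""

    def __init__(self, region_dim, num_attributes, attr_dim):
        super().__init__()
        self.attr_embed = nn.Embedding(num_attributes, attr_dim)  # one embedding per attribute
        self.region_proj = nn.Linear(region_dim, attr_dim)        # project regions into attribute space

    def forward(self, region_feats):
        # region_feats: (batch, num_regions, region_dim)
        regions = self.region_proj(region_feats)                  # (batch, num_regions, attr_dim)
        # Score each attribute against each region, then pool over regions so that
        # an attribute is grounded in a part of the image rather than the whole image.
        scores = regions @ self.attr_embed.weight.t()             # (batch, num_regions, num_attributes)
        attr_logits, _ = scores.max(dim=1)                        # (batch, num_attributes)
        return attr_logits


def composition_loss(answer_logits, attr_logits, answer_labels, answer_to_attrs):
    """Constrain few-shot answers with attributes learned from data-rich answers.

    answer_to_attrs: (num_answers, num_attributes) binary matrix mapping each
    answer to its attributes (built automatically when the dataset is generated).
    """
    cls_loss = F.cross_entropy(answer_logits, answer_labels)
    target_attrs = answer_to_attrs[answer_labels].float()          # attributes of the gold answer
    attr_loss = F.binary_cross_entropy_with_logits(attr_logits, target_attrs)
    return cls_loss + attr_loss
```

The two choices sketched here reflect the abstract: attribute scores are pooled over image regions so an attribute can be learned from a part of the image, and the auxiliary attribute loss ties every answer, including few-shot ones, to attributes shared with data-rich answers.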
Related papers
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Learning Concise and Descriptive Attributes for Visual Recognition [25.142065847381758]
We show that querying thousands of attributes can achieve performance competitive with image features.
We propose a novel learning-to-search method to discover those concise sets of attributes.
arXiv Detail & Related papers (2023-08-07T16:00:22Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets [5.45761450227064]
We propose a new Few-Shot Visual Question Generation (FS-VQG) task and provide a comprehensive benchmark for it.
We evaluate various existing VQG approaches as well as popular few-shot solutions based on meta-learning and self-supervised strategies for the FS-VQG task.
Several important findings emerge from our experiments, shedding light on the limits of current models in few-shot vision and language generation tasks.
arXiv Detail & Related papers (2022-10-13T15:01:15Z)
- Learning to Answer Visual Questions from Web Videos [89.71617065426146]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
arXiv Detail & Related papers (2022-05-10T16:34:26Z)
- Can I see an Example? Active Learning the Long Tail of Attributes and Relations [64.50739983632006]
We introduce a novel incremental active learning framework that asks for attributes and relations in visual scenes.
While conventional active learning methods ask for labels of specific examples, we flip this framing to allow agents to ask for examples from specific categories.
Using this framing, we introduce an active sampling method that asks for examples from the tail of the data distribution and show that it outperforms classical active learning methods on Visual Genome.
arXiv Detail & Related papers (2022-03-11T19:28:19Z)
- Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering [18.33311267792116]
We find that many of the "unknowns" to the learned VQA model are indeed "known" in the dataset implicitly.
We present a simple data augmentation pipeline SimpleAug to turn this "known" knowledge into training examples for VQA.
arXiv Detail & Related papers (2021-09-13T16:56:43Z)
- Self-Supervised VQA: Answering Visual Questions using Images and Captions [38.05223339919346]
VQA models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets for training.
We study whether models can be trained without any human-annotated Q-A pairs, but only with images and associated text captions.
arXiv Detail & Related papers (2020-12-04T01:22:05Z)
- Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we take advantage of the QA model to extract more appropriate answers, iteratively refining the data in RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.