Toloka Visual Question Answering Benchmark
- URL: http://arxiv.org/abs/2309.16511v1
- Date: Thu, 28 Sep 2023 15:18:35 GMT
- Title: Toloka Visual Question Answering Benchmark
- Authors: Dmitry Ustalov, Nikita Pavlichenko, Sergey Koshelev, Daniil Likhobaba, and Alisa Smirnova
- Abstract summary: Toloka Visual Question Answering is a new crowdsourced dataset that allows comparing the performance of machine learning systems against the human level of expertise on the grounding visual question answering task.
Our dataset contains 45,199 pairs of images and questions in English, provided with ground truth bounding boxes, split into train and two test subsets.
- Score: 7.71562336736357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present Toloka Visual Question Answering, a new
crowdsourced dataset that allows comparing the performance of machine learning
systems against the human level of expertise on the grounding visual question
answering task. In this task, given an image and a textual question, one has to
draw a bounding box around the object that correctly answers the question. Every
image-question pair is annotated with a response, and there is only one correct
response per image. Our dataset contains 45,199 pairs of images and questions in
English, provided with ground truth bounding boxes and split into a train subset
and two test subsets. Besides describing the dataset and releasing it under a
CC BY license, we conducted a series of experiments on open-source zero-shot
baseline models and organized a multi-phase competition at the WSDM Cup that
attracted 48 participants worldwide. However, by the time of paper submission,
no machine learning model had outperformed the non-expert crowdsourcing baseline
according to the intersection-over-union evaluation score.
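Since systems are ranked by their intersection-over-union score against the ground truth bounding boxes, a minimal sketch of that metric is shown below. It assumes axis-aligned boxes encoded as (x_min, y_min, x_max, y_max); the dataset's actual box format and the official evaluation script are not reproduced here.
```python
# Minimal IoU sketch: compare predicted vs. ground-truth boxes,
# each assumed to be an (x_min, y_min, x_max, y_max) tuple.

def iou(box_a, box_b):
    """Return the intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(predictions, ground_truth):
    """Average IoU over corresponding predicted and ground-truth boxes."""
    scores = [iou(p, g) for p, g in zip(predictions, ground_truth)]
    return sum(scores) / len(scores)

# Example: a prediction overlapping a quarter of the ground-truth box area.
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.143
```
A baseline's mean IoU over a test subset can then be compared directly against the non-expert crowdsourcing baseline reported in the paper.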
Related papers
- Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge [9.915564470970049]
We present our solution for the WSDM2023 Toloka Visual Question Answering Challenge.
Inspired by the application of multimodal pre-trained models, we designed a three-stage solution.
Our team achieved a score of 76.342 on the final leaderboard, ranking second.
arXiv Detail & Related papers (2024-07-05T04:56:05Z)
- Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond [93.96982273042296]
Vision-language (VL) understanding tasks evaluate models' comprehension of complex visual scenes through multiple-choice questions.
We have identified two dataset biases that models can exploit as shortcuts to resolve various VL tasks correctly without proper understanding.
We propose Adversarial Data Synthesis (ADS) to generate synthetic training and debiased evaluation data.
We then introduce Intra-sample Counterfactual Training (ICT) to assist models in utilizing the synthesized training data, particularly the counterfactual data, by focusing on intra-sample differentiation.
arXiv Detail & Related papers (2023-10-23T08:09:42Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- TQ-Net: Mixed Contrastive Representation Learning For Heterogeneous Test Questions [18.186909839033017]
Test questions (TQ) are usually heterogeneous and multi-modal; for example, some contain only text, while others contain images with information beyond their literal description.
In this paper, we first improve the previous text-only representation with a two-stage, unsupervised, instance-level contrastive pre-training method.
We then propose TQ-Net to fuse the content of images into the representation of heterogeneous data.
arXiv Detail & Related papers (2023-03-09T10:55:48Z)
- Grounding Answers for Visual Questions Asked by Visually Impaired People [16.978747012406266]
VizWiz-VQA-Grounding is the first dataset that visually grounds answers to visual questions asked by people with visual impairments.
We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different.
arXiv Detail & Related papers (2022-02-04T06:47:16Z)
- The Met Dataset: Instance-level Recognition for Artworks [19.43143591288768]
This work introduces a dataset for large-scale instance-level recognition in the domain of artworks.
We rely on the open access collection of The Met museum to form a large training set of about 224k classes.
arXiv Detail & Related papers (2022-02-03T18:13:30Z)
- Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching [53.27673119360868]
Referring expression grounding is an important and challenging task in computer vision.
We propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges.
Our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
arXiv Detail & Related papers (2022-01-18T01:13:19Z)
- MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding [131.8797942031366]
We present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text.
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.
We introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task.
arXiv Detail & Related papers (2021-12-20T18:23:30Z)
- Learning Compositional Representation for Few-shot Visual Question Answering [93.4061107793983]
Current Visual Question Answering methods perform well on answers with ample training data but have limited accuracy on novel answers with few examples.
We propose to extract the attributes from the answers with enough data, which are later composed to constrain the learning of the few-shot ones.
Experimental results on the VQA v2.0 validation dataset demonstrate the effectiveness of our proposed attribute network.
arXiv Detail & Related papers (2021-02-21T10:16:24Z)
- Visual Question Answering on Image Sets [70.4472272672716]
We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings.
Taking a natural language question and a set of images as input, it aims to answer the question based on the content of the images.
The questions can be about objects and relationships in one or more images or about the entire scene depicted by the image set.
arXiv Detail & Related papers (2020-08-27T08:03:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.