Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge
- URL: http://arxiv.org/abs/2407.04255v1
- Date: Fri, 5 Jul 2024 04:56:05 GMT
- Title: Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge
- Authors: Xiangyu Wu, Zhouyang Chi, Yang Yang, Jianfeng Lu
- Abstract summary: We present our solution for the WSDM2023 Toloka Visual Question Answering Challenge.
Inspired by the application of multimodal pre-trained models, we designed a three-stage solution.
Our team achieved a score of 76.342 on the final leaderboard, ranking second.
- Score: 9.915564470970049
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present our solution for the WSDM2023 Toloka Visual Question Answering Challenge. Inspired by the application of multimodal pre-trained models to various downstream tasks (e.g., visual question answering, visual grounding, and cross-modal retrieval), we approached this competition as a visual grounding task, where the input is an image and a question, guiding the model to answer the question and display the answer as a bounding box on the image. We designed a three-stage solution for this task. Specifically, we used the visual-language pre-trained model OFA as the foundation. In the first stage, we constructed a large-scale synthetic dataset similar to the competition dataset and coarse-tuned the model to learn generalized semantic information. In the second stage, we treated the competition task as a visual grounding task, loaded the weights from the previous stage, and continued to fine-tune the model on the competition dataset, transferring the semantic information learned in the first stage to the competition task. Finally, we designed a bounding box matching and replacing post-processing strategy to correct the model's prediction results. Our team achieved a score of 76.342 on the final leaderboard, ranking second.
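As an illustration of the post-processing step mentioned in the abstract, the sketch below shows one plausible way a predicted bounding box could be matched against a set of candidate boxes and replaced when the overlap is high. The IoU matching criterion, the 0.5 threshold, and the assumption that candidates come from an off-the-shelf detector are illustrative only; the abstract states just that a matching-and-replacing strategy is used to correct the model's predictions.

```python
# Minimal sketch of a bounding-box matching-and-replacing post-processing step.
# Assumptions (not from the paper): candidate boxes come from an external detector,
# matching uses IoU, and the replacement threshold is 0.5.

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_and_replace(pred: Box, candidates: List[Box], thresh: float = 0.5) -> Box:
    """Replace the model's predicted box with the best-overlapping candidate
    box if the overlap exceeds the threshold; otherwise keep the prediction."""
    if not candidates:
        return pred
    best = max(candidates, key=lambda c: iou(pred, c))
    return best if iou(pred, best) >= thresh else pred
```

In practice, the overlap threshold and the source of candidate boxes would have to be tuned on a validation split; the snippet only conveys the shape of such a correction step.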
Related papers
- Pseudo-triplet Guided Few-shot Composed Image Retrieval [20.130745490934597]
Composed Image Retrieval (CIR) is a challenging task that aims to retrieve the target image based on a multimodal query.
We propose a novel two-stage pseudo triplet guided few-shot CIR scheme, dubbed PTG-FSCIR.
Our scheme is plug-and-play and compatible with any existing supervised CIR models.
arXiv Detail & Related papers (2024-07-08T14:53:07Z)
- Toloka Visual Question Answering Benchmark [7.71562336736357]
Toloka Visual Question Answering is a new crowdsourced dataset that allows comparing the performance of machine learning systems against human expertise on the grounding visual question answering task.
Our dataset contains 45,199 pairs of images and questions in English, provided with ground truth bounding boxes, split into train and two test subsets.
arXiv Detail & Related papers (2023-09-28T15:18:35Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- Weakly-supervised 3D Pose Transfer with Keypoints [57.66991032263699]
Main challenges of 3D pose transfer are: 1) Lack of paired training data with different characters performing the same pose; 2) Disentangling pose and shape information from the target mesh; 3) Difficulty in applying to meshes with different topologies.
We propose a novel weakly-supervised keypoint-based framework to overcome these difficulties.
arXiv Detail & Related papers (2023-07-25T12:40:24Z)
- Zero-shot Visual Question Answering with Language Model Feedback [83.65140324876536]
We propose a language-model-guided captioning approach, LAMOC, for knowledge-based visual question answering (VQA).
Our approach uses the captions generated by a captioning model as the context of an answer prediction model, which is a pre-trained language model (PLM).
arXiv Detail & Related papers (2023-05-26T15:04:20Z)
- Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching [53.27673119360868]
Referring expression grounding is an important and challenging task in computer vision.
We propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges.
Our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
arXiv Detail & Related papers (2022-01-18T01:13:19Z)
- A Better Loss for Visual-Textual Grounding [74.81353762517979]
Given a textual phrase and an image, the visual grounding problem is defined as the task of locating the content of the image referenced by the sentence.
It is a challenging task that has several real-world applications in human-computer interaction, image-text reference resolution, and video-text reference resolution.
We propose a model that is able to achieve a higher accuracy than state-of-the-art models thanks to the adoption of a more effective loss function.
arXiv Detail & Related papers (2021-08-11T16:26:54Z)
- An Empirical Study of Vehicle Re-Identification on the AI City Challenge [19.13038665501964]
Track 2 is a vehicle re-identification (ReID) task with both real-world and synthetic data.
In this challenge, we mainly focus on four points: training data, unsupervised domain-adaptive (UDA) training, post-processing, and model ensembling.
With the aforementioned techniques, our method finally achieves a 0.7445 mAP score, yielding first place in the competition.
arXiv Detail & Related papers (2021-05-20T12:20:52Z)
- Dealing with Missing Modalities in the Visual Question Answer-Difference Prediction Task through Knowledge Distillation [75.1682163844354]
We address the issues of missing modalities that arise in the Visual Question Answer-Difference prediction task.
We introduce a model, the "Big" Teacher, that takes the image/question/answer triplet as its input and outperforms the baseline.
arXiv Detail & Related papers (2021-04-13T06:41:11Z)
- ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data [9.3935916515127]
We introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding.
Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them.
arXiv Detail & Related papers (2020-01-22T11:35:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences.