Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic
Reasoning Task 2023
- URL: http://arxiv.org/abs/2310.06440v1
- Date: Tue, 10 Oct 2023 09:12:27 GMT
- Title: Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic
Reasoning Task 2023
- Authors: Xiangyu Wu, Yang Yang, Shengdong Xu, Yifeng Wu, Qingguo Chen, Jianfeng
Lu
- Abstract summary: We present our solution to a Multi-modal Algorithmic Reasoning Task: SMART-101 Challenge.
This challenge evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuolinguistic puzzles.
Under the puzzle splits configuration, we achieved an accuracy score of 26.5 on the validation set and 24.30 on the private test set.
- Score: 13.326745559876558
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present our solution to the Multi-modal Algorithmic Reasoning Task: SMART-101 Challenge. Unlike traditional visual question-answering datasets, this challenge evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuolinguistic puzzles designed specifically for children in the 6-8 age group. We employed a divide-and-conquer approach. At the data level, inspired by the challenge paper, we categorized all questions into eight types and used the LLaMA-2-chat model to generate the type of each question directly in a zero-shot manner. Additionally, we trained a YOLOv7 model on the icon45 dataset for object detection and combined it with an OCR method to recognize and locate objects and text within the images. At the model level, we used the BLIP-2 model and added eight adapters to its ViT-G image encoder to adaptively extract visual features for the different question types. We fed pre-constructed question templates as input and generated answers with the Flan-T5-XXL decoder. Under the puzzle splits configuration, we achieved an accuracy score of 26.5 on the validation set and 24.30 on the private test set.
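To make the model-level design concrete, below is a minimal PyTorch sketch of how eight question-type-specific adapters could be attached to a frozen ViT-G block. The abstract only states that eight adapters are added to the image encoder, so the bottleneck design, hidden size, and hard routing by predicted type are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

NUM_QUESTION_TYPES = 8  # question types predicted zero-shot by LLaMA-2-chat

class BottleneckAdapter(nn.Module):
    """Down-project, non-linearity, up-project, with a residual connection."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class TypeRoutedLayer(nn.Module):
    """Wraps one frozen ViT block and routes its output through the adapter
    that matches the predicted question type."""
    def __init__(self, vit_block: nn.Module, dim: int):
        super().__init__()
        self.vit_block = vit_block
        self.adapters = nn.ModuleList(
            BottleneckAdapter(dim) for _ in range(NUM_QUESTION_TYPES)
        )

    def forward(self, tokens: torch.Tensor, question_type: int) -> torch.Tensor:
        tokens = self.vit_block(tokens)              # frozen backbone features
        return self.adapters[question_type](tokens)  # type-specific adaptation

# Toy usage: nn.Identity() stands in for a real ViT-G block (hidden size 1408).
layer = TypeRoutedLayer(nn.Identity(), dim=1408)
out = layer(torch.randn(1, 257, 1408), question_type=3)
```

In such a routing scheme only the adapters would be trained while the backbone stays frozen; the abstract does not say whether the authors freeze the backbone, so treat that detail as an assumption as well.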
Related papers
- Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024 [8.588965648810483]
This paper presents the solution of HYU MLLAB KT Team to the Multimodal Algorithmic Reasoning Task: SMART-101 CVPR 2024 Challenge.
The challenge aims to achieve human-level multimodal understanding by tackling complex visio-linguistic puzzles designed for children in the 6-8 age group.
Under the puzzle split configuration, we achieved an option selection accuracy (Oacc) of 29.5 on the test set and a weighted option selection accuracy (WOSA) of 27.1 on the challenge set.
arXiv Detail & Related papers (2024-06-10T01:45:55Z)
- Integrating Text and Image Pre-training for Multi-modal Algorithmic Reasoning [7.84845040922464]
We present our solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024.
Unlike traditional visual question-answering tasks, this challenge evaluates the abstraction, deduction, and generalization abilities of neural networks.
Our model is based on two pre-trained models, dedicated to extracting features from text and images, respectively.
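As an illustration of that dual-encoder pattern, here is a hedged sketch: the summary does not name the two pre-trained models or the fusion scheme, so the encoders are left abstract, and the concatenation-plus-linear-head fusion and the five answer options are assumptions.

```python
import torch
import torch.nn as nn

class DualEncoderClassifier(nn.Module):
    """Fuses independently extracted text and image features to score answer
    options (SMART-101 puzzles provide five candidate answers)."""
    def __init__(self, text_encoder: nn.Module, image_encoder: nn.Module,
                 text_dim: int, image_dim: int, num_options: int = 5):
        super().__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        self.head = nn.Linear(text_dim + image_dim, num_options)

    def forward(self, text_inputs: torch.Tensor, image_inputs: torch.Tensor) -> torch.Tensor:
        t = self.text_encoder(text_inputs)    # (batch, text_dim)
        v = self.image_encoder(image_inputs)  # (batch, image_dim)
        return self.head(torch.cat([t, v], dim=-1))

# Toy usage with identity "encoders" standing in for the real pre-trained models.
model = DualEncoderClassifier(nn.Identity(), nn.Identity(), text_dim=768, image_dim=1024)
logits = model(torch.randn(2, 768), torch.randn(2, 1024))
```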
arXiv Detail & Related papers (2024-06-08T01:45:06Z)
- A Vanilla Multi-Task Framework for Dense Visual Prediction Solution to 1st VCL Challenge -- Multi-Task Robustness Track [31.754017006309564]
We propose a framework named UniNet that seamlessly combines various visual perception algorithms into a multi-task model.
Specifically, we choose DETR3D, Mask2Former, and BinsFormer for 3D object detection, instance segmentation, and depth estimation tasks.
The final submission is a single model with an InternImage-L backbone and achieves a 49.6 overall score.
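A rough sketch of the shared-backbone, multi-head pattern the summary describes; the placeholder backbone and heads below merely stand in for InternImage-L, DETR3D, Mask2Former, and BinsFormer.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """One shared backbone feeding several task-specific heads."""
    def __init__(self, backbone: nn.Module, heads: dict):
        super().__init__()
        self.backbone = backbone           # InternImage-L in the paper
        self.heads = nn.ModuleDict(heads)  # detection / segmentation / depth

    def forward(self, images: torch.Tensor) -> dict:
        feats = self.backbone(images)      # shared features for every task
        return {name: head(feats) for name, head in self.heads.items()}

# Toy usage: simple conv layers stand in for the real task decoders.
model = MultiTaskModel(
    backbone=nn.Conv2d(3, 64, 3, padding=1),
    heads={"det3d": nn.Conv2d(64, 10, 1),
           "seg": nn.Conv2d(64, 21, 1),
           "depth": nn.Conv2d(64, 1, 1)},
)
outputs = model(torch.randn(1, 3, 224, 224))
```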
arXiv Detail & Related papers (2024-02-27T08:51:20Z)
- Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models [58.57279229066477]
We study how language models (LMs) solve retrieval tasks in diverse situations.
We introduce ORION, a collection of structured retrieval tasks spanning six domains.
We find that LMs internally decompose retrieval tasks in a modular way.
arXiv Detail & Related papers (2023-12-13T18:36:43Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
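The summary does not specify the perturbations, so the sketch below shows one plausible reading only: random word dropout on the question and additive Gaussian noise on the image.

```python
import random
import torch

def perturb_question(question: str, drop_prob: float = 0.15) -> str:
    """Randomly drops words from the question (one possible text perturbation)."""
    words = question.split()
    kept = [w for w in words if random.random() > drop_prob]
    return " ".join(kept) if kept else question

def perturb_image(image: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Adds Gaussian noise to an image tensor in [0, 1] (one possible image perturbation)."""
    return (image + sigma * torch.randn_like(image)).clamp(0.0, 1.0)

# Toy usage: perturb either modality of a VQA sample.
q = perturb_question("How many red blocks are on the table?")
img = perturb_image(torch.rand(3, 224, 224))
```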
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Are Deep Neural Networks SMARTer than Second Graders? [85.60342335636341]
We evaluate the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed for children in the 6--8 age group.
Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and solving it requires a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning.
Experiments reveal that while powerful deep models offer reasonable performance on the puzzles in a supervised setting, they are no better than random accuracy when analyzed for generalization.
arXiv Detail & Related papers (2022-12-20T04:33:32Z)
- Exploring Multi-Modal Representations for Ambiguity Detection & Coreference Resolution in the SIMMC 2.0 Challenge [60.616313552585645]
We present models for effective Ambiguity Detection and Coreference Resolution in Conversational AI.
Specifically, we use TOD-BERT and LXMERT based models, compare them to a number of baselines and provide ablation experiments.
Our results show that (1) language models are able to exploit correlations in the data to detect ambiguity; and (2) unimodal coreference resolution models can avoid the need for a vision component.
arXiv Detail & Related papers (2022-02-25T12:10:02Z)
- Rotation Invariance and Extensive Data Augmentation: a strategy for the Mitosis Domain Generalization (MIDOG) Challenge [1.52292571922932]
We present the strategy we applied to participate in the MIDOG 2021 competition.
The purpose of the competition was to evaluate the generalization of solutions to images acquired with unseen target scanners.
We propose a solution based on a combination of state-of-the-art deep learning methods.
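A minimal sketch of what rotation-heavy augmentation could look like with torchvision; the exact transform set and parameters used by the authors are not given in the summary, so everything below is illustrative.

```python
from torchvision import transforms

# Rotation-heavy training augmentation (parameters are illustrative).
train_augment = transforms.Compose([
    transforms.RandomRotation(degrees=180),       # arbitrary in-plane rotations
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2, 0.05),  # mimic stain/scanner variation
    transforms.ToTensor(),
])
```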
arXiv Detail & Related papers (2021-09-02T10:09:02Z)
- Question Answering Infused Pre-training of General-Purpose Contextualized Representations [70.62967781515127]
We propose a pre-training objective based on question answering (QA) for learning general-purpose contextual representations.
We accomplish this goal by training a bi-encoder QA model, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoder model.
We show large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection.
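A hedged sketch of the bi-encoder idea described above: passages and questions are encoded independently, scored by a dot product, and trained to match scores from a stronger cross-encoder teacher. The roberta-base checkpoints and the MSE objective are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
passage_enc = AutoModel.from_pretrained("roberta-base")   # independent passage encoder
question_enc = AutoModel.from_pretrained("roberta-base")  # independent question encoder

def encode(model, texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0]  # first-token embedding

def bi_encoder_scores(passages, questions):
    p, q = encode(passage_enc, passages), encode(question_enc, questions)
    return (p * q).sum(dim=-1)  # dot-product relevance scores

# The targets would be scores produced by a stronger cross-encoder teacher.
teacher_scores = torch.tensor([0.9])  # placeholder value
loss = nn.MSELoss()(bi_encoder_scores(["a passage"], ["a question"]), teacher_scores)
loss.backward()
```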
arXiv Detail & Related papers (2021-06-15T14:45:15Z)
- Self-supervised Learning with Fully Convolutional Networks [24.660086792201263]
We focus on the problem of learning representation from unlabeled data for semantic segmentation.
Inspired by two patch-based methods, we develop a novel self-supervised learning framework.
We achieve a 5.8 percentage point improvement over the baseline model.
arXiv Detail & Related papers (2020-12-18T02:31:28Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
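As a small illustration of frame-selection gating, the sketch below scores each frame feature and softly down-weights irrelevant frames; the paper's actual dual-level attention and self/cross-integration modules are more involved than this assumption-level example.

```python
import torch
import torch.nn as nn

class FrameSelectionGate(nn.Module):
    """Scores each frame and softly suppresses temporally irrelevant ones."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim)
        gate = torch.sigmoid(self.scorer(frame_feats))  # (batch, num_frames, 1)
        return frame_feats * gate                       # gated frame features

# Toy usage: 16 frames of 512-dim features.
gated = FrameSelectionGate(dim=512)(torch.randn(2, 16, 512))
```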
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.