Winning the ICCV'2021 VALUE Challenge: Task-aware Ensemble and Transfer
Learning with Visual Concepts
- URL: http://arxiv.org/abs/2110.06476v1
- Date: Wed, 13 Oct 2021 03:50:07 GMT
- Title: Winning the ICCV'2021 VALUE Challenge: Task-aware Ensemble and Transfer
Learning with Visual Concepts
- Authors: Minchul Shin, Jonghwan Mun, Kyoung-Woon On, Woo-Young Kang, Gunsoo
Han, Eun-Sol Kim
- Abstract summary: The VALUE (Video-And-Language Understanding Evaluation) benchmark is newly introduced to evaluate and analyze multi-modal representation learning algorithms.
The main objective of the VALUE challenge is to train a task-agnostic model that is simultaneously applicable to various tasks with different characteristics.
This technical report describes our winning strategies for the VALUE challenge: 1) single model optimization, 2) transfer learning with visual concepts, and 3) task-aware ensemble.
- Score: 20.412239939287886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The VALUE (Video-And-Language Understanding Evaluation) benchmark is newly
introduced to evaluate and analyze multi-modal representation learning
algorithms on three video-and-language tasks: Retrieval, QA, and Captioning.
The main objective of the VALUE challenge is to train a task-agnostic model
that is simultaneously applicable to various tasks with different
characteristics. This technical report describes our winning strategies for the
VALUE challenge: 1) single model optimization, 2) transfer learning with visual
concepts, and 3) task-aware ensemble. The first and third strategies are
designed to address heterogeneous characteristics of each task, and the second
one is to leverage rich and fine-grained visual information. We provide a
detailed and comprehensive analysis with extensive experimental results. Based
on our approach, we ranked first in both the VALUE and QA phases of the
competition.
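
As a rough illustration of the third strategy, the sketch below shows one common way a task-aware ensemble can be realized: per-model prediction scores are blended with weights tuned separately for each task against that task's validation metric. This is a minimal Python sketch with hypothetical names (ensemble, greedy_task_aware_weights, metric_fn), assuming each candidate model emits a score matrix per task; it is not the authors' released implementation.

import numpy as np

def ensemble(score_list, weights):
    # Weighted sum of per-model score matrices (e.g. video-text similarity
    # scores for retrieval, or answer logits for QA).
    return sum(w * s for w, s in zip(weights, score_list))

def greedy_task_aware_weights(score_list, metric_fn, grid=np.linspace(0.0, 1.0, 11)):
    # Greedy coordinate search for one weight per model on a single task,
    # maximizing that task's validation metric (e.g. Recall@1 or accuracy).
    weights = [1.0] * len(score_list)
    best = metric_fn(ensemble(score_list, weights))
    for i in range(len(score_list)):
        for w in grid:
            trial = list(weights)
            trial[i] = float(w)
            value = metric_fn(ensemble(score_list, trial))
            if value > best:
                best, weights = value, trial
    return weights, best

Because the weights are searched independently for each task (Retrieval, QA, Captioning), a model that helps one task but hurts another can be down-weighted only where it hurts, which is what makes such an ensemble task-aware.
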
Related papers
- QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View [2.3982875575861677]
We present our solutions for a spectrum of automation tasks in life-saving intervention procedures within the Trauma THOMPSON (T3) Challenge.
For action recognition and anticipation, we propose a pre-processing strategy that samples and stitches multiple inputs into a single image.
For training, we present an action dictionary-guided design, which consistently yields the most favorable results.
arXiv Detail & Related papers (2024-07-18T06:55:26Z)
- Affective Behavior Analysis using Task-adaptive and AU-assisted Graph Network [18.304164382834617]
We present our solution and experimental results for the Multi-Task Learning Challenge of the 7th Affective Behavior Analysis in-the-wild (ABAW7) Competition.
This challenge consists of three tasks: action unit detection, facial expression recognition, and valence-arousal estimation.
arXiv Detail & Related papers (2024-07-16T12:33:22Z)
- Large Vision-Language Models as Emotion Recognizers in Context Awareness [14.85890824622433]
Context-aware emotion recognition (CAER) is a complex and significant task that requires perceiving emotions from various contextual cues.
Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images.
This paper systematically explores the potential of leveraging Large Vision-Language Models (LVLMs) to empower the CAER task.
arXiv Detail & Related papers (2024-07-16T01:28:06Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey [59.95153883166705]
Traditional computer vision generally solves each task independently with a dedicated model, in which the task instruction is implicitly designed into the model architecture.
Visual Instruction Tuning (VIT), which finetunes a large vision model with language as task instructions, has been intensively studied recently.
This work aims to provide a systematic review of visual instruction tuning, covering (1) the background that presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT that introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; and (3) the commonly used datasets in visual instruction tuning and evaluation.
arXiv Detail & Related papers (2023-12-27T14:54:37Z)
- Multitask Multimodal Prompted Training for Interactive Embodied Task Completion [48.69347134411864]
Embodied MultiModal Agent (EMMA) is a unified encoder-decoder model that reasons over images and trajectories.
By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks.
arXiv Detail & Related papers (2023-11-07T15:27:52Z)
- Towards Task Sampler Learning for Meta-Learning [37.02030832662183]
Meta-learning aims to learn general knowledge from diverse training tasks drawn from limited data, and then transfer it to new tasks.
It is commonly believed that increasing task diversity will enhance the generalization ability of meta-learning models.
This paper challenges this view through empirical and theoretical analysis.
arXiv Detail & Related papers (2023-07-18T01:53:18Z)
- Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective [93.56647950778357]
Blind image quality assessment (BIQA) predicts the human perception of image quality without any reference information.
We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks.
arXiv Detail & Related papers (2023-03-27T07:58:09Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains results similar to or even slightly better than those of human beings on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
arXiv Detail & Related papers (2021-11-17T04:25:11Z)