FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion
Tasks
- URL: http://arxiv.org/abs/2303.02483v1
- Date: Sat, 4 Mar 2023 19:07:48 GMT
- Title: FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion
Tasks
- Authors: Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, Tao Xiang
- Abstract summary: We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
- Score: 129.49630356651454
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the fashion domain, there exists a variety of vision-and-language (V+L)
tasks, including cross-modal retrieval, text-guided image retrieval,
multi-modal classification, and image captioning. They differ drastically in
each individual input/output format and dataset size. It has been common to
design a task-specific model and fine-tune it independently from a pre-trained
V+L model (e.g., CLIP). This results in parameter inefficiency and inability to
exploit inter-task relatedness. To address such issues, we propose a novel
FAshion-focused Multi-task Efficient learning method for Vision-and-Language
tasks (FAME-ViL) in this work. Compared with existing approaches, FAME-ViL
applies a single model for multiple heterogeneous fashion tasks, therefore
being much more parameter-efficient. It is enabled by two novel components: (1)
a task-versatile architecture with cross-attention adapters and task-specific
adapters integrated into a unified V+L model, and (2) a stable and effective
multi-task training strategy that supports learning from heterogeneous data and
prevents negative transfer. Extensive experiments on four fashion tasks show
that our FAME-ViL can save 61.5% of parameters over alternatives, while
significantly outperforming the conventional independently trained single-task
models. Code is available at https://github.com/BrandonHanx/FAME-ViL.
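As a rough sketch of component (1) above, the Python snippet below shows how lightweight per-task bottleneck adapters can be wrapped around a shared V+L backbone block so that a single model serves several heterogeneous tasks. It is a minimal illustration under our own assumptions (the module names, the bottleneck size, the placeholder task names, and nn.Identity standing in for a real CLIP block), not the released FAME-ViL implementation; the cross-attention adapters and the actual multi-task optimization strategy are omitted here, and the linked repository contains the real code.

```python
import torch
import torch.nn as nn


class TaskSpecificAdapter(nn.Module):
    """Bottleneck adapter applied after a shared sub-layer (illustrative)."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual form: the shared representation passes through unchanged,
        # and each task learns only a small correction on top of it.
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """One shared transformer block wrapped with one adapter per task."""

    def __init__(self, shared_block: nn.Module, hidden_dim: int, tasks):
        super().__init__()
        self.shared_block = shared_block  # e.g. a CLIP encoder layer shared by all tasks
        self.adapters = nn.ModuleDict(
            {name: TaskSpecificAdapter(hidden_dim) for name in tasks}
        )

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        # Route through the shared block, then through the adapter of the
        # task currently being trained or evaluated.
        return self.adapters[task](self.shared_block(x))


# Toy usage: four placeholder fashion tasks share one block; only the small
# adapters differ, so parameters grow slowly with the number of tasks.
tasks = ["retrieval", "tgir", "classification", "captioning"]
block = AdaptedBlock(nn.Identity(), hidden_dim=512, tasks=tasks)
tokens = torch.randn(2, 16, 512)  # (batch, sequence, hidden)
out = block(tokens, task="captioning")
```

Because each adapter adds only two small linear layers per task instead of duplicating the whole backbone, the per-task overhead stays small, which is the intuition behind the parameter savings reported in the abstract.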
Related papers
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z)
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can serve as a unified interface for handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model (a toy sketch of this idea follows at the end of this list).
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z)
- Foundation Model is Efficient Multimodal Multitask Model Selector [47.017463595702274]
A brute-force approach is to fine-tune all models on all target datasets, which incurs high computational costs.
We propose an efficient multi-task model selector (EMMS) to transform diverse label formats into a unified noisy label embedding.
EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method in the multi-task scenario.
arXiv Detail & Related papers (2023-08-11T17:54:44Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts [75.75548749888029]
We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks.
With a single model, Musketeer achieves results comparable to or better than strong baselines trained on single tasks, almost uniformly across multiple tasks.
arXiv Detail & Related papers (2023-05-11T17:57:49Z)
- OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z)
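The MiniGPT-v2 entry above mentions training a single model with unique task identifiers. The snippet below is a toy sketch of that general idea; the identifier strings and the prompt template are our assumptions for illustration, not the paper's exact format.

```python
# Hypothetical task-identifier prompting: a short task token is prepended to
# each instruction so one model can disambiguate heterogeneous V+L tasks.
# The identifiers and template below are illustrative assumptions.
TASK_IDS = {"vqa": "[vqa]", "grounding": "[grounding]", "caption": "[caption]"}


def build_prompt(task: str, instruction: str) -> str:
    """Prepend a task identifier token to the textual instruction."""
    return f"{TASK_IDS[task]} {instruction}"


print(build_prompt("vqa", "What color is the dress in the image?"))
# -> "[vqa] What color is the dress in the image?"
```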
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.