TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types
- URL: http://arxiv.org/abs/2502.09925v1
- Date: Fri, 14 Feb 2025 05:32:46 GMT
- Title: TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types
- Authors: Jiankang Chen, Tianke Zhang, Changyi Liu, Haojie Ding, Yaya Shi, Feng Cheng, Huihui Xiao, Bin Wen, Fan Yang, Tingting Gao, Di Zhang
- Abstract summary: Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling. We propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples.
- Score: 8.755996117965571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. However, their performance is often limited by insufficient task-specific data, leading to poor generalization and biased outputs. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling, which typically produces only a few hundred task types. To address this, we propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples. TaskGalaxy utilizes GPT-4o to enrich task diversity by expanding from a small set of manually defined tasks, with CLIP and GPT-4o filtering those that best match open-source images, and generating relevant question-answer pairs. Multiple models are employed to ensure sample quality. This automated process enhances both task diversity and data quality, reducing manual intervention. Incorporating TaskGalaxy into LLaVA-v1.5 and InternVL-Chat-v1.0 models shows substantial performance improvements across 16 benchmarks, demonstrating the critical importance of task diversity. TaskGalaxy is publicly released at https://github.com/Kwai-YuanQi/TaskGalaxy.
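The abstract states that CLIP and GPT-4o filter the task types that best match each image, but gives no implementation details. The following is a minimal sketch of only the embedding-similarity shortlist idea, under stated assumptions: the task names, the toy 3-dimensional vectors, and the `top_k_tasks` helper are all hypothetical, and a real pipeline would obtain the vectors from a CLIP image and text encoder. The GPT-4o filtering and question-answer generation steps are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against zero vectors to avoid division by zero.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_tasks(image_vec, task_vecs, k=2):
    """Return the k task names whose embeddings are closest to the image."""
    scored = [(cosine(image_vec, vec), name) for name, vec in task_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical task-type embeddings (stand-ins for CLIP text embeddings).
task_vecs = {
    "OCR~text recognition": [0.9, 0.1, 0.0],
    "Chart~trend analysis": [0.1, 0.9, 0.1],
    "Scene~object counting": [0.0, 0.2, 0.9],
}

# Hypothetical image embedding (stand-in for a CLIP image embedding).
image_vec = [0.8, 0.3, 0.1]

candidates = top_k_tasks(image_vec, task_vecs, k=2)
print(candidates)
```

In this toy setup the shortlist keeps the two task types most similar to the image; a later stage, such as the GPT-4o filter the abstract mentions, could then prune that shortlist further.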
Related papers
- TADS: Task-Aware Data Selection for Multi-Task Multimodal Pre-Training [29.962039479618543]
We introduce TADS (Task-Aware Data Selection), a novel framework for multi-task multimodal pre-training. TADS integrates Intrinsic Quality, Task Relevance, and Distributional Diversity into a learnable value function. A feedback-driven meta-learning mechanism adaptively refines the selection strategy based on proxy model performance.
arXiv Detail & Related papers (2026-02-05T03:08:45Z)
- OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing [45.539561363519844]
We introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology. We generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.
arXiv Detail & Related papers (2025-09-29T15:11:09Z)
- MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation [31.21163360113923]
MM-Gen is a scalable method that generates task-specific, high-quality synthetic text for candidate images. Fine-tuning VLMs with data generated by MM-Gen leads to significant performance gains. Compared to human-curated caption data, MM-Gen achieves up to 1.6x better improvements.
arXiv Detail & Related papers (2025-01-07T21:55:56Z)
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Task Preference Optimization (TPO) is a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks.
By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance.
Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models.
arXiv Detail & Related papers (2024-12-26T18:56:05Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- Prompt Tuning with Soft Context Sharing for Vision-Language Models [42.61889428498378]
We propose a novel method to tune pre-trained vision-language models on multiple target few-shot tasks jointly.
We show that SoftCPT significantly outperforms single-task prompt tuning methods.
arXiv Detail & Related papers (2022-08-29T10:19:10Z)
- Task Adaptive Parameter Sharing for Multi-Task Learning [114.80350786535952]
Task Adaptive Parameter Sharing (TAPS) is a method for tuning a base model to a new task by adaptively modifying a small, task-specific subset of layers.
Compared to other methods, TAPS retains high accuracy on downstream tasks while introducing few task-specific parameters.
We evaluate our method on a suite of fine-tuning tasks and architectures (ResNet, DenseNet, ViT) and show that it achieves state-of-the-art performance while being simple to implement.
arXiv Detail & Related papers (2022-03-30T23:16:07Z)
- XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation [80.18830380517753]
We develop a new task-agnostic distillation framework XtremeDistilTransformers.
We study the transferability of several source tasks, augmentation resources and model architecture for distillation.
arXiv Detail & Related papers (2021-06-08T17:49:33Z)
- HyperGrid: Efficient Multi-Task Transformers with Grid-wise Decomposable Hyper Projections [96.64246471034195]
We propose HyperGrid, a new approach for highly effective multi-task learning.
Our method helps bridge the gap between fine-tuning and multi-task learning approaches.
arXiv Detail & Related papers (2020-07-12T02:49:16Z)
- Using a thousand optimization tasks to learn hyperparameter search strategies [53.318615663332274]
We present TaskSet, a dataset of tasks for use in training and evaluating optimizers.
TaskSet is unique in its size and diversity, containing over a thousand tasks ranging from image classification with fully connected or convolutional networks, to variational autoencoders, to non-volume preserving flows on a variety of datasets.
arXiv Detail & Related papers (2020-02-27T02:49:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.