Related papers: TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types

URL: http://arxiv.org/abs/2502.09925v1
Date: Fri, 14 Feb 2025 05:32:46 GMT
Title: TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types
Authors: Jiankang Chen, Tianke Zhang, Changyi Liu, Haojie Ding, Yaya Shi, Feng Cheng, Huihui Xiao, Bin Wen, Fan Yang, Tingting Gao, Di Zhang
Abstract summary: Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling. We propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples.
Score: 8.755996117965571
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. However, their performance is often limited by insufficient task-specific data, leading to poor generalization and biased outputs. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling, which typically produces only a few hundred task types. To address this, we propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples. TaskGalaxy utilizes GPT-4o to enrich task diversity by expanding from a small set of manually defined tasks, with CLIP and GPT-4o filtering those that best match open-source images, and generating relevant question-answer pairs. Multiple models are employed to ensure sample quality. This automated process enhances both task diversity and data quality, reducing manual intervention. Incorporating TaskGalaxy into LLaVA-v1.5 and InternVL-Chat-v1.0 models shows substantial performance improvements across 16 benchmarks, demonstrating the critical importance of task diversity. TaskGalaxy is publicly released at https://github.com/Kwai-YuanQi/TaskGalaxy.
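The task-to-image matching step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that candidate task types and images have already been embedded into a shared vector space (as a CLIP-style encoder would do) and keeps the top-k most similar task types per image by cosine similarity. The function names, the task names, the toy 3-dimensional vectors, and the value of k are all hypothetical and are not taken from the paper or its released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_tasks(image_vec, task_vecs, k=2):
    """Return the k task names whose embeddings are closest to the image embedding."""
    ranked = sorted(
        task_vecs.items(),
        key=lambda item: cosine(image_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings standing in for real text-encoder outputs (hypothetical values).
task_vecs = {
    "OCR~text recognition": [1.0, 0.0, 0.0],
    "Detection~object counting": [0.0, 1.0, 0.0],
    "Chart~trend analysis": [0.0, 0.0, 1.0],
}

# Toy embedding standing in for a real image-encoder output (hypothetical values).
image_vec = [0.9, 0.4, 0.1]

candidates = top_k_tasks(image_vec, task_vecs, k=2)
print(candidates)
```

In a pipeline like the one the abstract describes, a shortlist of this kind would then be passed to a stronger model for a second filtering pass before question-answer generation; that stage is not shown here.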

Related papers

TADS: Task-Aware Data Selection for Multi-Task Multimodal Pre-Training [29.962039479618543]
We introduce TADS (Task-Aware Data Selection), a novel framework for multi-task multimodal pre-training. TADS integrates Intrinsic Quality, Task Relevance, and Distributional Diversity into a learnable value function. A feedback-driven meta-learning mechanism adaptively refines the selection strategy based on proxy model performance.
arXiv Detail & Related papers (2026-02-05T03:08:45Z)
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing [45.539561363519844]
We introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology. We generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.
arXiv Detail & Related papers (2025-09-29T15:11:09Z)
MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation [31.21163360113923]
MM-Gen is a scalable method that generates task-specific, high-quality synthetic text for candidate images. Fine-tuning VLMs with data generated by MM-Gen leads to significant performance gains. Compared to human-curated caption data, MM-Gen achieves up to 1.6x better improvements.
arXiv Detail & Related papers (2025-01-07T21:55:56Z)
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Task Preference Optimization (TPO) is a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models.
arXiv Detail & Related papers (2024-12-26T18:56:05Z)
An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently. Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL). Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
Prompt Tuning with Soft Context Sharing for Vision-Language Models [42.61889428498378]
We propose a novel method to tune pre-trained vision-language models on multiple target few-shot tasks jointly. We show that SoftCPT significantly outperforms single-task prompt tuning methods.
arXiv Detail & Related papers (2022-08-29T10:19:10Z)
Task Adaptive Parameter Sharing for Multi-Task Learning [114.80350786535952]
Task Adaptive Parameter Sharing (TAPS) is a method for tuning a base model to a new task by adaptively modifying a small, task-specific subset of layers. Compared to other methods, TAPS retains high accuracy on downstream tasks while introducing few task-specific parameters. We evaluate our method on a suite of fine-tuning tasks and architectures (ResNet, DenseNet, ViT) and show that it achieves state-of-the-art performance while being simple to implement.
arXiv Detail & Related papers (2022-03-30T23:16:07Z)
XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation [80.18830380517753]
We develop a new task-agnostic distillation framework XtremeDistilTransformers. We study the transferability of several source tasks, augmentation resources and model architecture for distillation.
arXiv Detail & Related papers (2021-06-08T17:49:33Z)
HyperGrid: Efficient Multi-Task Transformers with Grid-wise Decomposable Hyper Projections [96.64246471034195]
We propose HyperGrid, a new approach for highly effective multi-task learning. Our method helps bridge the gap between fine-tuning and multi-task learning approaches.
arXiv Detail & Related papers (2020-07-12T02:49:16Z)
Using a thousand optimization tasks to learn hyperparameter search strategies [53.318615663332274]
We present TaskSet, a dataset of tasks for use in training and evaluating optimizers. TaskSet is unique in its size and diversity, containing over a thousand tasks ranging from image classification with fully connected or convolutional networks, to variational autoencoders, to non-volume preserving flows on a variety of datasets.
arXiv Detail & Related papers (2020-02-27T02:49:10Z)