Concepts or Skills? Rethinking Instruction Selection for Multi-modal Models
- URL: http://arxiv.org/abs/2508.10339v1
- Date: Thu, 14 Aug 2025 04:48:38 GMT
- Title: Concepts or Skills? Rethinking Instruction Selection for Multi-modal Models
- Authors: Andrew Bai, Justin Cui, Ruochen Wang, Cho-Jui Hsieh
- Abstract summary: Vision-language instruction tuning achieves two main purposes: learning visual concepts and learning visual skills. Inspired by this discovery, we designed a simple targeted training data selection method to optimize the performance on a given benchmark.
- Score: 54.829219574424634
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Vision-language instruction tuning achieves two main purposes: learning visual concepts and learning visual skills. In this paper, we found that vision-language benchmarks fall into a dichotomy: each mainly benefits from training on instructions with similar skills or with similar visual concepts. Inspired by this discovery, we designed a simple targeted training data selection method to optimize performance on a given benchmark. We first extract the concepts/skills from the benchmark, determine whether the benchmark predominantly benefits from similar concepts or skills, and finally select the instructions with the best-matching concepts/skills. Experiments on 10+ benchmarks validate the effectiveness of our targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Our findings underscore the importance of recognizing the inherent trade-off within instruction selection, which requires balancing the acquisition of conceptual knowledge against visual skills.
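The selection pipeline described above is simple enough to sketch. Below is a minimal, hypothetical Python illustration of the matching step; the hash-based text embedding and the `skills`/`concepts` annotation fields are placeholder assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def text_embedding(text, dim=256):
    """Toy bag-of-words hash embedding standing in for whatever
    encoder the method actually uses (an assumption for this sketch)."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def select_instructions(pool, benchmark, k, prefers_skills):
    """Final step of the pipeline: rank training instructions by similarity
    of their skill or concept annotations to the benchmark's, keep the top-k."""
    field = "skills" if prefers_skills else "concepts"
    bench_vec = text_embedding(benchmark[field])
    sims = np.array([text_embedding(inst[field]) @ bench_vec for inst in pool])
    return [pool[i] for i in np.argsort(-sims)[:k]]

# Illustrative use with made-up annotations:
pool = [
    {"skills": "read text in images", "concepts": "street signs"},
    {"skills": "count objects", "concepts": "kitchen scenes"},
]
bench = {"skills": "read text in documents", "concepts": "scanned pages"}
print(select_instructions(pool, bench, k=1, prefers_skills=True))
```

Cosine similarity over normalized embeddings is used here only as a plausible matching criterion; any similarity measure over the extracted concept/skill annotations would slot into the same structure.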
Related papers
- Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment [2.3735961220736423]
We introduce Perceptual-Initialization (PI), a paradigm shift in visual representation learning. Our approach demonstrates significant zero-shot performance improvements without task-specific fine-tuning. Our work shows that "beginning with you", starting with human perception, provides a stronger foundation for general-purpose vision-language intelligence.
arXiv Detail & Related papers (2025-05-20T11:04:14Z)
- A Benchmark for Fairness-Aware Graph Learning [58.515305543487386]
We present an extensive benchmark of ten representative fairness-aware graph learning methods.
Our in-depth analysis reveals key insights into the strengths and limitations of existing methods.
arXiv Detail & Related papers (2024-07-16T18:43:43Z)
- Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose TIVE, a high-value data selection approach that eliminates redundancy within the visual instruction data and reduces training cost.
Using only about 15% of the data, our approach achieves average performance comparable to the full-data fine-tuned model across eight benchmarks.
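As a rough, hypothetical illustration of that 15% figure: generic value-based subset selection reduces to keeping the top-scoring fraction of the pool. How each example's value is estimated, which is TIVE's actual contribution, is left abstract in the sketch below.

```python
import numpy as np

def top_value_subset(values, fraction=0.15):
    """Return indices of the highest-value fraction of the pool.
    The per-example 'values' are placeholders; TIVE's estimator
    is not reproduced in this summary."""
    values = np.asarray(values, dtype=float)
    k = max(1, int(round(len(values) * fraction)))
    return np.argsort(-values)[:k]

# e.g. top_value_subset([0.2, 0.9, 0.1, 0.7], fraction=0.5) -> array([1, 3])
```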
arXiv Detail & Related papers (2024-03-14T16:47:25Z)
- From Pretext to Purpose: Batch-Adaptive Self-Supervised Learning [32.18543787821028]
This paper proposes an adaptive batch-fusion technique for self-supervised contrastive learning.
It achieves state-of-the-art performance under equitable comparisons.
We suggest that the proposed method may contribute to the advancement of data-driven self-supervised learning research.
arXiv Detail & Related papers (2023-11-16T15:47:49Z)
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [64.24227572048075]
We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence, in which external knowledge is usually incorporated when recognizing novel categories of objects.
arXiv Detail & Related papers (2023-08-22T04:24:45Z)
- Partner-Assisted Learning for Few-Shot Image Classification [54.66864961784989]
Few-shot learning has been studied to mimic human visual capabilities and learn effective models without the need for exhaustive human annotation.
In this paper, we focus on the design of a training strategy that yields an elemental representation such that the prototype of each novel class can be estimated from a few labeled samples.
We propose a two-stage training scheme, which first trains a partner encoder to model pair-wise similarities and extract features serving as soft-anchors, and then trains a main encoder by aligning its outputs with soft-anchors while attempting to maximize classification performance.
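A minimal PyTorch-style sketch of the stage-two objective this summary implies: a classification term plus an alignment term pulling the main encoder's features toward the frozen partner's soft anchors. The cosine alignment and the weighting are assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def stage2_loss(main_feat, logits, labels, soft_anchor, align_weight=1.0):
    """Cross-entropy for classification plus cosine distance between the
    main encoder's features and the partner's soft anchors (assumed form)."""
    cls = F.cross_entropy(logits, labels)
    align = 1.0 - F.cosine_similarity(main_feat, soft_anchor, dim=-1).mean()
    return cls + align_weight * align
```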
arXiv Detail & Related papers (2021-09-15T22:46:19Z)
- Concept Generalization in Visual Representation Learning [39.32868843527767]
We argue that semantic relationships between seen and unseen concepts affect generalization performance.
We propose ImageNet-CoG, a novel benchmark on the ImageNet dataset that enables measuring concept generalization in a principled way.
arXiv Detail & Related papers (2020-12-10T13:13:22Z)
- A Competence-aware Curriculum for Visual Concepts Learning via Question Answering [95.35905804211698]
We propose a competence-aware curriculum for visual concept learning in a question-answering manner.
We design a neural-symbolic concept learner for learning the visual concepts and a multi-dimensional Item Response Theory (mIRT) model for guiding the learning process.
Experimental results on CLEVR show that with a competence-aware curriculum, the proposed method achieves state-of-the-art performance.
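For intuition about the IRT component: in the simplest one-dimensional (Rasch) case, P(correct) = sigmoid(ability - difficulty), and a competence-aware curriculum can prefer questions whose predicted success rate sits near a target level. The sketch below uses this 1-D simplification; the paper's mIRT model is multi-dimensional and learned, which is not reproduced here.

```python
import numpy as np

def rasch_p_correct(ability, difficulty):
    """1-D Rasch/IRT: P(correct) = sigmoid(ability - difficulty).
    A simplified stand-in for the paper's multi-dimensional IRT model."""
    return 1.0 / (1.0 + np.exp(-(ability - np.asarray(difficulty))))

def pick_next_questions(ability, difficulties, n, target=0.6):
    """Competence-aware selection: choose the n questions whose predicted
    success probability is closest to the target level."""
    p = rasch_p_correct(ability, difficulties)
    return np.argsort(np.abs(p - target))[:n]

# e.g. pick_next_questions(ability=0.0, difficulties=[-2, 0, 2], n=1) -> array([1])
```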
arXiv Detail & Related papers (2020-07-03T05:08:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.