Related papers: Rethinking Overlooked Aspects in Vision-Language Models

Rethinking Overlooked Aspects in Vision-Language Models

URL: http://arxiv.org/abs/2405.11850v1
Date: Mon, 20 May 2024 07:53:41 GMT
Title: Rethinking Overlooked Aspects in Vision-Language Models
Authors: Yuan Liu, Le Tian, Xiao Zhou, Jie Zhou,
Abstract summary: Recent advancements in vision-language models (LVLMs) have been substantial. Recent works mainly focus on introducing more pre-training and instruction tuning data to improve model's performance. This paper delves into the often-neglected aspects of data efficiency during pre-training and the selection process for instruction tuning datasets.
Score: 32.525916879333145
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in large vision-language models (LVLMs), such as GPT4-V and LLaVA, have been substantial. LLaVA's modular architecture, in particular, offers a blend of simplicity and efficiency. Recent works mainly focus on introducing more pre-training and instruction tuning data to improve model's performance. This paper delves into the often-neglected aspects of data efficiency during pre-training and the selection process for instruction tuning datasets. Our research indicates that merely increasing the size of pre-training data does not guarantee improved performance and may, in fact, lead to its degradation. Furthermore, we have established a pipeline to pinpoint the most efficient instruction tuning (SFT) dataset, implying that not all SFT data utilized in existing studies are necessary. The primary objective of this paper is not to introduce a state-of-the-art model, but rather to serve as a roadmap for future research, aiming to optimize data usage during pre-training and fine-tuning processes to enhance the performance of vision-language models.

Related papers

Optimizing V-information for Self-Supervised Pre-training Data-Effective Medical Foundation Models [15.413974936297082]
Self-supervised pre-training medical foundation models on large-scale datasets demonstrate exceptional performance. Data-effective learning approaches have been introduced to select valuable samples for foundation model pre-training. We develop an optimal data-effective learning method to optimize V-information in real-world medical domains.
arXiv Detail & Related papers (2024-08-13T10:28:54Z)
Controlling Forgetting with Test-Time Data in Continual Learning [15.455400390299593]
Ongoing Continual Learning research provides techniques to overcome catastrophic forgetting of previous information when new knowledge is acquired. We argue that test-time data hold great information that can be leveraged in a self supervised manner to refresh the model's memory of previous learned tasks.
arXiv Detail & Related papers (2024-06-19T15:56:21Z)
Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset. We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding. Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Our method goes beyond surface form cues to identify data that the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning [13.964106147449051]
Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets. We propose a novel and effective framework based on learning Visual Prompts (VPT) in the pre-trained Vision Transformers (ViT) We demonstrate that our new approximations with semantic information are superior to representative capabilities.
arXiv Detail & Related papers (2024-02-04T04:42:05Z)
Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey and Benchmark [97.8968058408759]
Pre-trained vision models (PVMs) have demonstrated remarkable adaptability across a wide range of downstream vision tasks.<n>As these models scale to billions or even trillions of parameters, conventional full fine-tuning has become increasingly impractical due to its high computational and storage demands.<n> parameter-efficient fine-tuning (PEFT) has emerged as a promising alternative, aiming to achieve performance comparable to full fine-tuning while making minimal adjustments to the model parameters.
arXiv Detail & Related papers (2024-02-03T19:12:20Z)
Functional Graphical Models: Structure Enables Offline Data-Driven Optimization [111.28605744661638]
We show how structure can enable sample-efficient data-driven optimization. We also present a data-driven optimization algorithm that infers the FGM structure itself.
arXiv Detail & Related papers (2024-01-08T22:33:14Z)
Data-Centric Long-Tailed Image Recognition [49.90107582624604]
Long-tail models exhibit a strong demand for high-quality data. Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance. There is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation.
arXiv Detail & Related papers (2023-11-03T06:34:37Z)
INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models [40.54353850357839]
We show how we can employ submodular optimization to select highly representative subsets of the training corpora. We show that the resulting models achieve up to $sim99%$ of the performance of the fully-trained models.
arXiv Detail & Related papers (2023-05-11T09:24:41Z)
Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications. Recent studies have empirically shown that training from scratch has the final performance that is no worse than this pre-training strategy. We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.