Related papers: $VILA^2$: VILA Augmented VILA

$VILA^2$: VILA Augmented VILA

URL: http://arxiv.org/abs/2407.17453v1
Date: Wed, 24 Jul 2024 17:37:05 GMT
Title: $VILA^2$: VILA Augmented VILA
Authors: Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jang Hyun Cho, Marco Pavone, Song Han, Hongxu Yin,
Abstract summary: We introduce a novel approach that includes a self-augment step and a specialist-augment step to improve data quality and model performance. In the self-augment step, a VLM recaptions its own pretraining data to enhance data quality, and then retrains from scratch using this refined dataset to improve model performance. With the combined self-augmented and specialist-augmented training, we introduce $VILA2$ (VILA-augmented-VILA), a VLM family that consistently improves the accuracy on a wide range of tasks.
Score: 39.7645911507078
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual language models (VLMs) have rapidly progressed, driven by the success of large language models (LLMs). While model architectures and training infrastructures advance rapidly, data curation remains under-explored. When data quantity and quality become a bottleneck, existing work either directly crawls more raw data from the Internet that does not have a guarantee of data quality or distills from black-box commercial models (e.g., GPT-4V / Gemini) causing the performance upper bounded by that model. In this work, we introduce a novel approach that includes a self-augment step and a specialist-augment step to iteratively improve data quality and model performance. In the self-augment step, a VLM recaptions its own pretraining data to enhance data quality, and then retrains from scratch using this refined dataset to improve model performance. This process can iterate for several rounds. Once self-augmentation saturates, we employ several specialist VLMs finetuned from the self-augmented VLM with domain-specific expertise, to further infuse specialist knowledge into the generalist VLM through task-oriented recaptioning and retraining. With the combined self-augmented and specialist-augmented training, we introduce $VILA^2$ (VILA-augmented-VILA), a VLM family that consistently improves the accuracy on a wide range of tasks over prior art, and achieves new state-of-the-art results on MMMU leaderboard among open-sourced models.

Related papers

Unlearning the Noisy Correspondence Makes CLIP More Robust [13.619912296579402]
Noisy Correspondence (NC) samples significantly impair the performance of Vision-Language Models (VLMs)<n>We propose NCU, a Noisy Correspondence Unlearning fine-tuning framework that efficiently enhances VLMs' robustness by forgetting learned noisy knowledge.<n>We validate our approach with the prevailing CLIP model over various downstream tasks.
arXiv Detail & Related papers (2025-07-04T09:45:11Z)
Synthetic Data is an Elegant GIFT for Continual Vision-Language Models [52.343627275005026]
GIFT is a novel continual fine-tuning approach to overcome catastrophic forgetting in Vision-Language Models. We employ a pre-trained diffusion model to recreate both pre-training and learned downstream task data. Our method consistently outperforms previous state-of-the-art approaches across various settings.
arXiv Detail & Related papers (2025-03-06T09:09:18Z)
RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception [20.01853641155509]
Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems. We propose a new generalizable framework to improve VLM fine-tuning by integrating it with a reinforcement learning (RL) agent.
arXiv Detail & Related papers (2025-01-31T04:30:42Z)
Language Models as Continuous Self-Evolving Data Engineers [32.67875951851165]
Large Language Models (LLMs) have demonstrated remarkable capabilities on various tasks. Traditional training approaches rely too much on expert-labeled data. We propose a novel paradigm named LANCE that enables LLMs to train themselves by autonomously generating, cleaning, reviewing, and annotating data.
arXiv Detail & Related papers (2024-12-19T18:28:41Z)
Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets. The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods. MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections. Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models [31.08312208507481]
Turbo is a plug-in that sorts data referring to information degree, utilizing only top-level ones to save costs. On multiple VLMs benchmarks, we fully experiment to demonstrate the good acceleration of Turbo, under negligible performance drop.
arXiv Detail & Related papers (2024-07-16T13:35:26Z)
Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance [56.04768229686853]
Large Vision-Language Models (LVLMs) tend to hallucinate non-existing objects in the images. We introduce a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE) MARINE is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process.
arXiv Detail & Related papers (2024-02-13T18:59:05Z)
VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons. We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Models [25.856254802834375]
This paper pioneers the severity of data redundancy, and designs one plug-and-play Turbo module guided by information degree to prune inefficient tokens from visual or textual data. Turbo works as a user-friendly plug-in that sorts data referring to information degree, utilizing only top-level ones to save costs.
arXiv Detail & Related papers (2023-12-12T16:27:35Z)
Rethinking the Instruction Quality: LIFT is What You Need [20.829372251475476]
Existing quality improvement methods alter instruction data through dataset expansion or curation. We propose LIFT (LLM Instruction Fusion Transfer), a novel and versatile paradigm designed to elevate the instruction quality to new heights. Experimental results demonstrate that, even with a limited quantity of high-quality instruction data selected by our paradigm, LLMs consistently uphold robust performance across various tasks.
arXiv Detail & Related papers (2023-12-12T03:30:21Z)
INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models [40.54353850357839]
We show how we can employ submodular optimization to select highly representative subsets of the training corpora. We show that the resulting models achieve up to $sim99%$ of the performance of the fully-trained models.
arXiv Detail & Related papers (2023-05-11T09:24:41Z)
On Automatic Data Augmentation for 3D Point Cloud Classification [19.338266486983176]
We propose to automatically learn a data augmentation strategy using bilevel optimization. An augmentor is designed in a similar fashion to a conditional generator and is optimized by minimizing a base model's loss on a validation set. We evaluate our approach on standard point cloud classification tasks and a more challenging setting with pose misalignment between training and validation/test sets.
arXiv Detail & Related papers (2021-12-11T17:14:16Z)
Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training. We experimentally verify that the new dataset can significantly improve the ability of the learned FER model. To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.