VILA$^2$: VILA Augmented VILA
- URL: http://arxiv.org/abs/2407.17453v2
- Date: Thu, 31 Oct 2024 23:23:22 GMT
- Title: VILA$^2$: VILA Augmented VILA
- Authors: Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jan Kautz, Jang Hyun Cho, Marco Pavone, Song Han, Hongxu Yin
- Abstract summary: We introduce a simple yet effective VLM augmentation scheme that includes a self-augment step and a specialist-augment step.
We observe improvements in data quality and downstream accuracy boosts with three self-augmentation rounds.
We finetune domain-specific VLM specialists (spatial, grounding, and OCR) from the self-augmented VLM to fuse task-aware synthetic data into the pretraining stage.
- Score: 69.5318347688297
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While visual language model architectures and training infrastructures advance rapidly, data curation remains under-explored, and data quantity and quality have become a bottleneck. Existing work either crawls extra Internet data with loose quality guarantees or distills from black-box proprietary models, e.g., GPT-4V / Gemini, which are bounded by API rate limits and model performance. This work enables a VLM to improve itself via data enhancement, exploiting its generative nature. We introduce a simple yet effective VLM augmentation scheme that includes a self-augment step and a specialist-augment step to iteratively improve data quality and, hence, model performance. In the self-augment step, the instruction-finetuned VLM recaptions its pretraining caption datasets and is then retrained from scratch on the refined data. Without any expensive human-in-the-loop annotation, we observe improvements in data quality and downstream accuracy boosts over three self-augmentation rounds -- a viable free lunch for the current VLM training recipe. When self-augmentation saturates, we augment caption diversity by leveraging specialty skills picked up during instruction finetuning. We finetune domain-specific VLM specialists (spatial, grounding, and OCR) from the self-augmented VLM to fuse task-aware synthetic data into the pretraining stage. Data quality improvements and hallucination reductions are cross-checked by VLM judges (GPT-4V, Gemini) and human judges. Combining self-augmentation and specialist-augmented training, VILA$^2$ consistently improves accuracy over the prior art on a wide range of benchmarks, producing a reusable pretraining dataset that is 300x more cost-efficient than human labeling.
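The abstract's self-augment loop (recaption, then retrain from scratch, repeated for a few rounds) can be summarized in a short sketch. This is a minimal illustration only, with the pretraining, instruction-finetuning, and recaptioning steps passed in as placeholder callables rather than the authors' actual pipeline:

```python
from typing import Callable, List, Tuple

Caption = Tuple[str, str]  # (image_path, caption) pair


def self_augment(
    caption_data: List[Caption],
    pretrain: Callable[[List[Caption]], object],   # pretrains a VLM from scratch
    finetune: Callable[[object], object],          # instruction-finetunes the VLM
    recaption: Callable[[object, str, str], str],  # the VLM rewrites one caption
    rounds: int = 3,
) -> Tuple[object, List[Caption]]:
    """Sketch of the self-augment step: the instruction-finetuned VLM recaptions
    its own pretraining data, and the next round retrains from scratch on the
    refined captions."""
    data = list(caption_data)
    vlm = None
    for _ in range(rounds):
        vlm = pretrain(data)   # retrain from scratch on the current captions
        vlm = finetune(vlm)    # enables the model to follow recaption prompts
        data = [(img, recaption(vlm, img, cap)) for img, cap in data]
    return vlm, data
```

When this loop saturates, the specialist-augment step would, in the same spirit, finetune spatial, grounding, and OCR specialists from the final checkpoint and merge their task-aware captions back into `data` before the next pretraining run.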
Related papers
- Synthetic Data is an Elegant GIFT for Continual Vision-Language Models [52.343627275005026]
GIFT is a novel continual fine-tuning approach to overcome catastrophic forgetting in Vision-Language Models.
We employ a pre-trained diffusion model to recreate both pre-training and learned downstream task data.
Our method consistently outperforms previous state-of-the-art approaches across various settings.
arXiv Detail & Related papers (2025-03-06T09:09:18Z)
- RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception [20.01853641155509]
Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems.
We propose a new generalizable framework to improve VLM fine-tuning by integrating it with a reinforcement learning (RL) agent.
arXiv Detail & Related papers (2025-01-31T04:30:42Z)
- Language Models as Continuous Self-Evolving Data Engineers [32.67875951851165]
Large Language Models (LLMs) have demonstrated remarkable capabilities on various tasks.
Traditional training approaches rely too much on expert-labeled data.
We propose a novel paradigm named LANCE that enables LLMs to train themselves by autonomously generating, cleaning, reviewing, and annotating data.
arXiv Detail & Related papers (2024-12-19T18:28:41Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
- Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models [31.08312208507481]
Turbo is a plug-in that sorts data by information degree and uses only the top-ranked items to save costs.
Extensive experiments on multiple VLM benchmarks demonstrate that Turbo achieves substantial acceleration with a negligible performance drop.
arXiv Detail & Related papers (2024-07-16T13:35:26Z)
- Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance [56.04768229686853]
Large Vision-Language Models (LVLMs) tend to hallucinate non-existing objects in the images.
We introduce a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE).
MARINE is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process.
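As background, classifier-free guidance contrasts conditional and unconditional predictions at decoding time; a generic logit-space form (stated here as standard background, not necessarily MARINE's exact formulation) is
$$\tilde{\ell}_t = \ell_t^{\mathrm{uncond}} + \gamma\,\left(\ell_t^{\mathrm{cond}} - \ell_t^{\mathrm{uncond}}\right),$$
where the guidance scale $\gamma$ controls how strongly generation is pushed toward the conditioned prediction ($\gamma = 1$ recovers ordinary conditional decoding).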
arXiv Detail & Related papers (2024-02-13T18:59:05Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Models [25.856254802834375]
This paper highlights the severity of data redundancy and designs a plug-and-play Turbo module, guided by information degree, to prune uninformative tokens from visual or textual data.
Turbo works as a user-friendly plug-in that sorts data by information degree and uses only the top-ranked items to save costs, as sketched below.
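A minimal, generic sketch of this kind of informativity-driven pruning: keep only the top-scoring fraction of items, where the "information degree" scores are assumed to come from whatever scorer the method defines (a placeholder here, not Turbo's actual scoring function):

```python
import numpy as np


def prune_by_information(items, info_scores, keep_ratio=0.5):
    """Keep only the top `keep_ratio` fraction of items ranked by an externally
    supplied 'information degree' score (higher = more informative)."""
    scores = np.asarray(info_scores, dtype=float)
    k = max(1, int(len(items) * keep_ratio))
    keep = np.argsort(scores)[::-1][:k]      # indices of the top-k scores
    return [items[i] for i in sorted(keep)]  # preserve the original ordering
```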
arXiv Detail & Related papers (2023-12-12T16:27:35Z)
- Rethinking the Instruction Quality: LIFT is What You Need [20.829372251475476]
Existing quality improvement methods alter instruction data through dataset expansion or curation.
We propose LIFT (LLM Instruction Fusion Transfer), a novel and versatile paradigm designed to elevate the instruction quality to new heights.
Experimental results demonstrate that, even with a limited quantity of high-quality instruction data selected by our paradigm, LLMs consistently uphold robust performance across various tasks.
arXiv Detail & Related papers (2023-12-12T03:30:21Z)
- INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models [40.54353850357839]
We show how we can employ submodular optimization to select highly representative subsets of the training corpora.
We show that the resulting models achieve up to $\sim99\%$ of the performance of the fully-trained models.
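One standard way to instantiate such submodular subset selection is greedy maximization of a facility-location objective over pairwise similarities; the sketch below uses that objective as an assumed stand-in and is not claimed to be INGENIOUS's exact formulation:

```python
import numpy as np


def greedy_facility_location(similarity: np.ndarray, budget: int) -> list:
    """Greedy maximization of f(S) = sum_i max_{j in S} sim[i, j], a classic
    submodular proxy for how well the subset S represents the full corpus."""
    n = similarity.shape[0]
    selected: list = []
    coverage = np.zeros(n)  # coverage[i] = max similarity of i to the chosen subset
    for _ in range(min(budget, n)):
        # marginal gain of adding each candidate column j
        gains = np.maximum(similarity, coverage[:, None]).sum(axis=0) - coverage.sum()
        gains[selected] = -np.inf  # never re-select a chosen index
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, similarity[:, j])
    return selected
```

In practice the similarity matrix would be computed from embeddings of the training samples, and `selected` indexes the retained subset.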
arXiv Detail & Related papers (2023-05-11T09:24:41Z)
- On Automatic Data Augmentation for 3D Point Cloud Classification [19.338266486983176]
We propose to automatically learn a data augmentation strategy using bilevel optimization.
An augmentor is designed in a similar fashion to a conditional generator and is optimized by minimizing a base model's loss on a validation set.
We evaluate our approach on standard point cloud classification tasks and a more challenging setting with pose misalignment between training and validation/test sets.
arXiv Detail & Related papers (2021-12-11T17:14:16Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To manage the size of the created dataset, we propose to apply a dataset distillation strategy to compress it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.