Gamified crowd-sourcing of high-quality data for visual fine-tuning
- URL: http://arxiv.org/abs/2410.04038v2
- Date: Tue, 8 Oct 2024 02:37:41 GMT
- Title: Gamified crowd-sourcing of high-quality data for visual fine-tuning
- Authors: Shashank Yadav, Rohan Tomar, Garvit Jain, Chirag Ahooja, Shubham Chaudhary, Charles Elkan,
- Abstract summary: This paper introduces Gamified Adversarial Prompting (GAP), a framework that crowd-sources high-quality data for visual instruction tuning of large multimodal models.
GAP transforms the data collection process into an engaging game, incentivizing players to provide fine-grained, challenging questions and answers that target gaps in the model's knowledge.
- Score: 0.9487395978583629
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces Gamified Adversarial Prompting (GAP), a framework that crowd-sources high-quality data for visual instruction tuning of large multimodal models. GAP transforms the data collection process into an engaging game, incentivizing players to provide fine-grained, challenging questions and answers that target gaps in the model's knowledge. Our contributions include (1) an approach to capture question-answer pairs from humans that directly address weaknesses in a model's knowledge, (2) a method for evaluating and rewarding players that successfully incentivizes them to provide high-quality submissions, and (3) a scalable, gamified platform that succeeds in collecting this data from over 50,000 participants in just a few weeks. Our implementation of GAP has significantly improved the accuracy of a small multimodal model, namely MiniCPM-Llama3-V-2.5-8B, increasing its GPT score from 0.147 to 0.477 on our dataset, approaching the benchmark set by the much larger GPT-4V. Moreover, we demonstrate that the data generated using MiniCPM-Llama3-V-2.5-8B also enhances its performance across other benchmarks, and exhibits cross-model benefits. Specifically, the same data improves the performance of QWEN2-VL-2B and QWEN2-VL-7B on the same multiple benchmarks.
Related papers
- DAMA: Data- and Model-aware Alignment of Multi-modal LLMs [31.116618294885065]
We propose Data- and Model-aware DPO (DAMA) to adjust the optimization process from two key aspects.
By combining the two strategies, DAMA enables the model to effectively adapt to data with varying levels of hardness.
arXiv Detail & Related papers (2025-02-04T02:30:36Z) - VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [66.56298924208319]
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems.
Current assessment methods rely on AI-annotated preference labels from traditional tasks.
We introduce VL-RewardBench, a benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks.
arXiv Detail & Related papers (2024-11-26T14:08:34Z) - Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z) - Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z) - Improving Data Efficiency via Curating LLM-Driven Rating Systems [30.233724785974143]
We introduce DS2, a Diversity-aware Score curation method for Data Selection.
By systematically modeling error patterns through a score transition matrix, DS2 corrects LLM-based scores and promotes diversity in the selected data samples.
Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks.
arXiv Detail & Related papers (2024-10-09T10:07:55Z) - MAmmoTH2: Scaling Instructions from the Web [39.786198452175505]
We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus.
We build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks.
Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-05-06T15:11:38Z) - TextSquare: Scaling up Text-Centric Visual Instruction Tuning [64.55339431760727]
We introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M.
Our model, TextSquare, considerably surpasses open-source previous state-of-the-art Text-centric MLLMs.
It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks.
arXiv Detail & Related papers (2024-04-19T11:38:08Z) - Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation [56.13803674092712]
We propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR)
CaR employs a two-step process: first, it ranks instruction pairs using a high-accuracy (84.25%) scoring model aligned with expert preferences; second, it preserves dataset diversity through clustering.
In our experiment, CaR efficiently selected a mere 1.96% of Alpaca's IT data, yet the resulting AlpaCaR model surpassed Alpaca's performance by an average of 32.1% in GPT-4 evaluations.
arXiv Detail & Related papers (2024-02-28T09:27:29Z) - Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs)
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model Silkie, achieves 6.9% and 9.5% relative improvement on the MME benchmark regarding the perception and cognition capabilities.
arXiv Detail & Related papers (2023-12-17T09:44:27Z) - Complementary Ensemble Learning [1.90365714903665]
We derive a technique to improve performance of state-of-the-art deep learning models.
Specifically, we train auxiliary models which are able to complement state-of-the-art model uncertainty.
arXiv Detail & Related papers (2021-11-09T03:23:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.