Related papers: World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

URL: http://arxiv.org/abs/2409.20424v1
Date: Mon, 30 Sep 2024 15:49:54 GMT
Title: World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering
Authors: Jiacong Wang, Bohong Wu, Haiyong Jiang, Xun Zhou, Xin Xiao, Haoyuan Guo, Jun Xiao,
Abstract summary: We present World to Code (W2C), a meticulously curated multi-modal data construction pipeline. The pipeline organizes the final generation output into a Python code format. Experiments have demonstrated the high quality of W2C by improving various existing visual question answering and visual grounding benchmarks.
Score: 16.03491048830499
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multi-modal alignment data have inspired numerous researches on synthetic VLM data generation. The conventional norm in VLM data construction uses a mixture of specialists in caption and OCR, or stronger VLM APIs and expensive human annotation. In this paper, we present World to Code (W2C), a meticulously curated multi-modal data construction pipeline that organizes the final generation output into a Python code format. The pipeline leverages the VLM itself to extract cross-modal information via different prompts and filter the generated outputs again via a consistency filtering strategy. Experiments have demonstrated the high quality of W2C by improving various existing visual question answering and visual grounding benchmarks across different VLMs. Further analysis also demonstrates that the new code parsing ability of VLMs presents better cross-modal equivalence than the commonly used detail caption ability. Our code is available at https://github.com/foundation-multimodal-models/World2Code.

Related papers

Trust the Model: Compact VLMs as In-Context Judges for Image-Text Data Quality [5.750869893508341]
Vision-language models (VLMs) extend the conventional large language models by integrating visual data, enabling richer multimodal reasoning.<n>We introduce a streamlined data filtration framework that employs a compact VLM, fine-tuned on a high-quality image-caption annotated dataset.<n>This model effectively evaluates and filters potential training samples based on caption and image quality and alignment.
arXiv Detail & Related papers (2025-07-27T07:20:25Z)
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models [63.27511432647797]
Vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V.<n>Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V.
arXiv Detail & Related papers (2025-06-18T17:59:49Z)
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning [71.98260064022452]
We introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models.<n>Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and connector that projects visual features into the language embedding space.
arXiv Detail & Related papers (2025-05-22T17:23:26Z)
LaViDa: A Large Diffusion Language Model for Multimodal Understanding [70.99233885354028]
LaViDa is a family of Vision-Language Models built on Discrete diffusion models.<n>DMs offer parallel decoding for faster inference and bidirectional context for controllable generation.<n>LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks.
arXiv Detail & Related papers (2025-05-22T16:07:12Z)
GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models [20.976319536167512]
We aim to establish an effective solution that enhances long context performance of Visual Language Models. We propose Giraffe, which is effectively extended to 128K lengths. We will open-source the code, data, and models.
arXiv Detail & Related papers (2024-12-17T09:57:21Z)
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding [40.784423313750075]
Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios. We propose a novel positional encoding approach that employs variable increments for visual tokens, enabling more efficient management of long multimodal sequences. We show that the fine-tuned model achieves strong performance on both standard and long-context multimodal tasks.
arXiv Detail & Related papers (2024-12-12T18:59:46Z)
Video Instruction Tuning With Synthetic Data [84.64519990333406]
We create a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM.
arXiv Detail & Related papers (2024-10-03T17:36:49Z)
NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks. We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
Ovis: Structural Embedding Alignment for Multimodal Large Language Model [41.32013722697081]
Ovis is a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder's process. Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs.
arXiv Detail & Related papers (2024-05-31T13:59:18Z)
AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data. We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z)
Progressive Multi-modal Conditional Prompt Tuning [92.50645776024624]
Pre-trained vision-language models (VLMs) have shown remarkable generalization capabilities via prompting. We propose a novel method, Progressive Multi-modal conditional Prompt Tuning (ProMPT) ProMPT exploits a recurrent structure, optimizing and aligning V-L features by iteratively utilizing image and current encoding information.
arXiv Detail & Related papers (2024-04-18T02:40:31Z)
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters [38.41887207958015]
We propose a novel framework for filtering image-text data by leveraging fine-tuned Multimodal Language Models (MLMs) Our filter can generalize to different models and tasks, and be used as a drop-in replacement for CLIPScore.
arXiv Detail & Related papers (2024-03-05T06:05:15Z)
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [11.786387517781328]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs and models that both accept and produce multimodal inputs and outputs. We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z)
Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tune data. We present a novel alignment strategy that employs multimodal AI system to oversee itself called Reinforcement Learning from AI Feedback (RLAIF) In specific, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)
Vision-Language Instruction Tuning: A Review and Analysis [52.218690619616474]
Vision-Language Instruction Tuning (VLIT) presents more complex characteristics compared to pure text instruction tuning. We offer a detailed categorization for existing VLIT datasets and identify the characteristics that high-quality VLIT data should possess. By incorporating these characteristics as guiding principles into the existing VLIT data construction process, we conduct extensive experiments and verify their positive impact on the performance of tuned multi-modal LLMs.
arXiv Detail & Related papers (2023-11-14T14:02:32Z)
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD) Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning. The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization [14.881597737762316]
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc. We then propose name, a factorized multi-modal Transformer based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task.
arXiv Detail & Related papers (2021-05-20T08:56:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.