Related papers: Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach

URL: http://arxiv.org/abs/2511.05057v1
Date: Fri, 07 Nov 2025 08:03:53 GMT
Title: Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach
Authors: Yuanxiang Huangfu, Chaochao Wang, Weilei Wang
Abstract summary: Role-SynthCLIP is a novel data synthesis framework that leverages multi-perspective role-playing prompts. It enhances semantic diversity and fine-grained image-text alignment of synthetic pairs. A CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs achieves a Recall@1 of 64.1% on the MS COCO validation set.
Score: 0.7446442872036001
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, while existing synthetic data generation methods primarily focus on increasing data volume, such emphasis often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs, thereby improving caption expressiveness and accuracy while keeping the total number of image-text pairs unchanged. Experimental results demonstrate the effectiveness and efficiency of our method. A CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs achieves a Recall@1 of 64.1% on the MS COCO validation set, surpassing the best existing synthetic data baseline (trained on 5M pairs) by 2.8 percentage points. The code and trained models are released at https://github.com/huangfu170/Role-SynthCLIP.
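As an illustration of the Recall@1 metric quoted in the abstract, the following is a minimal sketch, not the authors' evaluation code. It assumes a precomputed query-by-candidate similarity matrix and a set of relevant candidate indices per query; the function name `recall_at_k` and the toy data are hypothetical, and the abstract does not state which retrieval direction (image-to-text or text-to-image) the 64.1% figure refers to.

```python
from typing import Sequence, Set


def recall_at_k(similarity: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int = 1) -> float:
    """Fraction of queries with at least one relevant candidate in the top k."""
    hits = 0
    for scores, gold in zip(similarity, relevant):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gold.intersection(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Hypothetical toy data: 3 queries scored against 3 candidates.
toy_similarity = [
    [0.9, 0.1, 0.2],  # top-ranked candidate is 0, which is relevant
    [0.3, 0.2, 0.8],  # top-ranked candidate is 2, but only 1 is relevant
    [0.1, 0.4, 0.7],  # top-ranked candidate is 2, which is relevant
]
toy_relevant = [{0}, {1}, {2}]

print(recall_at_k(toy_similarity, toy_relevant, k=1))
```

In this toy case two of the three queries rank a relevant candidate first. Using a set of relevant indices per query, rather than a single index, allows for datasets where one image is paired with several captions.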

Related papers

MultiModal Fine-tuning with Synthetic Captions [9.572235167281686]
We propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs). Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions for classification tasks. Our work establishes a new paradigm for dataset enhancement that effectively bridges the gap between multimodal pre-training and fine-tuning.
arXiv Detail & Related papers (2026-01-29T09:03:45Z)
Unified Text-Image Generation with Weakness-Targeted Post-Training [57.956648078400775]
Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis.
arXiv Detail & Related papers (2026-01-07T19:19:44Z)
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm [34.02250139766494]
Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a variety of benchmarks. A substantial volume of multimodal interleaved documents remains underutilized for contrastive vision-language representation learning. We establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. We construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M.
arXiv Detail & Related papers (2025-02-18T03:58:38Z)
Multimodal Preference Data Synthetic Alignment with Reward Model [23.978820500281213]
We propose a new framework for generating synthetic data using a reward model as a proxy of human preference for effective multimodal alignment with DPO training. Experimental results indicate that integrating selected synthetic data, such as data from generative and reward models, can effectively reduce reliance on human-annotated data.
arXiv Detail & Related papers (2024-12-23T09:29:40Z)
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models [63.01630478059315]
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance. However, the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. We propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models.
arXiv Detail & Related papers (2024-10-03T17:54:52Z)
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning. Our key idea involves challenging the model to discern both matching and distinct elements. We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings [16.28853186016663]
We create synthetic image-text pairs for efficient and effective training of Visual-Language Models (VLMs). Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM. Our VLM, finetuned on synthetic data, achieves performance comparable to models trained solely on human-annotated data.
arXiv Detail & Related papers (2024-03-12T15:36:42Z)
FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services. Traditional ID-based models for CTR prediction take as input the one-hot encoded ID features of the tabular modality. Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as input sentences of the textual modality. We propose Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z)
UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC). UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets. We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.