Related papers: Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development

URL: http://arxiv.org/abs/2407.11784v1
Date: Tue, 16 Jul 2024 14:40:07 GMT
Title: Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development
Authors: Daoyuan Chen, Haibin Wang, Yilun Huang, Ce Ge, Yaliang Li, Bolin Ding, Jingren Zhou
Abstract summary: We present a novel sandbox suite tailored for integrated data-model co-development. This sandbox provides a comprehensive experimental platform, enabling rapid iteration and insight-driven refinement of both data and models. We also uncover fruitful insights gleaned from exhaustive benchmarks, shedding light on the critical interplay between data quality, diversity, and model behavior.
Score: 67.55944651679864
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The emergence of large-scale multi-modal generative models has drastically advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging due to historically isolated paths of model-centric and data-centric developments, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a novel sandbox suite tailored for integrated data-model co-development. This sandbox provides a comprehensive experimental platform, enabling rapid iteration and insight-driven refinement of both data and models. Our proposed "Probe-Analyze-Refine" workflow, validated through applications on state-of-the-art LLaVA-like and DiT-based models, yields significant performance boosts, such as topping the VBench leaderboard. We also uncover fruitful insights gleaned from exhaustive benchmarks, shedding light on the critical interplay between data quality, diversity, and model behavior. With the hope of fostering deeper understanding and future progress in multi-modal data and generative modeling, our code, datasets, and models are maintained and accessible at https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md.

Related papers

NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z)
Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models [24.73190742678142]
We study the risk of generative model collapse in multi-modal vision-language generative systems. We find that model collapse exhibits distinct characteristics in the multi-modal context, such as improved vision-language alignment and increased variance in image-captioning task. Our findings provide initial insights and practical guidelines for reducing the risk of model collapse in self-improving multi-agent AI systems.
arXiv Detail & Related papers (2025-05-10T22:42:29Z)
Improved visual-information-driven model for crowd simulation and its modular application [4.683197108420276]
Data-driven crowd simulation models offer advantages in enhancing the accuracy and realism of simulations. It is still an open question to develop data-driven crowd simulation models with strong generalizability. This paper proposes a data-driven model incorporating a refined visual information extraction method and exit cues to enhance generalizability.
arXiv Detail & Related papers (2025-04-02T07:53:33Z)
LeForecast: Enterprise Hybrid Forecast by Time Series Intelligence [10.203492575046015]
LeForecast is an enterprise intelligence platform tailored for time series tasks. It integrates advanced interpretations of time series data and multi-source information, and a three-pillar modelling engine. This work reviews deployment of LeForecast and its performance in three industrial use cases.
arXiv Detail & Related papers (2025-03-27T02:58:06Z)
Active Learning of Model Discrepancy with Bayesian Experimental Design [0.0]
We propose an efficient approach to learn the model discrepancy based on the data from a sequential Bayesian experimental design (BED). We show that the proposed method is efficient and robust for the active learning of high-dimensional model discrepancy, using data suggested by the sequential BED. We also demonstrate that the proposed method is compatible with both classical numerical solvers and modern auto-differentiable solvers.
arXiv Detail & Related papers (2025-02-07T22:54:20Z)
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities. It supports more critical tasks including data analysis, annotation, and foundation model post-training. It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
A Collaborative Ensemble Framework for CTR Prediction [73.59868761656317]
We propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models. Unlike naive model scaling, our approach emphasizes diversity and collaboration through collaborative learning. We validate our framework on three public datasets and a large-scale industrial dataset from Meta.
arXiv Detail & Related papers (2024-11-20T20:38:56Z)
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models [36.576853882830896]
We introduce EvolveDirector to train a text-to-image generation model comparable to advanced models using publicly available resources. This framework interacts with advanced models through their public APIs to obtain text-image data pairs to train a base model. We leverage pre-trained large vision-language models (VLMs) to guide the evolution of the base model.
arXiv Detail & Related papers (2024-10-09T17:52:28Z)
Knowledge Fusion By Evolving Weights of Language Models [5.354527640064584]
This paper examines the approach of integrating multiple models into a unified model. We propose a knowledge fusion method named Evolver, inspired by evolutionary algorithms.
arXiv Detail & Related papers (2024-06-18T02:12:34Z)
Recency-Weighted Temporally-Segmented Ensemble for Time-Series Modeling [0.0]
Time-series modeling in process industries faces the challenge of dealing with complex, multi-faceted, and evolving data characteristics. We introduce the Recency-Weighted Temporally-Segmented (ReWTS) ensemble model, a novel chunk-based approach for multi-step forecasting. We present a comparative analysis, utilizing two years of data from a wastewater treatment plant and a drinking water treatment plant in Norway.
arXiv Detail & Related papers (2024-03-04T16:00:35Z)
A Lightweight Feature Fusion Architecture For Resource-Constrained Crowd Counting [3.5066463427087777]
We introduce two lightweight models to enhance the versatility of crowd-counting models. These models maintain the same downstream architecture while incorporating two distinct backbones: MobileNet and MobileViT. We leverage Adjacent Feature Fusion to extract diverse scale features from a Pre-Trained Model (PTM) and subsequently combine these features seamlessly.
arXiv Detail & Related papers (2024-01-11T15:13:31Z)
Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other. We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models. Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks. Do these models capture the rich multimodal structures and dynamics from video and text jointly? Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z)
Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. This creates a barrier to fusing knowledge across individual models to yield a better single model. We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
Multi-Modal Experience Inspired AI Creation [33.34566822058209]
We study how to generate texts based on sequential multi-modal information. We first design a multi-channel sequence-to-sequence architecture equipped with a multi-modal attention network. We then propose a curriculum negative sampling strategy tailored for the sequential inputs.
arXiv Detail & Related papers (2022-09-02T11:50:41Z)
Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data. Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.