C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning
- URL: http://arxiv.org/abs/2507.16518v2
- Date: Tue, 29 Jul 2025 07:40:20 GMT
- Title: C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning
- Authors: Xiuwei Chen, Wentao Hu, Hanhui Li, Jun Zhou, Zisheng Chen, Meng Cao, Yihan Zeng, Kui Zhang, Yu-Jie Yuan, Jianhua Han, Hang Xu, Xiaodan Liang,
- Abstract summary: C2-Evo is an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities.<n>We show that C2-Evo consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks.
- Score: 78.36259648527401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks. Our code, models, and datasets will be released.
Related papers
- HuggingGraph: Understanding the Supply Chain of LLM Ecosystem [9.61483474473764]
Large language models (LLMs) leverage deep learning to process and predict sequences of words from context.<n>As a result, platforms that host models and datasets are widely used.<n>This project aims to study the relationships between models and datasets, which are core components of the LLM supply chain.
arXiv Detail & Related papers (2025-07-17T17:34:13Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.<n>MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - Dynamic Hybrid Modeling: Incremental Identification and Model Predictive Control [0.6775616141339018]
The identification of dynamic hybrid models remains difficult due to the need to integrate data-driven models within mechanistic model structures.<n>We present an incremental identification approach for dynamic hybrid models that decouples the mechanistic and data-driven components.<n>This approach facilitates early evaluation of model structure suitability, accelerates the development of hybrid models, and allows for independent identification of data-driven components.
arXiv Detail & Related papers (2025-06-23T06:55:32Z) - Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [103.98582374569789]
Model merging aims to combine multiple expert models into a single model, thereby reducing storage and serving costs.<n>Previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks.<n>We introduce the model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models.
arXiv Detail & Related papers (2025-05-26T12:23:14Z) - Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models [24.73190742678142]
We study the risk of generative model collapse in multi-modal vision-language generative systems.<n>We find that model collapse exhibits distinct characteristics in the multi-modal context, such as improved vision-language alignment and increased variance in image-captioning task.<n>Our findings provide initial insights and practical guidelines for reducing the risk of model collapse in self-improving multi-agent AI systems.
arXiv Detail & Related papers (2025-05-10T22:42:29Z) - Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch [54.12139707822201]
We propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method.<n>By generating diverse questions from scratch, we produce a dataset of 1 million problem-solution pairs.<n>Our experiments demonstrate that models trained on our data outperform existing open-source datasets.
arXiv Detail & Related papers (2024-10-24T12:42:04Z) - An Active Learning Framework for Inclusive Generation by Large Language Models [32.16984263644299]
Large Language Models (LLMs) generate text representative of diverse sub-populations.<n>We propose a novel clustering-based active learning framework, enhanced with knowledge distillation.<n>We construct two new datasets in tandem with model training, showing a performance improvement of 2%-10% over baseline models.
arXiv Detail & Related papers (2024-10-17T15:09:35Z) - Progressively Label Enhancement for Large Language Model Alignment [42.01694160556464]
Large Language Models (LLM) alignment aims to prevent models from producing content that misaligns with human expectations.
We propose PLE, a framework that dynamically adjusts the model's training process based on the evolving quality of the generated data.
arXiv Detail & Related papers (2024-08-05T16:21:17Z) - Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development [67.55944651679864]
We present a new sandbox suite tailored for integrated data-model co-development.<n>This sandbox provides a feedback-driven experimental platform, enabling cost-effective and guided refinement of both data and models.
arXiv Detail & Related papers (2024-07-16T14:40:07Z) - SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs [6.879945062426145]
We introduce SK-VQA, a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs.<n>Through human evaluations, we confirm the high quality of the generated question-answer pairs and their contextual relevance.<n>Our results indicate that models trained on SK-VQA demonstrate enhanced generalization in both context-aware VQA and multimodal RAG settings.
arXiv Detail & Related papers (2024-06-28T01:14:43Z) - Model-agnostic multi-objective approach for the evolutionary discovery
of mathematical models [55.41644538483948]
In modern data science, it is more interesting to understand the properties of the model, which parts could be replaced to obtain better results.
We use multi-objective evolutionary optimization for composite data-driven model learning to obtain the algorithm's desired properties.
arXiv Detail & Related papers (2021-07-07T11:17:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.