Related papers: Synthetic Curriculum Reinforces Compositional Text-to-Image Generation

URL: http://arxiv.org/abs/2511.18378v1
Date: Sun, 23 Nov 2025 09:56:24 GMT
Title: Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
Authors: Shijian Wang, Runhao Fu, Siyi Zhao, Qingqin Zhan, Xingjian Wang, Jiarui Jin, Yuan Lu, Hanqian Wu, Cunjian Chen
Abstract summary: We propose a novel compositional curriculum reinforcement learning framework named CompGen. We leverage scene graphs to establish a novel difficulty criterion for compositional ability and develop a corresponding adaptive Markov Chain Monte Carlo graph sampling algorithm. Our experiments reveal that CompGen exhibits distinct scaling curves under different curriculum scheduling strategies.
Score: 8.547259329102227
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-Image (T2I) generation has long been an open problem, with compositional synthesis remaining particularly challenging. This task requires accurate rendering of complex scenes containing multiple objects that exhibit diverse attributes as well as intricate spatial and semantic relationships, demanding both precise object placement and coherent inter-object interactions. In this paper, we propose a novel compositional curriculum reinforcement learning framework named CompGen that addresses compositional weakness in existing T2I models. Specifically, we leverage scene graphs to establish a novel difficulty criterion for compositional ability and develop a corresponding adaptive Markov Chain Monte Carlo graph sampling algorithm. This difficulty-aware approach enables the synthesis of training curriculum data that progressively optimize T2I models through reinforcement learning. We integrate our curriculum learning approach into Group Relative Policy Optimization (GRPO) and investigate different curriculum scheduling strategies. Our experiments reveal that CompGen exhibits distinct scaling curves under different curriculum scheduling strategies, with easy-to-hard and Gaussian sampling strategies yielding superior scaling performance compared to random sampling. Extensive experiments demonstrate that CompGen significantly enhances compositional generation capabilities for both diffusion-based and auto-regressive T2I models, highlighting its effectiveness in improving compositional T2I generation systems.

Related papers

LVLM-Composer's Explicit Planning for Image Generation [0.0]
We introduce LVLM-Composer, a novel 10-billion parameter scale LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a Fine-Grained Feature Alignment Mechanism for precise visual guidance during generation. Experiments on the LongBench-T2I benchmark, utilizing automatic evaluation by Gemini-2.0-Flash and InternVL3-78B, demonstrate LVLM-Composer's superior performance across critical compositional dimensions.
arXiv Detail & Related papers (2025-07-05T20:21:03Z)
AdaptGOT: A Pre-trained Model for Adaptive Contextual POI Representation Learning [7.277204616781735]
We propose the AdaptGOT model, which integrates the Adaptive representation learning technique and the Geographical-Co-Occurrence-Text representation. The AdaptGOT model comprises three key components: (1) contextual neighborhood generation, which integrates advanced mixed sampling techniques such as KNN, density-based, importance-based, and category-aware strategies to capture complex contextual neighborhoods; (2) an advanced GOT representation enhanced by an attention mechanism, designed to derive high-quality, customized representations and efficiently capture complex interrelations between POIs; and (3) the MoE-based adaptive encoder-decoder architecture, which ensures topological consistency and enriches contextual representation by
arXiv Detail & Related papers (2025-06-21T08:06:06Z)
CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback [58.27353205269664]
State-of-the-art T2I models are capable of generating high-resolution images given textual prompts. However, they struggle with accurately depicting compositional scenes that specify multiple objects, attributes, and spatial relations. We present CompAlign, a challenging benchmark with an emphasis on assessing the depiction of 3D-spatial relationships.
arXiv Detail & Related papers (2025-05-16T12:23:58Z)
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation [70.8833857249951]
IterComp is a novel framework that aggregates composition-aware model preferences from multiple models. We propose an iterative feedback learning method to enhance compositionality in a closed-loop manner. IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation.
arXiv Detail & Related papers (2024-10-09T17:59:13Z)
Dual Advancement of Representation Learning and Clustering for Sparse and Noisy Images [14.836487514037994]
Sparse and noisy images (SNIs) pose significant challenges for effective representation learning and clustering. We propose Dual Advancement of Representation Learning and Clustering (DARLC) to enhance the representations derived from masked image modeling. Our framework offers a comprehensive approach that improves the learning of representations by enhancing their local perceptibility, distinctiveness, and the understanding of relational semantics.
arXiv Detail & Related papers (2024-09-03T10:52:27Z)
Multi-Task Curriculum Graph Contrastive Learning with Clustering Entropy Guidance [25.5510013711661]
We propose the Clustering-guided Curriculum Graph contrastive Learning (CCGL) framework. CCGL uses clustering entropy as the guidance of the following graph augmentation and contrastive learning. Experimental results demonstrate that CCGL has achieved excellent performance compared to state-of-the-art competitors.
arXiv Detail & Related papers (2024-08-22T02:18:47Z)
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery. Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data. In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation [62.96628432641806]
Scene Graph Generation aims to first encode the visual contents within the given image and then parse them into a compact summary graph. We first present a novel Stacked Hybrid-Attention network, which facilitates the intra-modal refinement as well as the inter-modal interaction. We then devise an innovative Group Collaborative Learning strategy to optimize the decoder.
arXiv Detail & Related papers (2022-03-18T09:14:13Z)
Revisiting LSTM Networks for Semi-Supervised Text Classification via Mixed Objective Function [106.69643619725652]
We develop a training strategy that allows even a simple BiLSTM model, when trained with cross-entropy loss, to achieve competitive results. We report state-of-the-art results for text classification task on several benchmark datasets.
arXiv Detail & Related papers (2020-09-08T21:55:22Z)
Learning Deformable Image Registration from Optimization: Perspective, Modules, Bilevel Training and Beyond [62.730497582218284]
We develop a new deep learning based framework to optimize a diffeomorphic model via multi-scale propagation. We conduct two groups of image registration experiments on 3D volume datasets including image-to-atlas registration on brain MRI data and image-to-image registration on liver CT data.
arXiv Detail & Related papers (2020-04-30T03:23:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.