Related papers: Socratic-Geo: Synthetic Data Generation and Geometric Reasoning via Multi-Agent Interaction

URL: http://arxiv.org/abs/2602.03414v1
Date: Tue, 03 Feb 2026 11:42:25 GMT
Title: Socratic-Geo: Synthetic Data Generation and Geometric Reasoning via Multi-Agent Interaction
Authors: Zhengbo Jiao, Shaobo Wang, Zifan Zhang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang
Abstract summary: Socratic-Geo is a fully autonomous framework that couples data synthesis with model learning through multi-agent interaction. Socratic-Solver achieves 49.11 on six benchmarks using one-quarter of baseline data, surpassing strong baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, establishing a new state of the art for open-source models.
Score: 11.021067780524348
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced vision-language understanding. However, even state-of-the-art models struggle with geometric reasoning, revealing a critical bottleneck: the extreme scarcity of high-quality image-text pairs. Human annotation is prohibitively expensive, while automated methods fail to ensure fidelity and training effectiveness. Existing approaches either passively adapt to available images or employ inefficient random exploration with filtering, decoupling generation from learning needs. We propose Socratic-Geo, a fully autonomous framework that dynamically couples data synthesis with model learning through multi-agent interaction. The Teacher agent generates parameterized Python scripts with reflective feedback (Reflect for solvability, RePI for visual validity), ensuring image-text pair purity. The Solver agent optimizes reasoning through preference learning, with failure paths guiding Teacher's targeted augmentation. Independently, the Generator learns image generation capabilities on accumulated "image-code-instruction" triplets, distilling programmatic drawing intelligence into visual generation. Starting from only 108 seed problems, Socratic-Solver achieves 49.11 on six benchmarks using one-quarter of baseline data, surpassing strong baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, establishing new state-of-the-art for open-source models, surpassing Seedream-4.0 (39.8%) and approaching Gemini-2.5-Flash-Image (43.1%).
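The abstract's idea of a parameterized script paired with a solvability check can be illustrated with a minimal sketch. This is not the paper's code: the right-triangle template, the names `TriangleProblem`, `make_right_triangle_problem`, `is_solvable`, and `generate`, and the parameter ranges are all hypothetical, and `is_solvable` is only a simple stand-in for the kind of check the abstract calls Reflect. Image rendering (and any RePI-style visual check) is omitted.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class TriangleProblem:
    a: float        # length of leg AC (hypothetical parameter)
    b: float        # length of leg BC (hypothetical parameter)
    question: str   # text half of the image-text pair
    answer: float   # ground-truth hypotenuse AB


def make_right_triangle_problem(a: float, b: float) -> TriangleProblem:
    """Parameterized template: two legs in, a hypotenuse question out."""
    question = (
        f"In right triangle ABC with the right angle at C, "
        f"AC = {a} and BC = {b}. Find AB."
    )
    return TriangleProblem(a=a, b=b, question=question, answer=math.hypot(a, b))


def is_solvable(p: TriangleProblem) -> bool:
    """Hypothetical solvability check: sides must be positive and the
    stored answer must agree with the Pythagorean theorem."""
    if p.a <= 0 or p.b <= 0:
        return False
    return math.isclose(p.answer ** 2, p.a ** 2 + p.b ** 2, rel_tol=1e-9)


def generate(n: int, seed: int = 0) -> list[TriangleProblem]:
    """Sample parameters, keep only problems that pass the check."""
    rng = random.Random(seed)
    problems: list[TriangleProblem] = []
    while len(problems) < n:
        # The range deliberately includes invalid (non-positive) sides
        # so that the check has something to reject.
        a, b = rng.randint(-2, 12), rng.randint(-2, 12)
        candidate = make_right_triangle_problem(a, b)
        if is_solvable(candidate):
            problems.append(candidate)
    return problems
```

Calling `generate(5)` returns five problems whose parameters all passed the check; in a fuller pipeline the same parameters would also drive the drawing code, so the question text and the image stay consistent by construction.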

Related papers

GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning [54.42973725693]
We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. GenAgent significantly boosts base generator (FLUX.1-dev) performance on GenEval++ and WISE. Our framework demonstrates three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks.
arXiv Detail & Related papers (2026-01-26T14:49:04Z)
Iterative Refinement Improves Compositional Image Generation [47.116050084875106]
Text-to-image (T2I) models struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. We propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models.
arXiv Detail & Related papers (2026-01-21T18:59:40Z)
A Multimodal, Multitask System for Generating E Commerce Text Listings from Images [0.0]
We propose an end-to-end, multi-task system that generates factually grounded textual listings from a single image. The hierarchical generation process proves highly effective, cutting the factual hallucination rate from 12.7% to 7.1%. One minor caveat is that the model performs 3.5% worse than a direct vision-to-language model on ROUGE-L score.
arXiv Detail & Related papers (2025-10-22T11:50:49Z)
GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning [71.47606279139679]
GenView++ is a unified framework for image-based contrastive learning. It introduces a multi-source adaptive view generation mechanism to synthesize diverse yet semantically coherent views. A quality-driven contrastive learning mechanism assesses each pair's semantic alignment and diversity to dynamically reweight their training contribution. Experiments demonstrate the effectiveness of GenView++ across both vision and vision-language tasks.
arXiv Detail & Related papers (2025-09-28T09:35:37Z)
Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation [110.03631978640298]
We present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics. We show that these issues can be effectively addressed by introducing self-supervised objectives during training.
arXiv Detail & Related papers (2025-09-18T17:47:40Z)
Interleaving Reasoning for Better Text-to-Image Generation [83.69082794730664]
We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals. Experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN.
arXiv Detail & Related papers (2025-09-08T17:56:23Z)
TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z)
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [86.69947123512836]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples [34.71588837946776]
We propose CounterCurate, a framework to improve visio-linguistic compositional reasoning. In particular, we identify two critical under-explored problems: the neglect of the physically grounded reasoning. We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning. We then apply simple data augmentation using the grounded image generation model GLIGEN to generate fine-tuning data, resulting in significant performance improvements.
arXiv Detail & Related papers (2024-02-20T18:59:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.