UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
- URL: http://arxiv.org/abs/2512.02790v1
- Date: Mon, 01 Dec 2025 17:45:44 GMT
- Title: UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
- Authors: Keming Ye, Zhipeng Huang, Canmiao Fu, Qingyang Liu, Jiani Cai, Zheqi Lv, Chen Li, Jing Lyu, Zhou Zhao, Shengyu Zhang,
- Abstract summary: We introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks.
- Score: 43.59555184340113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in Image Editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research.
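The generate-then-verify pipeline described in the abstract can be sketched as a simple filtering loop: an end-to-end editing model produces candidate edits, and a dual-task verifier both rejects failures and recaptions the instructions of accepted samples. The sketch below is a minimal illustration under assumed interfaces; the function names, data fields, and pass/fail logic are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of a generate-then-verify data pipeline.
# edit_model and verify are toy stand-ins for the end-to-end editor
# and the Qwen-Verify-style dual-task verifier described in the abstract.
from dataclasses import dataclass


@dataclass
class EditSample:
    instruction: str
    source_image: str   # path or ID of the input image
    edited_image: str   # path or ID of the model's output


def edit_model(instruction: str, source_image: str) -> EditSample:
    """Stand-in for the end-to-end editing model (no real editing here)."""
    return EditSample(instruction, source_image, source_image + "_edited")


def verify(sample: EditSample) -> tuple[bool, str]:
    """Stand-in for a dual-task verifier.

    Returns (passed, recaptioned_instruction): failure detection plus
    instruction recaptioning in one call, mirroring the dual-task design.
    """
    passed = len(sample.instruction.split()) >= 3  # toy failure check
    recaption = sample.instruction.strip().capitalize()
    return passed, recaption


def build_dataset(instructions: list[str], images: list[str]) -> list[EditSample]:
    """Generate edits, then keep only samples that pass post-verification."""
    dataset = []
    for inst, img in zip(instructions, images):
        sample = edit_model(inst, img)
        passed, recaption = verify(sample)
        if passed:
            sample.instruction = recaption  # replace with recaptioned text
            dataset.append(sample)
    return dataset
```

The key design point this illustrates is that verification happens once, after generation, rather than being interleaved with a multi-tool chain, so errors cannot propagate between intermediate tools.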
Related papers
- Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition [12.731093427395985]
ADAMAB is an efficient embedding calibration framework for few-shot pattern recognition. Our experiments demonstrate the superior performance of ADAMAB, with up to 40% accuracy improvement.
arXiv Detail & Related papers (2026-02-22T23:39:21Z) - FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models [17.64873155970997]
We present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping of their computation with communication. We fully convert a series of state-of-the-art models varying from 16B to 109B parameters to enable overlapping of their communication. In addition to demonstrating retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations.
arXiv Detail & Related papers (2025-11-14T17:25:14Z) - ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset [43.45582911794623]
We introduce ToolMind, a high-quality tool-agentic dataset with 160k synthetic data instances. We employ fine-grained turn-level filtering to remove erroneous or suboptimal steps. Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks.
arXiv Detail & Related papers (2025-11-12T13:01:23Z) - SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge [7.655956608192742]
We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models.
arXiv Detail & Related papers (2025-09-09T17:53:58Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors [27.848233831749216]
WUDI-Merging (Whoever started the interference shoUld enD It) is a model merging method that eliminates interference without any additional data or rescaling coefficients. Comprehensive empirical evaluations across vision and language benchmarks demonstrate our method's superiority.
arXiv Detail & Related papers (2025-03-11T07:01:35Z) - Mask Factory: Towards High-quality Synthetic Data Generation for Dichotomous Image Segmentation [70.95380821618711]
Dichotomous Image Segmentation (DIS) tasks require highly precise annotations. Current generative models and techniques struggle with the issues of scene deviations, noise-induced errors, and limited training sample variability. We introduce a novel approach, which provides a scalable solution for generating diverse and precise datasets.
arXiv Detail & Related papers (2024-12-26T06:37:25Z) - NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text, and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z) - G-SPEED: General SParse Efficient Editing MoDel [25.48360227520061]
General SParse Efficient Editing MoDel (G-SPEED)
arXiv Detail & Related papers (2023-10-16T15:01:18Z) - Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can arise from biases in data acquisition rather than from the underlying task.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z) - Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms baseline models on a real-world (noisy) corpus but also improves robustness, producing high-quality results in noisy environments.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.