VIBE: Visual Instruction Based Editor
- URL: http://arxiv.org/abs/2601.02242v1
- Date: Mon, 05 Jan 2026 16:17:20 GMT
- Title: VIBE: Visual Instruction Based Editor
- Authors: Grigorii Alekseenko, Aleksandr Gordeev, Irina Tolstykh, Bulat Suleimanov, Vladimir Dokholyan, Georgii Fedorov, Sergey Yakubson, Aleksandra Tsybina, Mikhail Chernyshov, Maksim Kuprashevich,
- Abstract summary: This paper presents a compact, high-throughput instruction-based image editing pipeline. The pipeline is evaluated on the ImgEdit and GEdit benchmarks. It generates edited images at up to 2K resolution in approximately 4 seconds on an NVIDIA H100 in BF16, without additional inference optimizations or distillation.
- Score: 60.21587335143115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-based image editing is among the fastest developing areas in generative AI. Over the past year, the field has reached a new level, with dozens of open-source models released alongside highly capable commercial systems. However, only a limited number of open-source approaches currently achieve real-world quality. In addition, diffusion backbones, the dominant choice for these pipelines, are often large and computationally expensive for many deployments and research settings, with widely used variants typically containing 6B to 20B parameters. This paper presents a compact, high-throughput instruction-based image editing pipeline that uses a modern 2B-parameter Qwen3-VL model to guide the editing process and the 1.6B-parameter diffusion model Sana1.5 for image generation. Our design decisions across architecture, data processing, training configuration, and evaluation target low-cost inference and strict source consistency while maintaining high quality across the major edit categories feasible at this scale. Evaluated on the ImgEdit and GEdit benchmarks, the proposed method matches or exceeds the performance of substantially heavier baselines, including models with several times as many parameters and higher inference cost, and is particularly strong on edits that require preserving the input image, such as attribute adjustment, object removal, background edits, and targeted replacement. The model fits within 24 GB of GPU memory and generates edited images at up to 2K resolution in approximately 4 seconds on an NVIDIA H100 in BF16, without additional inference optimizations or distillation.
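As a rough illustration of the two-stage design the abstract describes, the sketch below separates instruction understanding (the ~2B-parameter Qwen3-VL stage) from image generation (the ~1.6B-parameter Sana1.5 stage). All class and method names here are hypothetical placeholders for exposition; the paper does not publish this API.

```python
# Conceptual sketch of a VLM-guided editing pipeline, assuming a clean split
# between instruction grounding and diffusion rendering. Names are invented.
from dataclasses import dataclass
from typing import Any

@dataclass
class EditPlan:
    """Hypothetical structured output of the instruction-understanding stage."""
    prompt_embedding: Any   # conditioning produced by the VLM
    region_hint: Any        # optional localization of the edit

class InstructionEncoder:
    """Stands in for the Qwen3-VL stage (assumption: it maps a
    (source image, instruction) pair to conditioning for the generator)."""
    def plan(self, source_image, instruction: str) -> EditPlan:
        raise NotImplementedError  # model-specific

class EditGenerator:
    """Stands in for the Sana1.5 diffusion stage."""
    def generate(self, source_image, plan: EditPlan, resolution: int = 2048):
        raise NotImplementedError  # model-specific

def edit_image(source_image, instruction: str,
               encoder: InstructionEncoder, generator: EditGenerator):
    # Stage 1: the VLM turns the free-form instruction into conditioning
    # grounded in the source image.
    plan = encoder.plan(source_image, instruction)
    # Stage 2: the diffusion model renders the edit, conditioned on both the
    # source image (for strict source consistency) and the plan.
    return generator.generate(source_image, plan, resolution=2048)
```

A plausible reading of the design is that this decoupling is what keeps both components small: the VLM handles instruction grounding while the diffusion model only has to render.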
Related papers
- FireRed-Image-Edit-1.0 Technical Report [30.973736748818826]
FireRed-Image-Edit is a diffusion transformer for instruction-based image editing. It achieves state-of-the-art performance through systematic optimization of data, training methodology, and evaluation design.
arXiv Detail & Related papers (2026-02-12T17:51:44Z)
- I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models [78.62380562116135]
Existing image editing benchmarks suffer from limited task scope, insufficient evaluation dimensions, and heavy reliance on manual annotation. We propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features 10 task categories across both single-image and multi-image editing tasks. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between them across various dimensions.
arXiv Detail & Related papers (2025-12-04T10:44:07Z)
- Taming Flow-based I2V Models for Creative Video Editing [64.67801702413122]
Video editing, which aims to manipulate videos according to user intent, remains an emerging challenge. Most existing image-conditioned video editing methods require inversion with model-specific design or need extensive optimization. We propose IF-V2V, an Inversion-Free method that can adapt off-the-shelf flow-matching-based I2V models for video editing without significant computational overhead.
arXiv Detail & Related papers (2025-09-26T05:57:04Z)
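Since IF-V2V adapts off-the-shelf flow-matching I2V models, a minimal sketch of generic flow-matching sampling (Euler integration of a learned velocity field) may help situate it. This is standard background, not the paper's inversion-free editing procedure, and the time convention (noise at t=0, data at t=1) is an assumption.

```python
import torch

def flow_matching_sample(velocity_model, x, num_steps=50):
    """Generic Euler sampler for a flow-matching model: integrate the learned
    velocity field v_theta(x_t, t) from t=0 (noise) to t=1 (data).
    `velocity_model` is any callable taking (x, t); this is textbook
    flow-matching sampling, not IF-V2V's editing method."""
    dt = 1.0 / num_steps
    t = torch.zeros(x.shape[0], device=x.device)
    for _ in range(num_steps):
        with torch.no_grad():
            v = velocity_model(x, t)  # predicted velocity at (x_t, t)
        x = x + v * dt                # Euler step along the probability flow
        t = t + dt
    return x
```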
- ImgEdit: A Unified Image Editing Dataset and Benchmark [14.185771939071149]
We introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs. ImgEdit surpasses existing datasets in both task novelty and data quality. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance.
arXiv Detail & Related papers (2025-05-26T17:53:33Z)
- Step1X-Edit: A Practical Framework for General Image Editing [64.07202539610576]
We release a state-of-the-art image editing model called Step1X-Edit. It provides performance comparable to that of closed-source models such as GPT-4o and Gemini2 Flash. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions.
arXiv Detail & Related papers (2025-04-24T17:25:12Z)
- Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing [42.73883397041092]
We propose a novel approach built upon a modified diffusion sampling process that employs a guidance mechanism.
In this work, we explore a self-guidance technique to preserve the overall structure of the input image.
We show through human evaluation and quantitative analysis that the proposed method produces the desired edits.
arXiv Detail & Related papers (2024-09-02T15:21:46Z)
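Guide-and-Rescale modifies the sampling step to balance an edit-driving guidance direction against a structure-preserving self-guidance term. The sketch below shows only a generic combine-and-rescale pattern suggested by the title; the paper's actual energy functions and rescaling scheme are not reproduced, and `structure_grad` plus the norm-matching rescale are assumptions for illustration.

```python
import torch

def guided_noise_prediction(eps_uncond, eps_cond, structure_grad, cfg_scale=7.5):
    """Illustrative guidance combination in the spirit of Guide-and-Rescale:
    add a structure-preserving term to the usual classifier-free guidance
    direction, rescaled so it cannot overpower the edit signal. All inputs
    are noise-prediction tensors of the same shape."""
    # Standard classifier-free guidance direction.
    cfg_direction = eps_cond - eps_uncond
    # Assumption: a simple norm-matching rescale of the structure term;
    # the paper defines its own rescaling scheme.
    scale = cfg_direction.norm() / (structure_grad.norm() + 1e-8)
    guidance = cfg_direction + scale * structure_grad
    return eps_uncond + cfg_scale * guidance
```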
- Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks.
TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z)
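TODInv builds on diffusion inversion. As background, here is plain deterministic DDIM inversion, which maps a real image to a latent that approximately regenerates it under the same sampler; TODInv's reciprocal optimization of inversion and editing is not shown.

```python
import torch

@torch.no_grad()
def ddim_invert(eps_model, x0, timesteps, alphas_cumprod):
    """Standard DDIM inversion: run the deterministic DDIM update in reverse,
    from a clean image x0 toward a high-noise latent x_T. `timesteps` is an
    ascending sequence of integer steps (low noise -> high noise) and
    `alphas_cumprod` is the usual cumulative-alpha schedule tensor."""
    x = x0
    for t_prev, t in zip(timesteps[:-1], timesteps[1:]):
        a_prev, a_t = alphas_cumprod[t_prev], alphas_cumprod[t]
        eps = eps_model(x, t_prev)  # noise prediction at the current step
        # Predict the clean image implied by the current sample.
        x0_pred = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        # Re-noise deterministically to the next (noisier) timestep.
        x = a_t.sqrt() * x0_pred + (1 - a_t).sqrt() * eps
    return x
```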
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit a given synthetic or real image to meet users' specific requirements.
Recent significant advances in this field are based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing [38.13162627140172]
HQ-Edit is a high-quality instruction-based image editing dataset with around 200,000 edits.
To ensure its high quality, diverse examples are first collected online, expanded, and then used to create high-quality diptychs.
HQ-Edit's high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing models.
arXiv Detail & Related papers (2024-04-15T17:59:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.