Related papers: DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding

DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding

URL: http://arxiv.org/abs/2410.16472v1
Date: Mon, 21 Oct 2024 19:59:04 GMT
Title: DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding
Authors: Manan Suri, Puneet Mathur, Franck Dernoncourt, Rajiv Jain, Vlad I Morariu, Ramit Sawhney, Preslav Nakov, Dinesh Manocha,
Abstract summary: We introduce DocEdit-v2, a novel framework that performs end-to-end document editing by leveraging Large Multimodal Models (LMMs) It consists of three novel components: (1) Doc2Command, which simultaneously localizes edit regions of interest (RoI) and disambiguates user edit requests into edit commands; (2) LLM-based Command Reformulation prompting to tailor edit commands originally intended for specialized software into edit instructions suitable for generalist LMMs; and (3) Moreover, DocEdit-v2 processes these outputs via Large Multimodal Models like GPT-4V and Gemini, to parse the document layout, execute edits on
Score: 128.92659116774374
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Document structure editing involves manipulating localized textual, visual, and layout components in document images based on the user's requests. Past works have shown that multimodal grounding of user requests in the document image and identifying the accurate structural components and their associated attributes remain key challenges for this task. To address these, we introduce the DocEdit-v2, a novel framework that performs end-to-end document editing by leveraging Large Multimodal Models (LMMs). It consists of three novel components: (1) Doc2Command, which simultaneously localizes edit regions of interest (RoI) and disambiguates user edit requests into edit commands; (2) LLM-based Command Reformulation prompting to tailor edit commands originally intended for specialized software into edit instructions suitable for generalist LMMs. (3) Moreover, DocEdit-v2 processes these outputs via Large Multimodal Models like GPT-4V and Gemini, to parse the document layout, execute edits on grounded Region of Interest (RoI), and generate the edited document image. Extensive experiments on the DocEdit dataset show that DocEdit-v2 significantly outperforms strong baselines on edit command generation (2-33%), RoI bounding box detection (12-31%), and overall document editing (1-12\%) tasks.

Related papers

Image Editing As Programs with Diffusion Models [69.05164729625052]
We introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture.<n>IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations.<n>Our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions.
arXiv Detail & Related papers (2025-06-04T16:57:24Z)
Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications [9.795246551841586]
Large Language Models (LLMs) have transformed natural language processing, yet they still struggle with direct text editing tasks. In this work, we introduce a dual approach to enhance LLM editing performance. First, we present InstrEditBench, a high-quality benchmark dataset comprising over 20,000 structured editing tasks. Second, we propose FineEdit, a specialized model trained on this curated benchmark.
arXiv Detail & Related papers (2025-02-19T01:41:44Z)
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea [88.79769371584491]
We present AnyEdit, a comprehensive multi-modal instruction editing dataset. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models.
arXiv Detail & Related papers (2024-11-24T07:02:56Z)
FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction [31.95664918050255]
FreeEdit is a novel approach for achieving reference-based image editing. It can accurately reproduce the visual concept from the reference image based on user-friendly language instructions.
arXiv Detail & Related papers (2024-09-26T17:18:39Z)
An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control [21.624984690721842]
D-Edit is a framework to disentangle the comprehensive image-prompt interaction into several item-prompt interactions. It is based on pretrained diffusion models with cross-attention layers disentangled and adopts a two-step optimization to build item-prompt associations. We demonstrate state-of-the-art results in four types of editing operations including image-based, text-based, mask-based editing, and item removal.
arXiv Detail & Related papers (2024-03-07T20:06:29Z)
Object-aware Inversion and Reassembly for Image Editing [61.19822563737121]
We propose Object-aware Inversion and Reassembly (OIR) to enable object-level fine-grained editing. We use our search metric to find the optimal inversion step for each editing pair when editing an image. Our method achieves superior performance in editing object shapes, colors, materials, categories, etc., especially in multi-object editing scenarios.
arXiv Detail & Related papers (2023-10-18T17:59:02Z)
CCEdit: Creative and Controllable Video Editing via Diffusion Models [58.34886244442608]
CCEdit is a versatile generative video editing framework based on diffusion models. Our approach employs a novel trident network structure that separates structure and appearance control. Our user studies compare CCEdit with eight state-of-the-art video editing methods.
arXiv Detail & Related papers (2023-09-28T15:03:44Z)
Beyond the Chat: Executable and Verifiable Text-Editing with LLMs [87.84199761550634]
Conversational interfaces powered by Large Language Models (LLMs) have recently become a popular way to obtain feedback during document editing. We present InkSync, an editing interface that suggests executable edits directly within the document being edited.
arXiv Detail & Related papers (2023-09-27T00:56:17Z)
CoEdIT: Text Editing by Task-Specific Instruction Tuning [18.824571167583432]
CoEdIT is a state-of-the-art text editing system for writing assistance. It takes instructions from the user specifying the attributes of the desired text, and outputs the edited text. We present a large language model fine-tuned on a diverse collection of task-specific instructions for text editing.
arXiv Detail & Related papers (2023-05-17T00:05:24Z)
Text Editing by Command [82.50904226312451]
A prevailing paradigm in neural text generation is one-shot generation, where text is produced in a single step. We address this limitation with an interactive text generation setting in which the user interacts with the system by issuing commands to edit existing text. We show that our Interactive Editor, a transformer-based model trained on this dataset, outperforms baselines and obtains positive results in both automatic and human evaluations.
arXiv Detail & Related papers (2020-10-24T08:00:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.