FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents
- URL: http://arxiv.org/abs/2506.01520v1
- Date: Mon, 02 Jun 2025 10:34:57 GMT
- Title: FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents
- Authors: Bobo Li, Yuheng Wang, Hao Fei, Juncheng Li, Wei Ji, Mong-Li Lee, Wynne Hsu
- Abstract summary: Current online form-filling tools are largely rule-based and lack generalizable, generative capabilities. We propose FormFactory, an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions.
- Score: 36.11725924594441
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with "one click", existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory, an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities. We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.
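The backend evaluation module described above scores model outputs against ground-truth field values, and the paper reports that no evaluated MLLM surpasses 5% accuracy. Below is a minimal sketch of what such field-level and form-level scoring could look like; the flat field-to-value dictionaries, the exact-match criterion, and the function names are illustrative assumptions, not FormFactory's actual evaluation code.

```python
# Hypothetical sketch of field-level / form-level accuracy scoring for a
# form-filling benchmark. The dict-based representation and the exact-match
# criterion are assumptions, not FormFactory's actual evaluation module.

def field_accuracy(predicted: dict, gold: dict) -> float:
    """Fraction of gold fields whose predicted value matches exactly (case-insensitive)."""
    if not gold:
        return 1.0
    correct = sum(
        1 for field, value in gold.items()
        if predicted.get(field, "").strip().lower() == value.strip().lower()
    )
    return correct / len(gold)


def form_accuracy(predicted: dict, gold: dict) -> bool:
    """A form counts as correct only if every field is filled correctly."""
    return field_accuracy(predicted, gold) == 1.0


if __name__ == "__main__":
    gold = {"name": "Ada Lovelace", "email": "ada@example.com", "country": "UK"}
    pred = {"name": "Ada Lovelace", "email": "ada@example.org", "country": "UK"}
    print(field_accuracy(pred, gold))  # 0.666... (two of three fields match)
    print(form_accuracy(pred, gold))   # False (one field is wrong)
```

Under this kind of strict form-level criterion, a single misaligned or misformatted field invalidates the whole form, which is consistent with the low accuracies the abstract reports.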
Related papers
- What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities [56.646832992178105]
We introduce OmniBench, a cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity. We present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate.
arXiv Detail & Related papers (2025-06-10T15:59:38Z) - EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models [65.48902212293903]
We present the Extremely Complex Instruction Following Benchmark (EIFBENCH) for evaluating large language models (LLMs). EIFBENCH includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently. We also propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflows.
arXiv Detail & Related papers (2025-06-10T02:39:55Z) - UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings [9.344107676552408]
We propose UniMoCo, a vision-language model architecture designed for multi-modal embedding tasks. We develop a specialized training strategy to align embeddings from both original and modality-completed inputs. Experiments show that UniMoCo outperforms previous methods while demonstrating consistent robustness across diverse settings.
arXiv Detail & Related papers (2025-05-17T03:53:11Z) - OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z) - MVIP -- A Dataset and Methods for Application Oriented Multi-View and Multi-Modal Industrial Part Recognition [0.27309692684728604]
MVIP is a novel dataset for multi-modal and multi-view application-oriented industrial part recognition. Our main goal with MVIP is to study and push the transferability of various state-of-the-art methods within related downstream tasks.
arXiv Detail & Related papers (2025-02-21T13:22:29Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy [2.294223504228228]
Multimodal learning, a rapidly evolving field in artificial intelligence, seeks to construct more versatile and robust systems. Inspired by the human ability to assimilate information through many senses, this approach enables applications such as text-to-video conversion, visual question answering, and image captioning. Recent developments in datasets that support multimodal large language models (MLLMs) are highlighted in this overview.
arXiv Detail & Related papers (2024-12-23T18:15:19Z) - GUI Agents with Foundation Models: A Comprehensive Survey [91.97447457550703]
This survey consolidates recent research on (M)LLM-based GUI agents. We identify key challenges and propose future research directions. We hope this survey will inspire further advancements in the field of (M)LLM-based GUI agents.
arXiv Detail & Related papers (2024-11-07T17:28:10Z) - Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks [49.59854479079552]
MEGA-Bench is an evaluation suite that scales multimodal evaluation to over 500 real-world tasks. We collected 505 tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space.
arXiv Detail & Related papers (2024-10-14T14:42:12Z) - PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation. Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts. We develop an automated text-to-poster system that generates editable posters based on users' design intentions.
arXiv Detail & Related papers (2024-06-05T03:05:52Z) - Semantic Constraint Inference for Web Form Test Generation [6.0759036120654315]
We introduce an innovative approach, called FormNexus, for automated web form test generation.
FormNexus emphasizes deriving semantic insights from individual form elements and the relations among them (a hypothetical sketch of constraint-driven input generation follows this list).
We show that FormNexus combined with GPT-4 achieves 89% coverage in form submission states.
arXiv Detail & Related papers (2024-02-01T19:10:05Z)
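The FormNexus entry above describes inferring semantic constraints from form elements to drive test generation. The sketch below illustrates, under stated assumptions, how such constraints might be turned into valid and invalid candidate inputs for a single field; the FieldConstraint schema, the hard-coded candidate values, and the function names are illustrative, not the paper's actual method.

```python
# Rough, hypothetical sketch of constraint-driven web-form test input
# generation in the spirit of semantic constraint inference (FormNexus).
# The FieldConstraint schema and candidate values are illustrative
# assumptions, not the paper's actual implementation.
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldConstraint:
    name: str
    input_type: str = "text"       # e.g. "text", "email", "number"
    required: bool = False
    max_length: Optional[int] = None


def generate_inputs(c: FieldConstraint) -> dict:
    """Return candidate valid and invalid values for a single form field."""
    valid, invalid = [], []
    if c.input_type == "email":
        valid.append("user@example.com")
        invalid.append("not-an-email")
    elif c.input_type == "number":
        valid.append("42")
        invalid.append("forty-two")
    else:
        valid.append("sample text")
    if c.required:
        invalid.append("")                          # blank should be rejected
    if c.max_length is not None:
        invalid.append("x" * (c.max_length + 1))    # exceeds the length limit
    return {"valid": valid, "invalid": invalid}


if __name__ == "__main__":
    email_field = FieldConstraint(name="email", input_type="email", required=True)
    print(generate_inputs(email_field))
    # {'valid': ['user@example.com'], 'invalid': ['not-an-email', '']}
```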