StarFlow: Generating Structured Workflow Outputs From Sketch Images
- URL: http://arxiv.org/abs/2503.21889v1
- Date: Thu, 27 Mar 2025 18:04:05 GMT
- Title: StarFlow: Generating Structured Workflow Outputs From Sketch Images
- Authors: Patrice Bechard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian
- Abstract summary: We introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We finetune and benchmark multiple vision-language models to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.
- Score: 10.870956565888545
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams -- including synthetic, manually annotated, and real-world samples -- to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.
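The abstract does not specify the output schema StarFlow targets; as a purely illustrative sketch, a structured workflow emitted by such a model might be serialized as JSON with a trigger and an ordered list of steps, and checked for basic structural validity before execution. All field names below (trigger, steps, action, inputs) are assumptions, not the paper's actual schema:

```python
import json

# Hypothetical structured-workflow output: a trigger node followed by an
# ordered list of action steps. The real StarFlow schema is not shown in
# the abstract; this is an illustrative stand-in.
workflow_json = """
{
  "trigger": {"type": "record_created", "table": "incident"},
  "steps": [
    {"id": 1, "action": "lookup_user", "inputs": {"field": "caller_id"}},
    {"id": 2, "action": "send_email", "inputs": {"to": "{{step1.email}}"}}
  ]
}
"""

def validate_workflow(raw: str) -> bool:
    """Minimal structural check: a trigger plus a non-empty, id-ordered step list."""
    wf = json.loads(raw)
    if "trigger" not in wf or not wf.get("steps"):
        return False
    ids = [s.get("id") for s in wf["steps"]]
    return all("action" in s for s in wf["steps"]) and ids == sorted(ids)

print(validate_workflow(workflow_json))  # True for the sample above
```

A validation pass like this is one simple way an enterprise platform could reject malformed model outputs before attempting to instantiate them as executable workflows.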
Related papers
- Opus: A Large Work Model for Complex Workflow Generation [0.0]
Opus is a framework for generating and optimizing tasks tailored to complex Business Process Outsourcing (BPO) use cases. Our approach generates executables from Intention, defined as the alignment of Client Input, Client Output and Process Directed Context.
arXiv Detail & Related papers (2024-11-30T20:00:41Z)
- Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. We also present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms. We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
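The subsequence-matching idea behind an evaluation protocol like WorfEval can be sketched simply: a predicted chain of workflow nodes passes if it contains every gold node in order, with extra nodes allowed in between. The node names below are hypothetical examples, not drawn from the benchmark:

```python
def is_subsequence(gold, pred):
    """True if every gold node appears in pred in order (gaps allowed)."""
    it = iter(pred)
    # `node in it` advances the iterator, so order is enforced.
    return all(node in it for node in gold)

gold_chain = ["fetch_data", "clean", "train", "evaluate"]
predicted = ["fetch_data", "log", "clean", "train", "cache", "evaluate"]
print(is_subsequence(gold_chain, predicted))  # True
```

Subgraph matching generalizes the same idea from chains to graph-structured workflows, where edges as well as node order must be preserved.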
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
- ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation [87.39861573270173]
We introduce the novel task of prompt-adaptive workflow generation, where the goal is to automatically tailor a workflow to each user prompt.
We propose two LLM-based approaches to tackle this task: a tuning-based method that learns from user-preference data, and a training-free method that uses the LLM to select existing flows.
Our work shows that prompt-dependent flow prediction offers a new pathway to improving text-to-image generation quality, complementing existing research directions in the field.
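As a toy illustration of the training-free direction, flow selection can be framed as scoring existing candidate workflows against the user prompt and picking the best match. The tag-overlap heuristic below is an assumption for illustration only; the paper's actual method uses an LLM to do the selection:

```python
# Training-free flow selection sketch: pick the candidate workflow whose
# tag overlap with the prompt is largest (stand-in for LLM-based ranking).
def select_flow(prompt, flows):
    words = set(prompt.lower().split())
    return max(flows, key=lambda f: len(words & f["tags"]))

flows = [
    {"name": "photoreal", "tags": {"photo", "portrait", "realistic"}},
    {"name": "anime", "tags": {"anime", "manga", "stylized"}},
]
print(select_flow("a realistic portrait photo", flows)["name"])  # photoreal
```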
arXiv Detail & Related papers (2024-10-02T16:43:24Z)
- PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation.
Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts.
We develop an automated text-to-poster system that generates editable posters based on users' design intentions.
arXiv Detail & Related papers (2024-06-05T03:05:52Z)
- Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers [54.83459025465947]
Even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting.
Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools.
We present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples.
arXiv Detail & Related papers (2024-01-03T20:48:47Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.