Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment
- URL: http://arxiv.org/abs/2406.11334v1
- Date: Mon, 17 Jun 2024 08:48:02 GMT
- Title: Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment
- Authors: Chao Wen, Jacqueline Staub, Adish Singla
- Abstract summary: The benchmark comprises 85 real-world tasks from the Mini-level of the XLogoOnline environment.
We develop a fine-tuning pipeline to boost the performance of models.
We showcase that a fine-tuned Llama3-8B drastically outperforms GPT-4V and Llama3-70B models.
- Score: 23.756311527978486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language and multimodal models have shown remarkable successes on various benchmarks focused on specific skills such as general-purpose programming, natural language understanding, math word problem-solving, and visual question answering. However, it is unclear how well these models perform on tasks that require a combination of these skills. In this paper, we curate a novel program synthesis benchmark based on the XLogoOnline visual programming environment. The benchmark comprises 85 real-world tasks from the Mini-level of the XLogoOnline environment, each requiring a combination of different skills such as spatial planning, basic programming, and logical reasoning. Our evaluation shows that current state-of-the-art models like GPT-4V and Llama3-70B struggle to solve these tasks, achieving only 20% and 2.35% success rates. Next, we develop a fine-tuning pipeline to boost the performance of models by leveraging a large-scale synthetic training dataset with over 80000 tasks. Moreover, we showcase how emulator-driven feedback can be used to design a curriculum over training data distribution. We showcase that a fine-tuned Llama3-8B drastically outperforms GPT-4V and Llama3-70B models, and provide an in-depth analysis of the models' expertise across different skill dimensions. We will publicly release the benchmark for future research on program synthesis in visual programming.
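The abstract describes using emulator-driven feedback to design a curriculum over the synthetic training data before fine-tuning. The paper does not include code here, so the following is only a minimal sketch of that general idea, assuming a stub emulator and a simple failure-rate difficulty heuristic; `Task`, `run_emulator`, and `build_curriculum` are illustrative names, not the authors' released API.

```python
# Minimal sketch (not the authors' released code) of ordering synthetic
# tasks easy-to-hard using emulator feedback on sampled model programs.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    task_id: int
    grid: str            # serialized XLogoOnline-style grid (placeholder)
    reference_code: str  # ground-truth program for the synthetic task


def run_emulator(task: Task, program: str) -> bool:
    """Stub: execute `program` on the task's grid and report success.
    A real pipeline would invoke the XLogoOnline emulator here."""
    return program == task.reference_code  # trivial stand-in check


def difficulty(task: Task, model_samples: List[str]) -> float:
    """Fraction of sampled model programs the emulator rejects:
    a higher value means the task is harder for the current model."""
    failures = sum(not run_emulator(task, p) for p in model_samples)
    return failures / max(len(model_samples), 1)


def build_curriculum(tasks: List[Task],
                     samples_per_task: Dict[int, List[str]]) -> List[Task]:
    """Order tasks from easiest to hardest based on emulator feedback."""
    scored = [(difficulty(t, samples_per_task[t.task_id]), t) for t in tasks]
    scored.sort(key=lambda pair: pair[0])
    return [t for _, t in scored]


if __name__ == "__main__":
    tasks = [Task(0, "grid-a", "fd 2"),
             Task(1, "grid-b", "repeat 4 [fd 1 rt 90]")]
    # Pretend the model produced these candidate programs for each task.
    samples = {0: ["fd 2", "fd 1"], 1: ["fd 4", "rt 90"]}
    for t in build_curriculum(tasks, samples):
        print(t.task_id, "difficulty:", difficulty(t, samples[t.task_id]))
```

The ordered task list would then feed a standard fine-tuning loop; the actual pipeline, data format, and scheduling used in the paper may differ.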
Related papers
- In-Context Code-Text Learning for Bimodal Software Engineering [26.0027882745058]
Bimodal software analysis initially appeared to be within reach with the advent of large language models.
We postulate that in-context learning for the code-text bimodality is a promising avenue.
We consider a diverse dataset encompassing 23 software engineering tasks, which we transform in an in-context learning format.
arXiv Detail & Related papers (2024-10-08T19:42:00Z) - UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling [22.885385107905222]
We introduce UniBench, a unified implementation of 50+ vision-language model (VLM) benchmarks.
We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models.
We also release an easy-to-run UniBench code-base with the full set of 50+ benchmarks and comparisons across 59 models as well as a distilled set of benchmarks that runs in 5 minutes on a single GPU.
arXiv Detail & Related papers (2024-08-09T01:41:05Z) - Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming [22.344985623878408]
State-of-the-art models like GPT-4o and Llama3 barely match the performance of an average school student.
We fine-tune these models using a novel synthetic data generation methodology.
We will release the full implementation and datasets to facilitate further research on enhancing computational thinking in generative models.
arXiv Detail & Related papers (2024-06-14T10:02:52Z) - CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations [61.21923643289266]
Chain of Manipulations is a mechanism that enables Vision-Language Models to solve problems step-by-step with evidence.
After training, models can solve various visual problems by actively eliciting intrinsic manipulations (e.g., grounding, zooming in) without relying on external tools.
Our trained model, CogCoM, achieves state-of-the-art performance across 9 benchmarks from 4 categories.
arXiv Detail & Related papers (2024-02-06T18:43:48Z) - De-fine: Decomposing and Refining Visual Programs with Auto-Feedback [75.62712247421146]
De-fine is a training-free framework that decomposes complex tasks into simpler subtasks and refines programs through auto-feedback.
Our experiments across various visual tasks show that De-fine creates more robust programs.
arXiv Detail & Related papers (2023-11-21T06:24:09Z) - Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code [24.936022005837415]
We review the recent advancements in software engineering with language models, covering 70+ models, 40+ evaluation tasks, 180+ datasets, and 900 related works.
We break down code processing models into general language models represented by the GPT family and specialized models that are specifically pretrained on code.
We also go beyond programming and review LLMs' application in other software engineering activities including requirement engineering, testing, deployment, and operations.
arXiv Detail & Related papers (2023-11-14T08:34:26Z) - Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z) - NEVIS'22: A Stream of 100 Tasks Sampled from 30 Years of Computer Vision Research [96.53307645791179]
We introduce the Never-Ending VIsual-classification Stream (NEVIS'22), a benchmark consisting of a stream of over 100 visual classification tasks.
Despite being limited to classification, the resulting stream has a rich diversity of tasks, ranging from OCR to texture analysis, scene recognition, and so forth.
Overall, NEVIS'22 poses an unprecedented challenge for current sequential learning approaches due to the scale and diversity of tasks.
arXiv Detail & Related papers (2022-11-15T18:57:46Z) - How to Design Sample and Computationally Efficient VQA Models [53.65668097847456]
We find that representing the text as probabilistic programs and images as object-level scene graphs best satisfy these desiderata.
We extend existing models to leverage these soft programs and scene graphs to train on question answer pairs in an end-to-end manner.
arXiv Detail & Related papers (2021-03-22T01:48:16Z) - Reactive Long Horizon Task Execution via Visual Skill and Precondition Models [59.76233967614774]
We describe an approach for sim-to-real training that can accomplish unseen robotic tasks using models learned in simulation to ground components of a simple task planner.
We show an increase in success rate from 91.6% to 98% in simulation and from 10% to 80% in the real world, compared with naive baselines.
arXiv Detail & Related papers (2020-11-17T15:24:01Z)