Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming
- URL: http://arxiv.org/abs/2406.09891v1
- Date: Fri, 14 Jun 2024 10:02:52 GMT
- Title: Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming
- Authors: Victor-Alexandru Pădurean, Adish Singla
- Abstract summary: State-of-the-art models like GPT-4o and Llama3 barely match the performance of an average school student.
We fine-tune these models using a novel synthetic data generation methodology.
We will release the full implementation and datasets to facilitate further research on enhancing computational thinking in generative models.
- Score: 22.344985623878408
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models have demonstrated human-level proficiency in various benchmarks across domains like programming, natural sciences, and general knowledge. Despite these promising results on competitive benchmarks, they still struggle with seemingly simple problem-solving tasks typically carried out by elementary-level students. How do state-of-the-art models perform on standardized tests designed to assess computational thinking and problem-solving skills at schools? In this paper, we curate a novel benchmark involving computational thinking tests grounded in elementary visual programming domains. Our initial results show that state-of-the-art models like GPT-4o and Llama3 barely match the performance of an average school student. To further boost the performance of these models, we fine-tune them using a novel synthetic data generation methodology. The key idea is to develop a comprehensive dataset using symbolic methods that capture different skill levels, ranging from recognition of visual elements to multi-choice quizzes to synthesis-style tasks. We showcase how various aspects of symbolic information in synthetic data help improve fine-tuned models' performance. We will release the full implementation and datasets to facilitate further research on enhancing computational thinking in generative models.
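To make the data-generation idea concrete, below is a minimal sketch of how symbolic methods could emit tasks at different skill levels for a toy grid-navigation domain; the domain, task formats, and all function names are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch (not the released pipeline): symbolically generate a tiny grid world,
# derive its ground-truth program, and emit training examples at three skill levels.
# The domain, task wording, and names are illustrative assumptions.
import random

def sample_world(size=4, seed=None):
    """Sample a start/goal pair on an otherwise empty grid."""
    rng = random.Random(seed)
    start = (0, 0)
    goal = (rng.randint(0, size - 1), rng.randint(1, size - 1))
    return {"size": size, "start": start, "goal": goal}

def solve(world):
    """Symbolic solver: a straight-line program reaching the goal."""
    (r0, c0), (r1, c1) = world["start"], world["goal"]
    return ["move_down"] * (r1 - r0) + ["move_right"] * (c1 - c0)

def make_examples(world):
    """Emit one example per skill level: recognition, multiple-choice, synthesis."""
    program = solve(world)
    recognition = {
        "level": "recognition",
        "prompt": f"How many rows does a {world['size']}x{world['size']} grid have?",
        "answer": str(world["size"]),
    }
    # A shuffled wrong option for the quiz; fall back to a longer program if needed.
    distractor = program[::-1] if len(set(program)) > 1 else program + ["move_right"]
    options = [program, distractor]
    random.shuffle(options)
    quiz = {
        "level": "multiple-choice",
        "prompt": f"Which program reaches {world['goal']} from {world['start']}? "
                  f"A) {options[0]}  B) {options[1]}",
        "answer": "A" if options[0] == program else "B",
    }
    synthesis = {
        "level": "synthesis",
        "prompt": f"Write a program that moves from {world['start']} to {world['goal']}.",
        "answer": " ".join(program),
    }
    return [recognition, quiz, synthesis]

if __name__ == "__main__":
    for ex in make_examples(sample_world(seed=0)):
        print(ex["level"], "|", ex["prompt"], "->", ex["answer"])
```

Because the solver is symbolic, ground-truth answers come for free at every skill level, which is what makes large-scale synthetic generation of this kind cheap.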
Related papers
- Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment [23.756311527978486]
The benchmark comprises 85 real-world tasks from the Mini-level of the XLogoOnline environment.
We develop a fine-tuning pipeline to boost the performance of models.
We showcase that a fine-tuned Llama3-8B drastically outperforms GPT-4V and Llama3-70B models.
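For context, a fine-tuning pipeline of this kind typically looks like the following minimal sketch using Hugging Face `transformers` and `peft` with LoRA adapters; the model id, hyperparameters, and the one-record toy dataset are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal LoRA fine-tuning sketch (assumed setup, not the paper's released code).
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"          # gated model; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model with small trainable LoRA adapters.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# Toy synthetic task: prompt and reference program concatenated into one training text.
records = [{"text": "Task: reach (1, 2) from (0, 0).\nProgram: move_down move_right move_right"}]
dataset = Dataset.from_list(records).map(
    lambda r: tokenizer(r["text"], truncation=True, max_length=512),
    remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```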
arXiv Detail & Related papers (2024-06-17T08:48:02Z)
- Computational Models to Study Language Processing in the Human Brain: A Survey [47.81066391664416]
This paper reviews efforts in using computational models for brain research, highlighting emerging trends.
Our analysis reveals that no single model outperforms others on all datasets.
arXiv Detail & Related papers (2024-03-20T08:01:22Z)
- Generative Forests [23.554594285885273]
We focus on generative AI for a type of data that still represents one of the most prevalent forms of data: tabular data.
Our paper introduces a powerful new class of forest-based models fit for such tasks, together with a simple training algorithm that comes with strong convergence guarantees.
Additional experiments on these tasks reveal that our models are strong contenders against diverse state-of-the-art methods.
arXiv Detail & Related papers (2023-08-07T14:58:53Z)
- GLUECons: A Generic Benchmark for Learning Under Constraints [102.78051169725455]
In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision.
We model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints.
arXiv Detail & Related papers (2023-02-16T16:45:36Z)
- Evaluation of Categorical Generative Models -- Bridging the Gap Between Real and Synthetic Data [18.142397311464343]
We introduce an appropriately scalable evaluation method for generative models.
We consider increasingly large probability spaces, which correspond to increasingly difficult modeling tasks.
We validate our evaluation procedure with synthetic experiments on both synthetic generative models and current state-of-the-art categorical generative models.
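As a rough illustration of evaluating over growing probability spaces (a simplified stand-in, not the paper's exact procedure), the sketch below estimates the total variation distance between a known ground-truth categorical distribution and samples drawn from a perturbed "model" as the product space grows.

```python
# Illustrative sketch only: total variation distance between a known ground-truth
# distribution and an empirical "model" distribution over product spaces of growing size.
import itertools, random
from collections import Counter

def tv_distance(p, q, support):
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def run(dims, categories=3, n_samples=20000, seed=0):
    rng = random.Random(seed)
    support = list(itertools.product(range(categories), repeat=dims))
    # Ground truth: uniform; "model": slightly biased toward the all-zeros outcome.
    truth = {x: 1.0 / len(support) for x in support}
    model_weights = [2.0 if x == support[0] else 1.0 for x in support]
    samples = rng.choices(support, weights=model_weights, k=n_samples)
    empirical = {x: c / n_samples for x, c in Counter(samples).items()}
    return tv_distance(truth, empirical, support)

for d in (1, 2, 3, 4):   # larger d = larger probability space = harder modeling task
    print(f"dims={d}  TV~{run(d):.3f}")
```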
arXiv Detail & Related papers (2022-10-28T21:05:25Z)
- Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
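The sketch below illustrates the general idea of knowledge sampling strategies on a toy in-memory knowledge graph; the triples, the two strategies, and the verbalization templates are assumptions for illustration, not the paper's setup.

```python
# Toy sketch: sample triples from a small knowledge graph under two strategies and
# verbalize them into synthetic training sentences for language-model adaptation.
import random
from collections import defaultdict

TRIPLES = [
    ("mug", "UsedFor", "drinking"), ("mug", "AtLocation", "kitchen"),
    ("dog", "CapableOf", "barking"), ("dog", "AtLocation", "yard"),
    ("knife", "UsedFor", "cutting"), ("stove", "AtLocation", "kitchen"),
]

def sample_uniform(triples, k, rng):
    return rng.sample(triples, k)

def sample_relation_balanced(triples, k, rng):
    """Draw roughly the same number of triples per relation type."""
    by_rel = defaultdict(list)
    for t in triples:
        by_rel[t[1]].append(t)
    picks = []
    while len(picks) < k:
        rel = rng.choice(list(by_rel))
        picks.append(rng.choice(by_rel[rel]))
    return picks

def verbalize(triple):
    head, rel, tail = triple
    templates = {"UsedFor": f"A {head} is used for {tail}.",
                 "AtLocation": f"You are likely to find a {head} in the {tail}.",
                 "CapableOf": f"A {head} is capable of {tail}."}
    return templates[rel]

rng = random.Random(0)
for strategy in (sample_uniform, sample_relation_balanced):
    texts = [verbalize(t) for t in strategy(TRIPLES, 4, rng)]
    print(strategy.__name__, "->", texts)
```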
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Synthetic Benchmarks for Scientific Research in Explainable Machine Learning [14.172740234933215]
We release XAI-Bench: a suite of synthetic datasets and a library for benchmarking feature attribution algorithms.
Unlike real-world datasets, synthetic datasets allow the efficient computation of conditional expected values.
We demonstrate the power of our library by benchmarking popular explainability techniques across several evaluation metrics and identifying failure modes for popular explainers.
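To illustrate why synthetic data makes such evaluation tractable (an assumed setup, not the XAI-Bench API), the sketch below computes a conditional expected value exactly under a known feature distribution and checks it against a Monte Carlo estimate.

```python
# Sketch of the key advantage of synthetic data: with a known data distribution and
# model, E[f(x) | x_i = v] is available in closed form, so attribution baselines can
# be verified exactly. All names and the linear model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
w = np.array([2.0, -1.0, 0.5])             # known linear model f(x) = w . x
mu = np.array([0.0, 1.0, -1.0])            # independent Gaussian features, known means

def f(X):
    return X @ w

def conditional_expectation_exact(i, v):
    """E[f(x) | x_i = v] under independent features: replace mu_i by v."""
    m = mu.copy()
    m[i] = v
    return float(m @ w)

def conditional_expectation_mc(i, v, n=100_000):
    X = rng.normal(mu, 1.0, size=(n, 3))
    X[:, i] = v                            # clamp the conditioned feature
    return float(f(X).mean())

print("exact:", conditional_expectation_exact(0, 2.0))
print("monte carlo:", round(conditional_expectation_mc(0, 2.0), 3))
```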
arXiv Detail & Related papers (2021-06-23T17:10:21Z)
- How to Design Sample and Computationally Efficient VQA Models [53.65668097847456]
We find that representing the text as probabilistic programs and images as object-level scene graphs best satisfy these desiderata.
We extend existing models to leverage these soft programs and scene graphs to train on question answer pairs in an end-to-end manner.
arXiv Detail & Related papers (2021-03-22T01:48:16Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve model generalization ability in few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We achieve new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study [81.11161697133095]
We take the NER task as a testbed to analyze the generalization behavior of existing models from different perspectives.
Experiments with in-depth analyses diagnose the bottleneck of existing neural NER models.
As a by-product of this paper, we have open-sourced a project that involves a comprehensive summary of recent NER papers.
arXiv Detail & Related papers (2020-01-12T04:33:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.