Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
- URL: http://arxiv.org/abs/2406.09403v3
- Date: Mon, 11 Nov 2024 00:54:32 GMT
- Title: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
- Authors: Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, Ranjay Krishna
- Abstract summary: Sketchpad is a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad.
It enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning.
Sketchpad substantially improves performance on all tasks over strong base models with no sketching.
- Score: 139.9581209765338
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn. Different from prior work, which uses text-to-image models to enable LMs to draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning. Sketchpad can also use specialist vision models during the sketching process (e.g., draw bounding boxes with object detection models, draw masks with segmentation models), to further enhance visual perception and reasoning. We experiment with a wide range of math tasks (including geometry, functions, graphs, and chess) and complex visual reasoning tasks. Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All code and data are available at https://visualsketchpad.github.io/.
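As a concrete illustration of the framework described in the abstract, below is a minimal, hypothetical sketch of a Sketchpad-style tool loop in Python. The helper names (`call_lm`, `detect_objects`, `draw_boxes`) and the action format are assumptions for illustration only, not the released Visual Sketchpad code: the multimodal LM alternates between textual reasoning and requesting a sketching tool, and each rendered sketch is appended to the context as a new visual artifact the model can inspect.

```python
# Hypothetical sketch of a Sketchpad-style tool loop (not the authors' released code).
# call_lm() and detect_objects() are placeholders for a multimodal LM client and a
# specialist object detector; plug in real backends to run this end to end.
from PIL import Image, ImageDraw


def call_lm(context):
    """Placeholder: send the interleaved (kind, payload) context to a multimodal LM
    and parse its reply into an action dict, e.g. {"tool": "detect", "query": "..."}
    or {"answer": "..."}."""
    raise NotImplementedError("plug in your multimodal LM client here")


def detect_objects(image, query):
    """Placeholder: return bounding boxes (x0, y0, x1, y1) for `query` in `image`."""
    raise NotImplementedError("plug in an object detector here")


def draw_boxes(image, boxes):
    """Render detector boxes onto a copy of the image -- the 'sketch'."""
    sketch = image.copy()
    canvas = ImageDraw.Draw(sketch)
    for box in boxes:
        canvas.rectangle(box, outline="red", width=3)
    return sketch


def sketchpad_loop(question, image, max_steps=5):
    """The LM alternates between textual reasoning and drawing until it answers."""
    context = [("text", question), ("image", image)]
    for _ in range(max_steps):
        action = call_lm(context)
        if "answer" in action:
            return action["answer"]
        if action.get("tool") == "detect":
            boxes = detect_objects(image, action["query"])
            sketch = draw_boxes(image, boxes)
            # The new visual artifact is fed back so the LM can reason over it.
            context.append(("image", sketch))
    return None  # no answer within the step budget
```

Feeding the rendered sketch back into the context is the key design choice: the model plans over visual artifacts it has created itself, rather than over text-only intermediate steps.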
Related papers
- Interactive Sketchpad: A Multimodal Tutoring System for Collaborative, Visual Problem-Solving [25.22658210339668]
This paper introduces Interactive Sketchpad, a tutoring system that combines language-based explanations with interactive visualizations to enhance learning.
User studies conducted on math problems such as geometry and calculus demonstrate that Interactive Sketchpad leads to improved task comprehension, problem-solving accuracy, and engagement levels.
arXiv Detail & Related papers (2025-02-12T00:59:25Z)
- SketchTriplet: Self-Supervised Scenarized Sketch-Text-Image Triplet Generation [6.39528707908268]
There continues to be a lack of large-scale paired datasets for scene sketches.
We propose a self-supervised method for scene sketch generation that does not rely on any existing scene sketch.
We contribute a large-scale dataset centered around scene sketches, comprising highly semantically consistent "text-sketch-image" triplets.
arXiv Detail & Related papers (2024-05-29T06:43:49Z)
- It's All About Your Sketch: Democratising Sketch Control in Diffusion Models [114.73766136068357]
This paper unravels the potential of sketches for diffusion models, addressing the deceptive promise of direct sketch control in generative AI.
Importantly, we democratise the process, enabling amateur sketches to generate precise images, living up to the commitment of "what you sketch is what you get".
arXiv Detail & Related papers (2024-03-12T01:05:25Z)
- SketchXAI: A First Look at Explainability for Human Sketches [104.13322289903577]
This paper introduces human sketches to the landscape of XAI (Explainable Artificial Intelligence).
We argue that sketch, as a "human-centred" data form, represents a natural interface to study explainability.
We design a sketch encoder that accommodates the intrinsic properties of strokes: shape, location, and order.
arXiv Detail & Related papers (2023-04-23T20:28:38Z)
- Picture that Sketch: Photorealistic Image Generation from Abstract Sketches [109.69076457732632]
Given an abstract, deformed, ordinary sketch from untrained amateurs like you and me, this paper turns it into a photorealistic image.
We do not dictate an edgemap-like sketch to start with, but aim to work with abstract free-hand human sketches.
In doing so, we essentially democratise the sketch-to-photo pipeline, "picturing" a sketch regardless of how good you sketch.
arXiv Detail & Related papers (2023-03-20T14:49:03Z) - Towards Practicality of Sketch-Based Visual Understanding [15.30818342202786]
Sketches have been used to conceptualise and depict visual objects from pre-historic times.
This thesis aims to progress sketch-based visual understanding towards greater practicality.
arXiv Detail & Related papers (2022-10-27T03:12:57Z)
- I Know What You Draw: Learning Grasp Detection Conditioned on a Few Freehand Sketches [74.63313641583602]
We propose a method to generate a potential grasp configuration relevant to the sketch-depicted objects.
Our model is trained and tested in an end-to-end manner, which makes it easy to implement in real-world applications.
arXiv Detail & Related papers (2022-05-09T04:23:36Z)
- SketchDesc: Learning Local Sketch Descriptors for Multi-view Correspondence [68.63311821718416]
We study the problem of multi-view sketch correspondence, where we take as input multiple freehand sketches with different views of the same object.
This problem is challenging since the visual features of corresponding points at different views can be very different.
We take a deep learning approach and learn a novel local sketch descriptor from data.
arXiv Detail & Related papers (2020-01-16T11:31:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.