More Robots are Coming: Large Multimodal Models (ChatGPT) can Solve
Visually Diverse Images of Parsons Problems
- URL: http://arxiv.org/abs/2311.04926v1
- Date: Fri, 3 Nov 2023 14:47:17 GMT
- Title: More Robots are Coming: Large Multimodal Models (ChatGPT) can Solve
Visually Diverse Images of Parsons Problems
- Authors: Irene Hou, Owen Man, Sophie Mettille, Sebastian Gutierrez, Kenneth
Angelikas, Stephen MacNeil
- Abstract summary: We evaluate the performance of two large multimodal models on visual assignments.
GPT-4V solved 96.7% of these visual problems, struggling minimally with a single Parsons problem.
Bard performed poorly, solving only 69.2% of problems and struggling with common issues like hallucinations and refusals.
- Score: 0.4660328753262075
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The advent of large language models is reshaping computing education. Recent
research has demonstrated that these models can produce better explanations
than students, answer multiple-choice questions at or above the class average,
and generate code that can pass automated tests in introductory courses. These
capabilities have prompted instructors to rapidly adapt their courses and
assessment methods to accommodate changes in learning objectives and the
potential for academic integrity violations. While some scholars have advocated
for the integration of visual problems as a safeguard against the capabilities
of language models, new multimodal language models now have vision and language
capabilities that may allow them to analyze and solve visual problems. In this
paper, we evaluate the performance of two large multimodal models on visual
assignments, with a specific focus on Parsons problems presented across diverse
visual representations. Our results show that GPT-4V solved 96.7% of these
visual problems, struggling minimally with a single Parsons problem.
Conversely, Bard performed poorly, solving only 69.2% of problems and
struggling with common issues like hallucinations and refusals. These findings
suggest that merely transitioning to visual programming problems might not be a
panacea to issues of academic integrity in the generative AI era.
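For context, a Parsons problem presents a correct program as shuffled code fragments that the learner must rearrange (and, in some variants, indent) rather than write from scratch; in this study, the problems were additionally rendered as images in diverse visual styles. The sketch below is purely illustrative and is not drawn from the paper's problem set; the function, block wording, and grading check are assumptions.

```python
# Illustrative Parsons problem (hypothetical; not taken from the paper's dataset).
# The learner sees the blocks in scrambled order and must arrange them into a
# working program; in the study, such problems were presented as images.

scrambled_blocks = [
    "            total += n",
    "def sum_positive(numbers):",
    "    return total",
    "    for n in numbers:",
    "    total = 0",
    "        if n > 0:",
]

# One correct top-to-bottom arrangement of the same blocks.
solution = [
    "def sum_positive(numbers):",
    "    total = 0",
    "    for n in numbers:",
    "        if n > 0:",
    "            total += n",
    "    return total",
]

# A simple automated check: assemble the ordering and run a test case.
namespace = {}
exec("\n".join(solution), namespace)
assert namespace["sum_positive"]([1, -2, 3]) == 4
assert sorted(scrambled_blocks) == sorted(solution)  # same blocks, different order
```

Because the models in the study received such problems as images rather than text, solving them requires first reading the code out of the picture and then reasoning about its order, which is why the evaluation targets multimodal (vision plus language) capability.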
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- Multi-qubit state visualizations to support problem solving - a pilot study [1.8879980022743639]
We compare students' performance, time taken, and cognitive load when solving problems using the mathematical-symbolic Dirac notation alone versus accompanied by the circle notation or the dimensional circle notation, in single- and multi-qubit systems.
Although little overall difference in students' performance can be detected across the presented representations, we observe that problem-solving performance is student- and context-dependent.
arXiv Detail & Related papers (2024-06-24T11:46:35Z)
- CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations [61.21923643289266]
Chain of Manipulations is a mechanism that enables Vision-Language Models to solve problems step-by-step with evidence.
After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) actively without involving external tools.
Our trained model, CogCoM, achieves state-of-the-art performance across 9 benchmarks from 4 categories.
arXiv Detail & Related papers (2024-02-06T18:43:48Z)
- Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences.
We evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
arXiv Detail & Related papers (2024-01-11T18:58:36Z)
- Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond [7.760124498553333]
We study whether vision-language models execute vision and language tasks consistently or independently.
We introduce a systematic framework that quantifies the capability disparities between different modalities in the multi-modal setting.
We introduce "Vision Description Prompting," a method that effectively improves performance in challenging vision-related tasks.
arXiv Detail & Related papers (2023-10-19T06:45:11Z)
- Multimodal Analysis Of Google Bard And GPT-Vision: Experiments In Visual Reasoning [0.0]
We subject Google Bard and GPT-Vision to 64 visual tasks spanning categories like "Visual Situational Reasoning" and "Next Scene Prediction".
Our findings spotlight both vision-language models' limitations.
arXiv Detail & Related papers (2023-08-17T03:14:00Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem [60.0878532426877]
We propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration.
Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents.
Experimental results on two diagnostic VQA-CP benchmark datasets clearly demonstrate its effectiveness.
arXiv Detail & Related papers (2022-07-24T23:50:52Z)