An Empirical Investigation into the Use of Image Captioning for
Automated Software Documentation
- URL: http://arxiv.org/abs/2301.01224v1
- Date: Tue, 3 Jan 2023 17:15:18 GMT
- Title: An Empirical Investigation into the Use of Image Captioning for
Automated Software Documentation
- Authors: Kevin Moran, Ali Yachnes, George Purnell, Junayed Mahmud, Michele
Tufano, Carlos Bernal-Cárdenas, Denys Poshyvanyk, Zach H'Doubler
- Abstract summary: This paper investigates the connection between Graphical User Interfaces and functional, natural language descriptions of software.
We collect, analyze, and open source a large dataset of functional GUI descriptions consisting of 45,998 descriptions for 10,204 screenshots from popular Android applications.
To gain insight into the representational potential of GUIs, we investigate the ability of four Neural Image Captioning models to predict natural language descriptions of varying granularity when provided a screenshot as input.
- Score: 17.47243004709207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing automated techniques for software documentation typically attempt to
reason between two main sources of information: code and natural language.
However, this reasoning process is often complicated by the lexical gap between
more abstract natural language and more structured programming languages. One
potential bridge for this gap is the Graphical User Interface (GUI), as GUIs
inherently encode salient information about underlying program functionality
into rich, pixel-based data representations. This paper offers one of the first
comprehensive empirical investigations into the connection between GUIs and
functional, natural language descriptions of software. First, we collect,
analyze, and open source a large dataset of functional GUI descriptions
consisting of 45,998 descriptions for 10,204 screenshots from popular Android
applications. The descriptions were obtained from human labelers and underwent
several quality control mechanisms. To gain insight into the representational
potential of GUIs, we investigate the ability of four Neural Image Captioning
models to predict natural language descriptions of varying granularity when
provided a screenshot as input. We evaluate these models quantitatively, using
common machine translation metrics, and qualitatively through a large-scale
user study. Finally, we offer lessons learned and a discussion of the potential
shown by multimodal models to enhance future techniques for automated software
documentation.
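As a concrete illustration of the quantitative evaluation described in the abstract, the sketch below scores a model-predicted GUI description against human-written references using BLEU, one of the common machine translation metrics. It is a minimal, self-contained approximation rather than the paper's actual evaluation harness: the example captions, the add-one smoothing, and the sentence-level scoring are illustrative assumptions.

```python
"""Minimal sketch: scoring a predicted functional GUI description against
human-written references with BLEU. Not the paper's evaluation code; the
captions below are hypothetical examples."""
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with clipped n-gram precision and brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        # Add-one smoothing so a single missing n-gram order does not zero the score.
        precision = (clipped + 1) / (sum(cand_counts.values()) + 1)
        log_precisions.append(math.log(precision))
    # Brevity penalty against the reference whose length is closest to the candidate.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical functional descriptions a human labeler might give a login screen.
    references = [
        "this screen lets the user sign in to their account",
        "a login screen where the user enters an email and password",
    ]
    # Stand-in for the output of an image captioning model given the screenshot.
    predicted = "screen that allows the user to sign in with email and password"
    print(f"BLEU-4: {bleu(predicted, references):.3f}")
```

In practice one would compute corpus-level scores with an established implementation (e.g., nltk or sacrebleu) and report several metrics, but the clipped n-gram precision and brevity penalty above capture the core of how such scores are obtained.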
Related papers
- GUI Action Narrator: Where and When Did That Action Take Place? [19.344324166716245]
We develop a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples.
This task presents unique challenges compared to natural scene video captioning.
We introduce our GUI action dataset Act2Cap as well as a simple yet effective framework, GUI Narrator, for GUI video captioning.
arXiv Detail & Related papers (2024-06-19T17:22:11Z)
- User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning [35.211749514733846]
Traditional image captioning methods often overlook the preferences and characteristics of users.
Most existing methods emphasize user context fusion via memory networks or transformers.
We propose a novel personalized image captioning framework that leverages user context to consider personality factors.
arXiv Detail & Related papers (2023-12-08T02:08:00Z)
- UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model [108.85584502396182]
We propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM).
By leveraging the shallow text recognition ability of the MLLM, we fine-tune only 1.2% of the parameters.
Our single model achieves state-of-the-art OCR-free performance in 8 out of 10 visually-situated language understanding tasks.
arXiv Detail & Related papers (2023-10-08T11:33:09Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering larger language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation [26.933683814025475]
We introduce two novel multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually pictured Fruit-ATVC dataset (50K).
These datasets incorporate both visual and text-based inputs and outputs.
To facilitate the accountability of multimodal systems in rejecting human requests, similar to language-based ChatGPT conversations, we introduce specific rules as supervisory signals within the datasets.
arXiv Detail & Related papers (2023-03-10T15:35:11Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Exploring External Knowledge for Accurate Modeling of Visual and Language Problems [2.7190267444272056]
This dissertation focuses on visual and language understanding which involves many challenging tasks.
The state-of-the-art methods for solving these problems usually involve only two parts: source data and target labels.
We developed a methodology in which we first extract external knowledge and then integrate it with the original models.
arXiv Detail & Related papers (2023-01-27T02:01:50Z)
- Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models [58.42146641102329]
We develop a novel semi-parametric language model architecture, Knowledge-in-Context (KiC).
KiC empowers a parametric text-to-text language model with a knowledge-rich external memory.
As a knowledge-rich semi-parametric language model, KiC needs only a much smaller parametric part to achieve superior zero-shot performance on unseen tasks.
arXiv Detail & Related papers (2022-10-28T23:18:43Z)
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding [58.70423899829642]
We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
arXiv Detail & Related papers (2022-10-07T06:42:06Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.