An Empirical Investigation into the Use of Image Captioning for
Automated Software Documentation
- URL: http://arxiv.org/abs/2301.01224v1
- Date: Tue, 3 Jan 2023 17:15:18 GMT
- Title: An Empirical Investigation into the Use of Image Captioning for
Automated Software Documentation
- Authors: Kevin Moran, Ali Yachnes, George Purnell, Junayed Mahmud, Michele
Tufano, Carlos Bernal-Cárdenas, Denys Poshyvanyk, Zach H'Doubler
- Abstract summary: This paper investigates the connection between Graphical User Interfaces and functional, natural language descriptions of software.
We collect, analyze, and open source a large dataset of functional GUI descriptions consisting of 45,998 descriptions for 10,204 screenshots from popular Android applications.
To gain insight into the representational potential of GUIs, we investigate the ability of four Neural Image Captioning models to predict natural language descriptions of varying granularity when provided a screenshot as input.
- Score: 17.47243004709207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing automated techniques for software documentation typically attempt to
reason between two main sources of information: code and natural language.
However, this reasoning process is often complicated by the lexical gap between
more abstract natural language and more structured programming languages. One
potential bridge for this gap is the Graphical User Interface (GUI), as GUIs
inherently encode salient information about underlying program functionality
into rich, pixel-based data representations. This paper offers one of the first
comprehensive empirical investigations into the connection between GUIs and
functional, natural language descriptions of software. First, we collect,
analyze, and open source a large dataset of functional GUI descriptions
consisting of 45,998 descriptions for 10,204 screenshots from popular Android
applications. The descriptions were obtained from human labelers and underwent
several quality control mechanisms. To gain insight into the representational
potential of GUIs, we investigate the ability of four Neural Image Captioning
models to predict natural language descriptions of varying granularity when
provided a screenshot as input. We evaluate these models quantitatively, using
common machine translation metrics, and qualitatively through a large-scale
user study. Finally, we offer lessons learned and a discussion of the potential
shown by multimodal models to enhance future techniques for automated software
documentation.
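As a concrete illustration of the quantitative evaluation described in the abstract, the sketch below scores a model-predicted GUI description against human-written references using BLEU, one of the common machine translation metrics. It is a minimal, self-contained approximation rather than the paper's actual evaluation harness: the example captions, the add-one smoothing, and the sentence-level scoring are illustrative assumptions.

```python
"""Minimal sketch: scoring a predicted functional GUI description against
human-written references with BLEU. Not the paper's evaluation code; the
captions below are hypothetical examples."""
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with clipped n-gram precision and brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        # Add-one smoothing so a single missing n-gram order does not zero the score.
        precision = (clipped + 1) / (sum(cand_counts.values()) + 1)
        log_precisions.append(math.log(precision))
    # Brevity penalty against the reference whose length is closest to the candidate.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical functional descriptions a human labeler might give a login screen.
    references = [
        "this screen lets the user sign in to their account",
        "a login screen where the user enters an email and password",
    ]
    # Stand-in for the output of an image captioning model given the screenshot.
    predicted = "screen that allows the user to sign in with email and password"
    print(f"BLEU-4: {bleu(predicted, references):.3f}")
```

In practice one would compute corpus-level scores with an established implementation (e.g., nltk or sacrebleu) and report several metrics, but the clipped n-gram precision and brevity penalty above capture the core of how such scores are obtained.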
Related papers
- GUI Action Narrator: Where and When Did That Action Take Place? [19.344324166716245]
We develop a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples.
This task presents unique challenges compared to natural scene video captioning.
We introduce our GUI action dataset Act2Cap as well as a simple yet effective framework, GUI Narrator, for GUI video captioning.
arXiv Detail & Related papers (2024-06-19T17:22:11Z)
- User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning [35.211749514733846]
Traditional image captioning methods often overlook the preferences and characteristics of users.
Most existing methods emphasize user context fusion via memory networks or transformers.
We propose a novel personalized image captioning framework that leverages user context to consider personality factors.
arXiv Detail & Related papers (2023-12-08T02:08:00Z)
- UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model [108.85584502396182]
We propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM).
By leveraging the shallow text recognition ability of the MLLM, we fine-tune only 1.2% of the parameters.
Our single model achieves state-of-the-art OCR-free performance in 8 out of 10 visually-situated language understanding tasks.
arXiv Detail & Related papers (2023-10-08T11:33:09Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering larger language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation [26.933683814025475]
We introduce two novel multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually pictured Fruit-ATVC dataset (50K).
These datasets incorporate both visual and text-based inputs and outputs.
To facilitate the accountability of multimodal systems in rejecting human requests, similar to language-based ChatGPT conversations, we introduce specific rules as supervisory signals within the datasets.
arXiv Detail & Related papers (2023-03-10T15:35:11Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Exploring External Knowledge for Accurate Modeling of Visual and Language Problems [2.7190267444272056]
This dissertation focuses on visual and language understanding which involves many challenging tasks.
The state-of-the-art methods for solving these problems usually involve only two parts: source data and target labels.
We developed a methodology in which we first extract external knowledge and then integrate it with the original models.
arXiv Detail & Related papers (2023-01-27T02:01:50Z)
- Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models [58.42146641102329]
We develop a novel semi-parametric language model architecture, Knowledge-in-Context (KiC).
KiC empowers a parametric text-to-text language model with a knowledge-rich external memory.
As a knowledge-rich semi-parametric language model, KiC needs only a much smaller parametric part to achieve superior zero-shot performance on unseen tasks.
arXiv Detail & Related papers (2022-10-28T23:18:43Z)
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding [58.70423899829642]
We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
arXiv Detail & Related papers (2022-10-07T06:42:06Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.