Fugu-MT Paper Translation (Abstract): Understanding Mobile GUI: from Pixel-Words to Screen-Sentences

Paper overview: Understanding Mobile GUI: from Pixel-Words to Screen-Sentences

arxiv url: http://arxiv.org/abs/2105.11941v1
Date: Tue, 25 May 2021 13:45:54 GMT
Status: Translation complete
Last updated in system: 2021-05-26 13:46:14.450762
Title: Understanding Mobile GUI: from Pixel-Words to Screen-Sentences
Title (reference translation): Understanding Mobile GUI: from Pixel-Words to Screen-Sentences
Authors: Jingwen Fu, Xiaoyi Zhang, Yuwang Wang, Wenjun Zeng, Sam Yang and Grayson Hilliard
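Abstract summary: We propose a mobile GUI understanding architecture, Pixel-Words to Screen-Sentence (PW2SS). Pixel-Words are defined as atomic visual components that are visually consistent and semantically clear across screenshots. Metadata available in the training data can be used to auto-generate high-quality annotations for Pixel-Words.
Reference score (independently computed attention score): 48.97215653702567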
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The ubiquity of mobile phones makes mobile GUI understanding an important task. Most previous work in this domain requires human-created screen metadata (e.g., View Hierarchy) during inference, which unfortunately is often unavailable or not reliable enough for GUI understanding. Inspired by the impressive success of Transformers in NLP tasks and targeting purely vision-based GUI understanding, we extend the concepts of Words/Sentence to Pixel-Words/Screen-Sentence and propose a mobile GUI understanding architecture: Pixel-Words to Screen-Sentence (PW2SS). In analogy to individual Words, we define Pixel-Words as atomic visual components (text and graphic components) that are visually consistent and semantically clear across screenshots in a large variety of design styles. The Pixel-Words extracted from a screenshot are aggregated into a Screen-Sentence by a Screen Transformer proposed to model their relations. Because Pixel-Words are defined as atomic visual components, the ambiguity between their visual appearance and their semantics is dramatically reduced. We are able to use metadata available in the training data to auto-generate high-quality annotations for Pixel-Words. A dataset of screenshots with Pixel-Words annotations, RICO-PW, is built on the public RICO dataset and will be released to help address the lack of high-quality training data in this area. We train a detector on this dataset to extract Pixel-Words from screenshots and achieve metadata-free GUI understanding during inference. We conduct experiments and show that Pixel-Words can be extracted well on RICO-PW and generalize well to a new dataset, P2S-UI, which we collected. The effectiveness of PW2SS is further verified on GUI understanding tasks including relation prediction, clickability prediction, screen retrieval, and app type classification.
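Abstract (reference translation): The ubiquity of mobile phones makes mobile GUI understanding an important task. Most previous work in this domain requires human-created screen metadata (e.g., View Hierarchy). Unfortunately, such metadata is often not reliable enough for GUI understanding. Inspired by the success of Transformers in NLP tasks and aiming at purely vision-based GUI understanding, we extend the concepts of Words/Sentence to Pixel-Words/Screen-Sentence and propose a mobile GUI understanding architecture, Pixel-Words to Screen-Sentence (PW2SS). By analogy with individual words, we define Pixel-Words as atomic visual components (text and graphic components) that are visually consistent and semantically clear across screenshots in a variety of design styles. The Pixel-Words extracted from a screenshot are aggregated into a Screen-Sentence by the Screen Transformer proposed to model their relations. Because Pixel-Words are defined as atomic visual components, the ambiguity between visual appearance and semantics is dramatically reduced. Metadata available in the training data can be used to auto-generate high-quality annotations for Pixel-Words. RICO-PW, a dataset of screenshots with Pixel-Words annotations, is built on the public RICO dataset. We train a detector on this dataset to extract Pixel-Words from screenshots and achieve metadata-free GUI understanding during inference. We conduct experiments and show that Pixel-Words can be extracted well on RICO-PW and generalize well to a new dataset, P2S-UI. The effectiveness of PW2SS is further verified on GUI understanding tasks including relation prediction, clickability prediction, screen retrieval, and app type classification.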

Related papers