Fugu-MT Paper Translation (Abstract): Visually Grounded Reasoning across Languages and Cultures

Paper summary: Visually Grounded Reasoning across Languages and Cultures

arxiv url: http://arxiv.org/abs/2109.13238v1
Date: Tue, 28 Sep 2021 16:51:38 GMT
Status: Translation complete
System update date: 2021-09-29 15:28:33.255846
Title: Visually Grounded Reasoning across Languages and Cultures
Title (reference translation): Visual Reasoning across Languages and Cultures
Authors: Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, Desmond Elliott
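Abstract summary: We develop a new protocol for constructing an ImageNet-style hierarchy that is representative of more languages and cultures. We focus on a typologically diverse set of languages: Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. By eliciting statements about pairs of images from native speaker annotators, we create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL).
Reference score (independently computed attention score): 27.31020761908739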
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet. While one can hardly overestimate how much this benchmark contributed to progress in computer vision, it is mostly derived from lexical databases and image queries in English, resulting in source material with a North American or Western European bias. Therefore, we devise a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures. In particular, we let the selection of both concepts and images be entirely driven by native speakers, rather than scraping them automatically. Specifically, we focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. On top of the concepts and images obtained through this new protocol, we create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native speaker annotators about pairs of images. The task consists of discriminating whether each grounded statement is true or false. We establish a series of baselines using state-of-the-art models and find that their cross-lingual transfer performance lags dramatically behind supervised performance in English. These results invite us to reassess the robustness and accuracy of current state-of-the-art models beyond a narrow domain, but also open up new exciting challenges for the development of truly multilingual and multicultural systems.
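Abstract (reference translation): The design of widely used vision-and-language datasets and pre-trained encoders either directly adopts the concepts and images of ImageNet or draws inspiration from them. The contribution of this benchmark to progress in computer vision can hardly be overestimated, but it is derived mainly from English lexical databases and image queries, so its source material carries a North American or Western European bias. We therefore devise a new protocol for constructing an ImageNet-style hierarchy that is representative of more languages and cultures. In particular, the selection of concepts and images is driven entirely by native speakers rather than scraped automatically. Specifically, we focus on a typologically diverse set of languages: Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. Based on the concepts and images obtained with this new protocol, we create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements about pairs of images from native speaker annotators. The task is to determine whether each grounded statement is true or false. We establish a series of baselines using state-of-the-art models and find that their cross-lingual transfer performance lags dramatically behind supervised performance in English. These results invite a reassessment of the robustness and accuracy of current state-of-the-art models beyond a narrow domain, and they also raise new and exciting challenges for developing truly multilingual and multicultural systems.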

Related papers