Fugu-MT Paper Translation (Abstract): WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Paper summary: WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

arxiv url: http://arxiv.org/abs/2103.06561v2
Date: Sat, 13 Mar 2021 07:52:50 GMT
Status: Translation complete
Last updated in system: 2021-03-16 11:54:59.685902
Title: WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
Title (reference translation): WenLan: Bridging Vision and Language via Large-Scale Multi-Modal Pre-Training
Authors: Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang, Yuqing Song, Xin Hong, Wanqing Cui, Danyang Hou, Yingyan Li, Junyi Li, Peiyu Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, Shizhe Chen, Zhiwu Lu, Zhicheng Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, and Ji-Rong Wen
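Abstract summary: We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo to the cross-modal scenario. By building a large queue-based dictionary, BriVL can incorporate more negative samples under limited GPU resources.
Reference score (independently computed attention score): 71.37731379031487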
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples in limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
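Abstract (reference translation): In recent years, multi-modal pre-training models have been actively studied as a way to bridge vision and language. Most of them, however, explicitly model the cross-modal interaction between image-text pairs, assuming that a strong semantic correlation exists between the text and image modalities. Because this strong assumption often fails in real-world scenarios, we choose to model the cross-modal correlation implicitly for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, under a weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo to the cross-modal scenario. By building a large queue-based dictionary, BriVL can incorporate more negative samples under limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training the BriVL model. Extensive experiments show that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.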
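
The abstract describes a MoCo-style queue of negative samples used in a cross-modal contrastive setting. The following is a minimal NumPy sketch of that general idea, not the BriVL implementation: the function names, the temperature of 0.07, the momentum of 0.99, and the choice to show only the image-to-text direction are all assumptions made for illustration. It shows how a queue of older text embeddings can supply many negatives per image without enlarging the batch.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length so dot products are cosines."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def info_nce_with_queue(queries, keys, queue, temperature=0.07):
    """InfoNCE loss with queued negatives (illustrative, hypothetical helper).

    queries: (B, D) image embeddings from the online image tower.
    keys:    (B, D) embeddings of the paired texts (assumed to come from
             a momentum-updated text tower, as in MoCo).
    queue:   (K, D) text embeddings from earlier batches, used as negatives.
    """
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    neg = l2_normalize(queue)
    pos_logits = np.sum(q * k, axis=1, keepdims=True)  # (B, 1)
    neg_logits = q @ neg.T                             # (B, K)
    logits = np.concatenate([pos_logits, neg_logits], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    # The positive pair always sits in column 0.
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())


def momentum_update(online_params, momentum_params, m=0.99):
    """Exponential moving average of the online weights (m is an assumed value)."""
    return m * momentum_params + (1.0 - m) * online_params


def enqueue(queue, new_keys, max_size):
    """Append the newest keys and keep only the most recent max_size rows."""
    return np.concatenate([queue, new_keys], axis=0)[-max_size:]
```

In this sketch the number of negatives per query is the queue length K rather than the batch size B, which is the property the abstract attributes to the queue-based dictionary. A symmetric text-to-image loss with its own queue of image embeddings could be added in the same way.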

Related papers