People perceive the world with multiple senses (e.g., through hearing sounds,
reading words and seeing objects). However, most existing AI systems only
process an individual modality. This paper presents an approach that excels at
handling multiple modalities of information with a single model. In our "SkillNet" model, different parts of the parameters are specialized for
processing different modalities. Unlike traditional dense models that always
activate all the model parameters, our model sparsely activates parts of the
parameters whose skills are relevant to the task. Such model design enables
SkillNet to learn skills in a more interpretable way. We develop our model for
five modalities including text, image, sound, video and code. Results show
that SkillNet performs comparably to five modality-specific fine-tuned models.
Moreover, our model supports self-supervised pretraining in the same sparsely activated way, resulting in better-initialized parameters for different
modalities. We find that pretraining significantly improves the performance of
SkillNet on five modalities, on par with or even better than baselines with
modality-specific pretraining. On the task of Chinese text-to-image retrieval,
our final system achieves higher accuracy than existing leading systems
including WukongViT-B and Wenlan 2.0, while using fewer activated parameters.
1 Introduction

In recent years, Transformer [40] and Transformer-based pretrained models [12, 35] have revolutionized natural language processing [33], and there has been growing interest in extending the successful paradigm to broader artificial intelligence areas including computer vision [8, 23, 32], speech processing [4] and program analysis [18].
Researchers from different communities, despite having no communication barrier, typically repeat the same process: pretraining for each modality and fine-tuning all the model parameters for each task.
∗Correspondence to: Duyu Tang (duyutang@tencent.com). ∗ indicates equal contribution.
new things quickly. However, existing methods typically learn for each task from scratch (or from a general or foundation model), resulting in hundreds of models for hundreds of tasks.
In SkillNet, different parts of the parameters are specialized for different skills.
When the model is applied to a downstream task, unlike traditional “dense” models that always activate all the model parameters, it “sparsely” activates parts of the parameters whose skills are relevant to the target task.
For example, we could define five modality-related skills {s_text, s_image, s_sound, s_video, s_code}, which are specialized for understanding text, image, sound, video and code, respectively.
Figure 1 gives high-level illustrations of the aforementioned situations.
There are many different ways to implement SkillNet.
In this work, we provide a simple implementation on top of Transformer [40].
Instead of producing general K/Q/V vectors for each token, we activate different modality-specific parameters to produce different modality-specific K/Q/V vectors before conducting multi-head attention.
The intuition is that we expect the model to call upon different parts as needed to process different types of signals and combine information from multiple senses to form our understanding about a concept (like the aforementioned example about the concept of dog).
We conduct experiments on tasks of five modalities, including text classification, automatic speech recognition, text-to-image retrieval, text-to-video retrieval and text-to-code retrieval.
On the task of Chinese text-to-image retrieval, SkillNet obtains higher accuracy than existing systems (e.g., WukongViT-B and Wenlan 2.0) while using fewer activated parameters.
Our work demonstrates the feasibility of developing one general model that is both accurate and efficient in tackling multiple tasks of different modalities.
2 Comparison to Existing Methods
We describe the connections and differences of this work to related multimodal, multitask and mixture-of-experts methods.
Multitask This work also relates to multitask learning methods.
Systems built upon Transformer typically use a shared feature encoder plus task-specific prediction layers for understanding tasks [29] and use natural language prompts to steer an encoder-decoder model for generation tasks [37].
This work can be viewed as an extension to the multimodal situation.
Mixture-of-Experts (MoE) Transformer-based MoE methods typically include multiple homogeneous neural networks (called experts), which can be fully or partially activated, guided by an additional gating function [15, 16, 27, 38].
However, it is unclear what type of knowledge is learned in each expert and why an expert is activated.
From this point of view, our approach can be viewed as a sparse multimodal MoE.
Unlike traditional MoE methods, each expert in our model has a clear definition and the activation of each expert has a clear reason (judged by human experts).
3 Method

This section presents our sparsely activated model, SkillNet.
We first give a brief background on Transformer (§3.1).
Then, we describe the model architecture of SkillNet (§3.2).
Finally, we describe how to produce the embeddings for different modalities (§3.3).
3.1 Background on Transformer
To make our paper self-contained, we briefly describe Transformer here.
Transformer [40] is a commonly used model architecture with multiple layers, each of which consists of a multi-head attention layer followed by a feed-forward network (FFN) layer.
Specifically, we modify the multi-head attention of each Transformer layer as follows.
Instead of producing general K/Q/V vectors for each token, we activate different modality-specific parameters to produce different modality-specific K/Q/V vectors before conducting multi-head attention.
Instead of having only one projection matrix W_Q for all queries, we have five projection matrices {W_Q^text, W_Q^image, W_Q^sound, W_Q^video, W_Q^code}, each of which stands for a skill of understanding the information of a particular modality.
When the model is applied to a task, we only activate the corresponding projection matrices of relevant skills.
Similar modifications are made for keys and values.
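To make the sparse activation concrete, the following PyTorch sketch implements one modality-specific multi-head attention block. It is a minimal illustration of the mechanism described above, not the released implementation; the class name, the `modality` argument and the default sizes are chosen for illustration only.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

MODALITIES = ["text", "image", "sound", "video", "code"]

class ModalitySpecificAttention(nn.Module):
    """Multi-head self-attention whose Q/K/V projections are selected per modality."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # One projection per modality for queries, keys and values,
        # i.e. {W_Q^text, ..., W_Q^code}, {W_K^text, ...}, {W_V^text, ...}.
        self.w_q = nn.ModuleDict({m: nn.Linear(hidden_size, hidden_size) for m in MODALITIES})
        self.w_k = nn.ModuleDict({m: nn.Linear(hidden_size, hidden_size) for m in MODALITIES})
        self.w_v = nn.ModuleDict({m: nn.Linear(hidden_size, hidden_size) for m in MODALITIES})
        self.w_o = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Sparse activation: only the projections of the relevant skill are used.
        bsz, seq_len, hidden = x.shape
        q, k, v = self.w_q[modality](x), self.w_k[modality](x), self.w_v[modality](x)

        def split(t):  # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
            return t.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        ctx = F.softmax(scores, dim=-1) @ v
        ctx = ctx.transpose(1, 2).reshape(bsz, seq_len, hidden)
        return self.w_o(ctx)

# One shared layer serves all modalities; the modality string decides which
# projection matrices are activated.
layer = ModalitySpecificAttention()
text_out = layer(torch.randn(2, 16, 768), modality="text")
image_out = layer(torch.randn(2, 197, 768), modality="image")
print(text_out.shape, image_out.shape)
```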
Figure 3: Architecture of SkillNet for image retrieval.
Text encoder and image encoder are two pathways of one shared model: s_text and s_image are activated for the text encoder and the image encoder, respectively.
As shown in Figure 4, we only need one model to handle the task of image retrieval, where we activate s_text and s_image for the text encoder and the image encoder, respectively.
3.3 Embeddings

We describe how to produce the embeddings for different modalities.
Text Following BERT [12], we tokenize a text into a sequence of wordpiece tokens [45] and build the embedding of each wordpiece by adding up its token embedding, position embedding and segment embedding.
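As a concrete sketch of these text embeddings (token, position and segment embeddings summed), consider the snippet below; the vocabulary size, maximum length and the final LayerNorm are illustrative assumptions rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    """BERT-style text embeddings: token + position + segment."""

    def __init__(self, vocab_size=21128, max_len=512, hidden_size=768, type_vocab_size=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        self.pos_emb = nn.Embedding(max_len, hidden_size)
        self.seg_emb = nn.Embedding(type_vocab_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        x = self.token_emb(token_ids) + self.pos_emb(positions) + self.seg_emb(segment_ids)
        return self.norm(x)

emb = TextEmbedding()
ids = torch.randint(0, 21128, (2, 16))       # (batch, wordpiece ids)
segs = torch.zeros(2, 16, dtype=torch.long)  # single-segment input
print(emb(ids, segs).shape)                  # torch.Size([2, 16, 768])
```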
Sound We use seven convolutions with 512 channels, strides of (5, 2, 2, 2, 2, 2, 2) and kernel widths of (10, 3, 3, 3, 3, 2, 2) to generate a vector sequence with a frame rate of about 20ms from raw audio sampled at 16kHz.
After that, we adopt a 1D convolutional network to transform the vector sequence into 768-dimensional embeddings, which are summed with their corresponding position embeddings to form the final sound embeddings.
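The sound embedding pipeline can be sketched as follows; the module name, the GELU activations and the learned position-embedding table are simplifications chosen for illustration, not details taken from the released implementation.

```python
import torch
import torch.nn as nn

class SoundEmbedding(nn.Module):
    """Convolutional feature encoder for raw audio, as described above."""

    def __init__(self, hidden_size=768, max_frames=4096):
        super().__init__()
        strides = (5, 2, 2, 2, 2, 2, 2)
        kernels = (10, 3, 3, 3, 3, 2, 2)
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers += [nn.Conv1d(in_ch, 512, kernel_size=k, stride=s), nn.GELU()]
            in_ch = 512
        # Seven 1D convolutions; total stride 320 samples = 20 ms at 16 kHz.
        self.feature_encoder = nn.Sequential(*layers)
        # 1D convolution projecting the 512-d features to 768-d sound embeddings.
        self.proj = nn.Conv1d(512, hidden_size, kernel_size=1)
        self.pos_emb = nn.Embedding(max_frames, hidden_size)

    def forward(self, waveform):                         # (batch, samples) at 16 kHz
        x = self.feature_encoder(waveform.unsqueeze(1))  # (batch, 512, frames)
        x = self.proj(x).transpose(1, 2)                 # (batch, frames, 768)
        pos = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        return x + self.pos_emb(pos)

embed = SoundEmbedding()
audio = torch.randn(2, 16000)                            # one second of 16 kHz audio
print(embed(audio).shape)                                # torch.Size([2, 49, 768])
```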
Figure 4: An illustration of the pipeline and the embeddings of different modalities.
Image Following Vision Transformer (ViT) [13], we build patch embeddings for each image.
We first reshape each image x ∈ R^{H×W×C} into a sequence of 2D patches x_p ∈ R^{N×(P^2·C)}, where (H, W) is the image resolution, (P, P) is the resolution of each patch, N is the number of patches and C is the number of image channels (e.g., 3 for RGB).
Then, a 2D convolutional network is applied to transform patch pixels into 768-dimensional embeddings, which are added to the corresponding position embeddings to form the final patch embeddings.
We add a special token [CLSimage] at the beginning of each sequence to produce the representation of the image.
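A minimal sketch of this image embedding (16×16 patches produced by a 2D convolution, a prepended [CLSimage] token and added position embeddings); the names and default sizes are illustrative.

```python
import torch
import torch.nn as nn

class ImageEmbedding(nn.Module):
    """ViT-style patch embeddings: Conv2d patchify + [CLS_image] + positions."""

    def __init__(self, image_size=224, patch_size=16, channels=3, hidden_size=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2   # 196 for a 224x224 image
        # Kernel size = stride = patch size, so each output position is one patch.
        self.patch_conv = nn.Conv2d(channels, hidden_size,
                                    kernel_size=patch_size, stride=patch_size)
        self.cls_image = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_size))

    def forward(self, images):                            # (batch, 3, H, W)
        x = self.patch_conv(images)                       # (batch, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)                  # (batch, 196, 768)
        cls = self.cls_image.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_emb  # (batch, 197, 768)

embed = ImageEmbedding()
print(embed(torch.randn(2, 3, 224, 224)).shape)           # torch.Size([2, 197, 768])
```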
Video We follow Vivit [2], an extension of ViT for video, to produce video embeddings.
Given a video V ∈ R^{T×H×W×C}, where T is the number of sampled frames, we extract [T/t] · [H/h] · [W/w] non-overlapping spatio-temporal “tubes” and use a 3D convolution to produce a representation for each tube.
We further add [T/t] + [H/h] + [W/w] positional embeddings and concatenate a special token [CLSvideo] at the beginning of each sequence to represent the whole video input.
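The video embedding can be sketched analogously with a 3D convolution over non-overlapping tubes; for simplicity the sketch below uses one learned position embedding per tube instead of the factorized [T/t] + [H/h] + [W/w] scheme described above, and the names are illustrative.

```python
import torch
import torch.nn as nn

class VideoEmbedding(nn.Module):
    """ViViT-style tubelet embeddings: Conv3d over (3, 16, 16) tubes + [CLS_video]."""

    def __init__(self, frames=6, image_size=224, hidden_size=768):
        super().__init__()
        self.tube_conv = nn.Conv3d(3, hidden_size,
                                   kernel_size=(3, 16, 16), stride=(3, 16, 16))
        num_tubes = (frames // 3) * (image_size // 16) ** 2  # 2 * 14 * 14 = 392
        self.cls_video = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.pos_emb = nn.Parameter(torch.zeros(1, num_tubes + 1, hidden_size))

    def forward(self, video):                            # (batch, 3, T, H, W)
        x = self.tube_conv(video)                        # (batch, 768, T/3, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)                 # (batch, tubes, 768)
        cls = self.cls_video.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_emb

embed = VideoEmbedding()
clip = torch.randn(2, 3, 6, 224, 224)                    # 6 sampled frames per video
print(embed(clip).shape)                                 # torch.Size([2, 393, 768])
```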
4 Tasks

In this section, we first describe downstream tasks involving five modalities in §4.1.
Each modality relates to an active research area that covers many tasks.
We select one task for each modality, with preferences for well-recognized tasks (e.g., ASR) and tasks that relate to multiple modalities (e.g., video/code retrieval).
Considering the efficiency of the inference stage, we use two separate passes (like a Siamese network) to produce text and image vectors separately, with no cross-modality attention.
Notably, we use the same model with different activation configurations (i.e., s_text is activated for text and s_image is activated for image) to produce text and image vectors.
The semantic similarity between a text and an image is calculated with dot product or cosine function.
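Given the pooled vectors produced by the two pathways, retrieval reduces to a text-image similarity matrix. A minimal sketch (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def retrieval_scores(text_vecs, image_vecs, use_cosine=True):
    """Score every text against every image from the two encoding passes.

    text_vecs:  (num_texts, hidden) pooled vectors from the s_text pathway.
    image_vecs: (num_images, hidden) pooled vectors from the s_image pathway.
    Returns a (num_texts, num_images) similarity matrix.
    """
    if use_cosine:
        text_vecs = F.normalize(text_vecs, dim=-1)
        image_vecs = F.normalize(image_vecs, dim=-1)
    return text_vecs @ image_vecs.t()   # dot product (cosine when normalized)

# Toy example: 2 queries ranked against 5 candidate images.
scores = retrieval_scores(torch.randn(2, 768), torch.randn(5, 768))
print(scores.shape, scores.argmax(dim=-1))  # (2, 5) and the top-1 image per query
```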
Video We consider text-to-video retrieval.
Given a text as the query, the task is to find the target video from a set of candidates.
The framework is similar to the aforementioned image retrieval.
We use the same model with different activated parameters (i.e., s_text is activated for text and s_video is activated for video) to produce text and video vectors separately.
Each masked token is replaced with a special [MASK] token 80% of the time, a random token 10% of the time, and left unchanged for the remaining 10% of the time.
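A sketch of this masking policy is given below; the 15% selection rate is the standard BERT value and is an assumption here, since the text above only specifies the 80/10/10 replacement rule.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Corrupt a batch of token ids with the 80/10/10 replacement policy.

    Returns the corrupted ids and the labels (-100 at unselected positions,
    which is the value ignored by the cross-entropy loss).
    """
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mask_prob
    labels[~selected] = -100

    corrupted = input_ids.clone()
    rand = torch.rand(input_ids.shape)
    corrupted[selected & (rand < 0.8)] = mask_token_id            # 80%: [MASK]
    random_tokens = torch.randint(0, vocab_size, input_ids.shape)
    use_random = selected & (rand >= 0.8) & (rand < 0.9)          # 10%: random token
    corrupted[use_random] = random_tokens[use_random]             # remaining 10%: unchanged
    return corrupted, labels

ids = torch.randint(5, 1000, (2, 16))
corrupted, labels = mask_tokens(ids, mask_token_id=103, vocab_size=1000)
print((labels != -100).float().mean())   # fraction of positions selected for prediction
```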
Sound We develop a simplified version of HuBERT [25] and pretrain through predicting the categories of the masked sound tokens, whose target labels are produced with an offline clustering process.
We use the same masking strategies of wav2vec2 [4], where about 5% of the time-steps are randomly sampled as start indices and the subsequent 10 time-steps are masked.
The pretraining task is to predict the masked tokens.
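The span masking used for sound pretraining (about 5% of time-steps sampled as start indices, each followed by 10 masked time-steps) can be sketched as follows; this is an illustration, not the exact wav2vec2 implementation.

```python
import torch

def sample_span_mask(batch_size, num_frames, start_prob=0.05, span_length=10):
    """Return a boolean (batch, frames) mask where True marks masked time-steps."""
    starts = torch.rand(batch_size, num_frames) < start_prob
    mask = torch.zeros(batch_size, num_frames, dtype=torch.bool)
    for offset in range(span_length):
        # A start at position i masks positions i, i+1, ..., i+span_length-1.
        mask[:, offset:] |= starts[:, : num_frames - offset]
    return mask

mask = sample_span_mask(batch_size=2, num_frames=200)
print(mask.float().mean())   # roughly 40% of frames end up masked (spans overlap)
```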
5 Experiments

5.1 Setup
We compare to the following baselines.
• Modality-specific models.
We train five different models for different modalities.
The model architecture for each modality is the standard Transformer.
• Dense multimodal baseline.
We train a multimodal model that jointly learns for five modalities.
This is a dense model in that all these modalities share a common standard Transformer architecture, which is equivalent to SkillNet with only one skill, and that skill is always activated.
Since the parameters of SkillNet can be pretrained (as described in §4.2), we have two model configurations, depending on whether the parameters are pretrained in the same sparsely activated manner.
For image, we compare to WukongViT-B [21], which has a similar model scale (with 12 Transformer layers) and is pretrained with a superset of our image pretraining data.
Details about the datasets and training process are given in the Appendix.
5.2 Results and Analysis

Table 2 gives the results on five tasks.
Systems in the first group are not pretrained.
We can see that SkillNet performs comparably to modality-specific models.
An interesting finding is that the joint model with a dense encoder is not friendly to low-resource tasks like text-to-video, but this phenomenon does not occur in either the MoE system or SkillNet.
On the task of text-to-image retrieval, SkillNet achieves better accuracy than existing leading systems while using fewer activated parameters.
The parameters of Wenlan 2.0 [17] include three parts, an image encoder consisting of an EfficientNet-B7 [39] (66M) and four Transformer encoder layers (50M), a text encoder RoBERTa-Large [10] (326M) and a cross-modal projection layer with two fully-connected layers (3M).
WukongViT-B [21] includes a Vision Transformer (ViT) [14] (86M) as the image encoder, a standard decoder-only transformer (110M) as the text encoder and a linear cross-modal projection layer (0.6M).
We further show that sparse pretraining gives better initialized parameters, which leads to improved accuracy, even better than modality-specific pretraining on three of the five modalities.
Figure 8: Case study for text-to-code retrieval.
For each query, we show top-3 returned codes and the relevance scores returned by SkillNet.
On Chinese text-to-image retrieval, our final system yields better accuracy with fewer activated parameters compared to existing leading systems.
International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pages 1–5. IEEE, 2017.

[10] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. Revisiting pre-trained models for Chinese natural language processing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 657–668, Online, November 2020. Association for Computational Linguistics.

[11] Jeff Dean. Introducing Pathways: A next-generation AI architecture. Google Blog, 2021. URL https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/.

[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[15] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. arXiv preprint arXiv:2112.06905, 2021.

[16] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.

[17] Nanyi Fei, Zhiwu Lu, Yizhao Gao, Guoxing Yang, Yuqi Huo, Jingyuan Wen, Haoyu Lu, Ruihua Song, Xin Gao, Tao Xiang, et al. WenLan 2.0: Make AI imagine via a multimodal foundation model. arXiv preprint arXiv:2110.14378, 2021.

[18] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155, 2020.
[19] Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, and Ishan Misra. Omnivore: A single model for many visual modalities. arXiv preprint arXiv:2201.08377, 2022.

[20] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376, 2006.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[23] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
[24] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

[25] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
Deep learning–based text classification: A comprehensive review. ACM Computing Surveys (CSUR), 54(3):1–40, 2021.

[35] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 2021.
[41] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591, 2019.

[42] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2020.

[43] Jiahong Wu, He Zheng, Bo Zhao, Yixin Li, Baoming Yan, Rui Liang, Wenjia Wang, Shipei Zhou, Guosen Lin, Yanwei Fu, et al. AI Challenger: A large-scale dataset for going deeper in image understanding. arXiv preprint arXiv:1711.06475, 2017.

[44] Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, et al. CLUE: A Chinese language understanding evaluation benchmark. arXiv preprint arXiv:2004.05986, 2020.

[45] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[46] Fan Zhang, Duyu Tang, Yong Dai, Cong Zhou, Shuangzhi Wu, and Shuming Shi.
Since some videos are unavailable because they have been deleted or hidden by either YouTube or the users, we actually obtain 23,453 videos for training and 2,709 videos for validation.
Then, we split each image into 196 patches with a patch size of 16 × 16, which are fed into a 2D convolution with 3 input channels, 768 output channels, a kernel size of (16, 16) and a stride of (16, 16).
For video, we truncate each video to no more than 10 seconds and extract frames at 3 frames per second.
Then, we randomly sample 6 frames for each video.
At last, the 6 cropped and normalized video frames are fed into a 3D convolution with 3 input channels, 768 output channels, a kernel size of (3, 16, 16) and a stride of (3, 16, 16).
There are different ways to initialize the model parameters.
To accelerate the training process, instead of training from random initialization, we use ViT-B/16 from CLIP [36] to initialize image-related parameters and initialize other parameters from scratch.
Since different modalities have different memory costs, we set the batch sizes to 512/1024/3072/1024/512 for text/sound/image/video/code to maximize the memory usage of GPUs.
We observe that the sound and code modalities require more training steps to converge, and the data scale of video is smaller than that of the other modalities.
For image, we download the Wukong dataset [21], which originally includes 101,483,885 text-image pairs, and filter out low-quality instances that contain no Chinese words, contain too many illegal symbols, or have captions shorter than 4 characters.
We finally use about 84,000,000 text-image pairs for pretraining.
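A rough sketch of such a filtering rule is shown below; the definition of "illegal symbols" and the 30% threshold are assumptions made for illustration, since they are not specified above.

```python
import re

def keep_pair(caption: str, max_illegal_ratio: float = 0.3) -> bool:
    """Keep a text-image pair only if its caption passes the quality checks above."""
    if len(caption) < 4:                              # caption too short
        return False
    if not re.search(r"[\u4e00-\u9fff]", caption):    # no Chinese characters
        return False
    # Count symbols outside Chinese characters, word characters, whitespace
    # and common punctuation (the "illegal symbol" set is an assumption).
    illegal = re.findall(r"[^\w\u4e00-\u9fff\s，。！？、：；()（）]", caption)
    return len(illegal) / len(caption) <= max_illegal_ratio

print(keep_pair("一只小狗在草地上和人玩飞盘"))   # True
print(keep_pair("###%%%@@@"))                    # False (no Chinese characters)
```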
For video, we use WebVid-2M [7], which comprises over two million video-text pairs scraped from the internet.
We translate the original English texts to Chinese by the translation tool Transmart and use the translated data for pretraining.
For code pretraining, we hold out 800,000 code-text pairs from the aforementioned code dataset translated from PyTorrent, which have no overlaps with the datasets used for the downstream task of text-code retrieval.