We present a novel attention mechanism: Causal Attention (CATT), to remove the ever-elusive confounding effect in existing attention-based vision-language models. This effect causes harmful bias that misleads the attention module to focus on the spurious correlations in training data, damaging the model generalization. As the confounder is unobserved in general, we use the front-door adjustment to realize the causal intervention, which does not require any knowledge of the confounder. Specifically, CATT is implemented as a combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention (CS-ATT), where the latter forcibly brings other samples into every IS-ATT, mimicking the causal intervention. CATT abides by the Q-K-V convention and hence can replace any attention module, such as top-down attention and self-attention in Transformers. CATT improves various popular attention-based vision-language models by considerable margins. In particular, we show that CATT has great potential in large-scale pre-training, e.g., it can promote the lighter LXMERT~\cite{tan2019lxmert}, which uses less data and less computational power, to be comparable to the heavier UNITER~\cite{chen2020uniter}. Code is published at \url{https://github.com/yangxuntu/catt}.
2Futurewei Technologies
3Faculty of Information Technology, Monash University, Australia
s170018@e.ntu.edu.sg, hanwangzhang@ntu.edu.sg, guojunq@gmail.com, Jianfei.Cai@monash.edu
1. Introduction
Stemming from the strong cognitive evidence in selective signal processing [64, 54], the attention mechanism has arguably become the most indispensable module in vision and language models [71, 5, 3, 16, 11, 39]. Although its idiosyncratic formulation varies from task to task, its nature can be summarized as the following common Q-K-V notation: given a query q, the attention mechanism associates q to each feature value vi by using the normalized attentive weight αi ∝ qT ki, where ki is the key function of the i-th value; thus, the resultant selective feature value — the attention — is Σi αi vi.
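To make this Q-K-V notation concrete, here is a minimal NumPy sketch of the generic attention operation; the variable names and toy shapes are illustrative assumptions, not taken from any specific model in the paper.

```python
import numpy as np

def qkv_attention(q, K, V):
    """Generic soft attention: alpha_i ∝ exp(q^T k_i); output = sum_i alpha_i v_i."""
    scores = K @ q                         # one logit per key, shape (n,)
    alpha = np.exp(scores - scores.max())  # numerically stable softmax
    alpha = alpha / alpha.sum()            # normalized attentive weights
    return alpha @ V                       # weighted sum of the values, shape (d,)

# Toy usage: a query attending over 5 value vectors of dimension 4.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
attended = qkv_attention(q, K, V)
```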
In a modern view, the attention can be understood as a feature transformer that encodes the input query q by using the given values V = {vi} [65].
Figure 1. Top: an example of an image captioner with self-attention and top-down attention modules; bottom: the causal graph of attention. The reason why the prediction is "riding" but not "driving" is explained in Figure 3.
Taking image captioning as an example in Figure 1, if q and V are both encoded from the input X, e.g., the RoI features of an image, we call it self-attention; if q is changed to the sentence context, we call it top-down attention. Intuitively, self-attention is usually viewed as a non-local [70] (or graph [7]) convolution network that enriches each local value with global relationship features; top-down attention is used to enrich the context with the cross-domain relationship features [3].
As a bridge connecting the input feature X and the output label Y , the quality of attention — how reasonable the attentive weight α is — plays a crucial role for the overall performance.
However, due to the fact that the attention weights are unsupervised, e.g., there is no word-region grounding for the top-down attention or relationship dependency annotation for the self-attention, the weights will inevitably be misled by the dataset bias.
For example, as shown in Figure 1, since there are many images captioned with “person riding horse” in the training data, self-attention learns to infer “riding” by building the dependency between “person” and “horse”.
Then, given a test image with "person driving carriage", this self-attention still tends to relate "person" with "horse" to infer "riding", while ignoring the "carriage".
Figure 2. Before pre-training (e.g., LXMERT [61]), attentions are correct (blue). After pre-training, attentions are wrong (red). This is because the co-occurrences of some concepts appear much more often than others, e.g., "Sport+Man" appears 213 times more than "Sport+Screen" in the pre-training data.
Unfortunately, such bias cannot be mitigated by simply enlarging the dataset scale, as most of the bias abides by the data nature — Zipf’s law [51] and social conventions [19] — there are indeed more “red apple” than “green apple” or “person standing” than “person dancing”.
Therefore, as shown in Figure 2, large-scale pretraining may lead to even worse attentions.
The dataset bias is essentially caused by the confounder, a common cause that makes X and Y correlated even if X and Y have no direct causation.
We illustrate this crucial idea in Figure 3.
Suppose that the confounder C is the common sense1 "person can ride horse"; C → X denotes that a visual scene is generated by such knowledge, e.g., the dataset curator observes and captures the common sense; X → M denotes the fact that the objects M = {person, horse} can be detected (e.g., by Faster R-CNN [52]), whose object inventory is determined by C → M; and M → Y denotes the language generation for "person riding horse".
Note that besides the legitimate causal path from image X via object M to Y , the “backdoor” path X ← C → M → Y also contributes an effect to Y .
Therefore, if we only train the model based on the correlation P (Y |X) without knowing the confounding effect, no matter how large the amount of training data is, the model can never identify the true causal effect from X to Y [44, 56].
For example, if the confounder distribution varies from training to testing, e.g., the common sense "person can ride horse" is dominantly more frequent than the common sense "person can drive carriage" in training, but the latter is more frequent than the former in testing, then the P(Y|X) based on "person can ride horse" learned in training will no longer be applicable in testing [45].
In this paper, we propose a novel attention mechanism called Causal Attention (CATT), which can help the models identify the causal effect between X and Y, and thus mitigates the bias caused by confounders.
Figure 3. This expands the causal links of the confounding path X ↔ Y in Figure 1.
1It is also well-known as the disentangled causal mechanism [60].
It is based on the front-door adjustment principle that does not require the assumption of any observed confounder [43], and thus CATT can be applied in any domain where the attention resides.
In this way, CATT is fundamentally different from existing deconfounding methods based on the backdoor adjustment [83, 69], which has to be domain-specific to comply with the observed-confounder assumption.
Specifically, we first show that the conventional attention is indeed an improper approximation of the front-door principle, and then we show what is a proper one, which underpins CATT theoretically (Section 3.1).
In particular, the parameters of the Q-K-V operations can also be shared between both IS-ATT and CS-ATT to further improve the efficiency in some architectures.
We replace the conventional attention with CATT in various vision-language models to validate its effectiveness, including the classic Bottom-Up Top-Down LSTM [3], Transformer [65], and a large-scale vision-language pre-training (VLP) model, LXMERT [61].
The experimental results demonstrate that our CATT can achieve consistent improvements for all of them.
Significantly, our light LXMERT+CATT outperforms the heavy UNITER [14] on VQA2.0, i.e., 73.04% vs. 72.91% on the test-std split, and on NLVR2, i.e., 76.0% vs. 75.80% on the test-P split, while requiring a much lower pre-training burden: 624 vs. 882 V100 GPU hours.
2. Related Work
Attention Mechanisms. Modern attention mechanisms can be summarized as the query, key, value (Q-K-V) operation that also generalizes to self-attention [65, 70], which can even be applied in pure vision tasks such as visual recognition and generation [11, 12].
As the attention weight is unsupervised, it is easily misled by the confounders hidden in the dataset.
We exploit causal inference to propose a novel CATT module to mitigate the confounding effect [47, 44].
As our proposed CATT complies with the Q-K-V convention, it has great potential in any model that uses attention.
Vision-Language Pre-Training.
Inspired by the success of large-scale pre-training for language modeling [16, 50], researchers have developed some multi-modal Transformerbased Vision-Language Pre-training (VLP) models to learn task-agnostic visiolinguistic representations [35, 61, 31, 14, 85, 32, 30].
To discover the visiolinguistic relations across domains, a huge amount of data [57, 13, 28] are required
for VLP. However, just as the language pre-training models tend to learn or even amplify the dataset bias [29, 40], these VLP models may also overplay the spurious correlation.
To tackle the sampling challenge in the front-door adjustment, we propose two effective approximations called In-Sample Sampling and Cross-Sample Sampling.
3.1. Attention in the Front-Door Causal Graph
We retrospect the attention mechanism in a front-door causal graph [47, 44], as shown in the bottom part of Figure 1, where the causal effect is passed from the input set X to the target Y through a mediator Z.
By this graph, we can split the attention mechanism into two parts: a selector which selects suitable knowledge Z from X and a predictor which exploits Z to predict Y .
Take VQA as an example: X is a multi-modality set containing an image and a question; the attention system then chooses a few regions from the image based on the question to predict the answer.
We usually use the observational correlation P(Y|X) as the target to train an attention-based model:
P(Y|X) = \underbrace{\sum_{z} P(Z=z|X)}_{\text{IS-Sampling}} P(Y|Z=z),   (1)
where z denotes the selected knowledge and IS-Sampling denotes In-Sample Sampling, since z comes from the current input sample X.
However, as discussed in the Introduction, since the selection is an unsupervised process, the predictor may be misled by the dataset bias when trained by Eq. (1).
In causal terms, this means that the predictor may learn the spurious correlation brought by the backdoor path Z ← X ↔ Y 1 instead of the true causal effect Z → Y , and thus the conventional attention mechanism is not a proper way of calculating the causal effect.
To eliminate the spurious correlation brought by the hidden confounders, we should block the backdoor path between Z and Y: Z ← X ↔ Y.
1For convenience, we simplify the notation of the backdoor path X ← C → M → Y shown in Figure 3 to X ↔ Y.
In this way, we can estimate the true causal effect between Z and Y , which is denoted as P (Y |do(Z)), where do(·) denotes the interventional operation [44].
We can cut off the link X → Z to block this backdoor path by stratifying the input variable X into different cases {x} and then measuring the average causal effect of Z on Y by the following expectation [46]:
P(Y|do(Z)) = \underbrace{\sum_{x} P(X=x)}_{\text{CS-Sampling}} P(Y|X=x, Z),   (2)
where x denotes one possible input case.
Here we denote it as Cross-Sample Sampling (CS-Sampling) since it comes from the other samples.
Intuitively, CS-Sampling approximates the “physical intervention” which can break the spurious correlation caused by the hidden confounder.
For example, the annotation “man-with-snowboard” is dominant in captioning dataset [19] and thus the predictor may learn the spurious correlation between the snowboard region with the word “man” without looking at the person region to reason what actually the gender is.
CS-Sampling alleviates such spurious correlation by combining the person region with the other objects from other samples, e.g., bike, mirror, or brush, and inputting the combinations to the predictor.
Then the predictor will not always see “man-withsnowboard” but see “man” with the other distinctive objects and thus it will be forced to infer the word “man” from the person region.
By replacing P(Y|z) in Eq. (1) with P(Y|do(Z)) in Eq. (2), we can calculate the true causal effect between X and Y:
P(Y|do(X)) = \underbrace{\sum_{z} P(Z=z|X)}_{\text{IS-Sampling}} \underbrace{\sum_{x} P(X=x)}_{\text{CS-Sampling}} [P(Y|Z=z, X=x)].   (3)
This is also called the front-door adjustment, which is a fundamental causal inference technique for deconfounding the unobserved confounder [43].
Since our novel attention module is designed by using Eq (3) as the training target, we name our attention module as Causal Attention (CATT).
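As a sanity check of Eqs. (1)–(3), the toy NumPy sketch below contrasts the observational correlation P(Y|X) with the front-door estimate P(Y|do(X)); all probability tables are made-up values for illustration only.

```python
import numpy as np

# Made-up discrete tables: two input cases, two mediator values, two labels.
P_x = np.array([0.7, 0.3])                 # P(X=x)
P_z_given_x = np.array([[0.9, 0.1],        # P(Z=z|X=x), rows: x, columns: z
                        [0.2, 0.8]])
P_y_given_zx = np.array([[[0.8, 0.2],      # P(Y=y|Z=z, X=x), indexed [x, z, y]
                          [0.3, 0.7]],
                         [[0.6, 0.4],
                          [0.1, 0.9]]])

def observational(x):
    """Observational correlation P(Y|X=x), which still carries the confounding effect."""
    return sum(P_z_given_x[x, z] * P_y_given_zx[x, z] for z in range(2))

def front_door(x):
    """Eq. (3): sum_z P(z|x) [IS-Sampling] times sum_x' P(x') P(Y|z, x') [CS-Sampling]."""
    p_y = np.zeros(2)
    for z in range(2):
        cs = sum(P_x[xp] * P_y_given_zx[xp, z] for xp in range(2))  # CS-Sampling over x'
        p_y += P_z_given_x[x, z] * cs                               # IS-Sampling over z
    return p_y

print(observational(0), front_door(0))     # the two distributions differ in general
```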
3.2. In-Sample and Cross-Sample Attentions
To implement our causal attention (Eq. (3)) in a deep framework, we can parameterize the predictive distribution P(Y|Z, X) as a network g(·) followed by a softmax layer, since most vision-language tasks are transformed into classification formulations [68, 4]:
P(Y|Z, X) = Softmax[g(Z, X)].   (4)
As can be seen in Eq. (3), we need to sample X and Z and feed them into the network to compute P(Y|do(X)).
However, the cost of the network forward pass for all of these samples is prohibitively expensive.
To address this challenge, we apply the Normalized Weighted Geometric Mean (NWGM) approximation [71, 58] to absorb the outer sampling into the feature level, and thus we only need to forward the "absorbed input" through the network once:
P(Y|do(X)) ≈ Softmax[g(\hat{Z}, \hat{X})],   IS-Sampling: \hat{Z} = \sum_{z} P(Z=z|h(X))\, z,   CS-Sampling: \hat{X} = \sum_{x} P(X=x|f(X))\, x,   (5)
where h(·) and f(·) denote query embedding functions which can transform the input X into two query sets.
Both of them can be parameterized as networks.
Note that in a network, the variables X and Z are represented by embedding vectors, e.g., an image region becomes an RoI representation, so we use bold symbols to signify these embedding vectors, e.g., z, x denote the embedding vectors of z, x.
ˆZ and ˆX denote the estimations of IS-Sampling and CS-Sampling, respectively, which can be packed into matrix form [65].
The derivation details of Eq (5) are given in the supplementary material.
Actually, the IS-Sampling estimation ˆZ is what a classic attention network calculates, which can be briefly expressed by the Q-K-V operation as the blue block in Figure 4:
IS-ATT: A_I = Softmax(Q_I K_I^T),  \hat{Z} = A_I V_I.   (6)
In this case, all of K_I and V_I come from the current input sample, e.g., the RoI feature set.
QI comes from h(X); e.g., in top-down attention, the query vector qI is the embedding of the sentence context, and in self-attention, the query set QI is also the RoI feature set.
For AI, each attention vector aI is the network estimation of the IS-Sampling probability P (Z = z|h(X)) and the output ˆZ is the estimated vector set of IS-Sampling in Eq (5).
Inspired by Eq. (6), we can also deploy a Q-K-V operation to estimate ˆX and name it Cross-Sample Attention (CS-ATT), which is the red block in Figure 4:
CS-ATT: A_C = Softmax(Q_C K_C^T),  \hat{X} = A_C V_C,   (7)
where K_C and V_C come from a global dictionary compiled from the other training samples. In this way, V_C and V_I stay in the same representation space, which guarantees that the estimations of IS-Sampling and CS-Sampling, ˆZ and ˆX in Eq. (5), also have the same distribution.
To sum up, as shown in Figure 4, our single causal attention module estimates ˆZ and ˆX by IS-ATT in Eq. (6) and CS-ATT in Eq. (7), respectively. After that, we concatenate the outputs for estimating P(Y|do(X)) as in Eq. (5).
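A minimal PyTorch sketch of one CATT block in the spirit of Eqs. (5)–(7) is given below. The single-head formulation, the learnable global dictionary, and all layer/shape names are simplifying assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalAttention(nn.Module):
    """Single-head CATT: IS-ATT over the current sample, CS-ATT over a global dictionary."""
    def __init__(self, dim, dict_size=500):
        super().__init__()
        self.q_is = nn.Linear(dim, dim)   # h(.): queries for IS-ATT
        self.q_cs = nn.Linear(dim, dim)   # f(.): queries for CS-ATT
        self.key = nn.Linear(dim, dim)    # shared key projection (an assumption)
        # Global dictionary standing in for "other samples"; K-means initialization is also possible.
        self.global_dict = nn.Parameter(torch.randn(dict_size, dim) * 0.02)

    def forward(self, x):                 # x: (n, dim) features of the current sample
        # IS-ATT (Eq. 6): keys/values from the current sample itself.
        a_is = F.softmax(self.q_is(x) @ self.key(x).t(), dim=-1)
        z_hat = a_is @ x                  # estimated IS-Sampling
        # CS-ATT (Eq. 7): keys/values from the global dictionary.
        d = self.global_dict
        a_cs = F.softmax(self.q_cs(x) @ self.key(d).t(), dim=-1)
        x_hat = a_cs @ d                  # estimated CS-Sampling
        # Eq. (5): concatenate both estimations for the downstream predictor g(.).
        return torch.cat([z_hat, x_hat], dim=-1)

roi_feats = torch.randn(36, 512)          # e.g., 36 RoI features of one image
out = CausalAttention(512)(roi_feats)     # shape (36, 1024)
```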
3.3. CATT in Stacked Attention Networks
In practice, attention modules can be stacked as deep networks, e.g., the classic Transformer [65] or BERT architectures [16].
Our CATT can also be incorporated into these stacked attention networks and we experiment with Transformer [65] and LXMERT [61] in this paper.
We briefly introduce their architectures here and discuss the implementation details in Section 4.2.
Generally, our CATT replaces the first attention layer of these architectures to get the estimations of IS-Sampling ˆZ and CS-Sampling ˆX, and then we input them into more attention layers for further embedding, as shown in Figure 4.
For convenience, in these stacked attention networks, we still use IS-ATT and CS-ATT as the names of the attention modules to signify that an attention layer is dealing with the representations of IS-Sampling or CS-Sampling. In our implementations, both the encoder and decoder contain 6 blue and purple blocks.
The inputs of the encoder include the embedding set of the current image and a global image embedding dictionary.
The IS-ATT and CS-ATT outputs of the encoder are input into the decoder for learning visiolinguistic representations.
For the decoder, the inputs of the first IS-ATT and CS-ATT are respectively the current sentence embedding set and a global sentence embedding dictionary.
The outputs of the decoder include two parts, which respectively correspond to IS-Sampling ˆZ and CS-Sampling ˆX and which will be concatenated and input into the final predictor. Importantly, by stacking many CATT layers, the estimated ˆZ and ˆX may not stay in the same representation space due to the non-convex operations in each attention module, e.g., the position-wise feed-forward networks [65]. To avoid this, we share the parameters of IS-ATT and CS-ATT in each CATT, and then their outputs will always stay in the same representation space; the detailed formulations are given in Eq. (8). As a result, the additional attention computation of CATT in LXMERT is O(K·n) at the first layer and O(n·n) at the other layers, where K is the size of the global dictionary and n is the length of the word/image sequence.
Figure 6 demonstrates the architecture of our LXMERT+CATT, which contains three parts: a vision encoder with 5 self-CATT modules, a language encoder with 9 self-CATT modules, and a visiolinguistic decoder with 5 blocks, where each block contains two cross-modality CATT (CM-CATT) and two self-CATT modules.
Figure 6(b) sketches one cross-modality module used in the top part of the decoder in (c), where the visual signals are used as the queries in both IS-ATT and CS-ATT.
Similar to the original LXMERT [61], we concatenate the outputs of both the vision and language streams and input them into various predictors for solving different vision-language tasks.
In implementations, we share the parameters of IS-ATT and CS-ATT in each causal attention module to force their outputs to have the same distributions.
4. Experiments
We validated our Causal Attention (CATT) in three architectures for various vision-language tasks: Bottom-Up Top-Down (BUTD) LSTM [3] for Image Captioning (IC) [13, 38] and Visual Question Answering (VQA) [4], Transformer [65] for IC and VQA, and a large-scale vision-language pre-training (VLP) framework, LXMERT [61], for VQA, Graph Question Answering (GQA) [22], and Natural Language for Visual Reasoning (NLVR) [59].
4.1. Datasets
MS COCO [13] has 123,287 images, and each image is assigned 5 captions. This dataset has two popular splits: the Karpathy split [24] and the official test split, which divide the whole dataset into 113,287/5,000/5,000 and 82,783/40,504/40,775 images for training/validation/test, respectively.
There are 80k/40k training/validation images available offline.
We exploited the training set to train our
BUTD and Transformer based VQA systems, and then evaluated the performances on three different splits: offline validation, online test-development, and online test-standard.
We followed LXMERT [61] to collect a large-scale vision-language pre-training dataset from the training and development sets of MS COCO, VQA2.0, GQA [22], and Visual Genome [28].
4.2. Implementation Details
In LXMERT+CATT, each IS-ATT and CS-ATT module follows the multi-head Q-K-V operation:
Input: Q, K, V,
Prob: A_i = Softmax( Q W_i^Q (K W_i^K)^T / \sqrt{d} ),
Single-Head: H_i = A_i V W_i^V,
Output: \hat{V} = Embed([H_1, ..., H_{12}] W^H),   (8)
where W_i^* and W^H are all trainable matrices; A_i is the soft attention matrix for the i-th head; [·] denotes the concatenation operation; and Embed(·) means the feed-forward network and the residual operation as in [65].
The hidden size was set to 768.
Importantly, we shared the parameters between IS-ATT and CS-ATT in each CATT to make the outputs stay in the same representation space.
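The parameter sharing described here can be sketched as follows: one multi-head attention module (12 heads, hidden size 768, mirroring Eq. (8)) is reused for both the IS-ATT and the CS-ATT calls, so the two outputs stay in the same representation space. PyTorch's nn.MultiheadAttention is used as a stand-in for the paper's own attention code.

```python
import torch
import torch.nn as nn

shared_attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)

def catt_layer(x, global_dict):
    """x: (B, n, 768) current-sample features; global_dict: (B, K, 768) dictionary entries."""
    z_hat, _ = shared_attn(query=x, key=x, value=x)                      # IS-ATT
    x_hat, _ = shared_attn(query=x, key=global_dict, value=global_dict)  # CS-ATT, same weights
    return z_hat, x_hat

x = torch.randn(2, 36, 768)
dictionary = torch.randn(2, 1000, 768)
z_hat, x_hat = catt_layer(x, dictionary)
```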
We followed the original LXMERT [61] to pre-train our LXMERT+CATT architecture by four tasks: masked cross-modality language modeling, masked object prediction, cross-modality image sentence matching, and image question answering.
We pre-trained the model 20 epochs on 4 GTX 1080 Ti with a batch size of 192.
The pre-training cost 10 days.
To fairly compare pre-training GPU hours with UNITER [14], we also carried out an experiment on 4 V100 GPUs with a batch size of 256, which cost 6.5 days of pre-training.
Table 2. Bias measurements of the generated captions: the accuracies of gender (A@Gen↑), attribute (A@Attr↑), and action (A@Act↑) words. Red numbers denote the improvements after using our CATT modules.
Models | A@Gen↑ | A@Attr↑ | A@Act↑
BUTD | 77% | 41% | 52%
BUTD+CATT | 85% (+8%) | 51% (+10%) | 60% (+8%)
Transformer | 82% | 47% | 55%
Transformer+CATT | 92% (+10%) | 56% (+9%) | 64% (+9%)
For fine-tuning, the number of epochs was 4, and the learning rates were set to 5e−5, 1e−6, and 5e−5, respectively.
4.3.1 Image Captioning (IC)
The results are reported in Table 1, where the top and bottom parts list various models which respectively deploy LSTM and Transformer as the backbones.
In this table, B, M, R, C, and S denote BLEU [42], METEOR[6], ROUGE [33], CIDEr-D [66], and SPICE [2], respectively, which evaluate the similarities between the generated and the ground-truth captions.
Compared with two baselines BUTD and Transformer, we can find that BUTD+CATT and Transformer+CATT respectively achieve 3.0-point and 3.2-point improvements on CIDEr-D. More importantly, after incorporating our CATT modules into BUTD and Transformer, they have higher CIDEr-D scores than certain state-of-the-art captioners which deploy more complex techniques.
For example, SGAE exploits scene graphs to transfer language inductive bias or M2Transformer learns multi-level visual relations though additional meshed memory networks.
Bias Measurements. We measured the bias degree of the generated captions in Table 2 to validate whether our CATT module can alleviate the dataset bias. Apart from them, we also analyze three more specific biases: gender bias, action bias, and attribute bias, by calculating the accuracy of the corresponding words, which are denoted as A@Gen, A@Attr, and A@Act.
4.3.2 Visual Question Answering (VQA)
The top and bottom parts of Table 3 respectively report the performances of various LSTM and Transformer based VQA models, where loc-val, test-dev, and test-std denote the offline local validation, online test-development, and online test-standard splits.
More importantly, the deconfounded BUTD and Transformer outperform certain state-of-the-art models which are better than the original BUTD and Transformer.
Table 4 reports the accuracies of different question types on test-std split.
It can be found that the accuracy of Number questions has the largest improvement after using CATT modules, i.e., 4.75 points and 2.75 points for BUTD and Transformer, respectively.
As analyzed in [84], the counting ability depends heavily on the quality of the attention mechanism that a VQA model cannot correctly answer number questions without attending to all the queried objects.
The second row of Figure 7 shows that after incorporating CATT, BUTD and Transformer based VQA models can attend to the right regions for answering the questions.
4.3.3 Vision-Language Pre-training (VLP)
Table 5 shows the training burdens and the performances of various large-scale pre-training models on VQA2.0, GQA, and NLVR2.
For ERNIE-VIL [78] and UNITER [14], they both have a BASE version and a LARGE version where BASE (LARGE) uses 12 (16) heads and 768 (1024) hidden units in multi-head product operations.
We report the performances of their BASE versions since our model used 12 heads and 768 hidden units.
For NLVR2, we report the performances of UNITER with the same Pair setting as our model.2 From this table, we can see that compared with LXMERT†, our LXMERT†+CATT respectively achieves 0.86, 1.23, 1.6-point improvements on the test-std splits of VQA2.0 and GQA and the test-P split of NLVR2.
For example, compared with UNITER, which uses fp16, our LXMERT†+CATT uses fewer GPU hours and less pre-training data, while achieving higher performances.
2The details of the NLVR2 setting can be found in Table 5 of UNITER [14].
For BUTD, we show the region with the highest attention weight.
For Transformer and VLP, the red region has the highest attention weight in top-down attention and the green region is the one most related to the red region in self-attention.
From the results, we can see that after incorporating our CATT module, both BUTD and Transformer generate less biased captions, e.g., the accuracies of gender, attribute, and action words are respectively improved by 10%, 9%, and 9% when CATT is used in Transformer. The first row of Figure 7 shows two examples where BUTD and Transformer respectively attend to unsuitable regions and then generate incorrect captions, e.g., BUTD attends to the remote region and infers the word "man" due to the dataset bias, while our CATT corrects this by attending to the right region.
Specifically, we extracted 64 RoI features from each image to guarantee that our model can be trained on 4 1080 Ti GPUs.
It can be found that after using two insights from UNITER, our LXMERT+CATT↑ can achieve higher performances than UNITER, even though we do not extract 100 RoI features for each image as they do.
These comparisons suggest that our CATT has great potential in large-scale VLP.
Also, as shown in Table 4, after incorporating CATT into LXMERT, we can observe that the accuracy of Number is further improved: 55.48 vs. 52.63, which suggests that our CATT improves the quality of the attention modules in VLP models.
We did not use the K-means algorithm to initialize the global dictionaries but randomly initialized them.
We shared the parameters between IS-ATT and CS-ATT in these models.
CATT w/o Share: We did not share the parameters between IS-ATT and CS-ATT.
Here we used the K-means algorithm to initialize the dictionaries.
CATT+D#K: We set the size of the global image and word embedding dictionaries to K by the K-means algorithm and shared the parameters between IS-ATT and CS-ATT.
Firstly, we can observe that after using our CATT architecture, even without K-means initialization or parameter sharing, the performances are better than Base models.
For example, in LXMERT+CATT, after using K-means and sharing the parameters, the performances of VQA are respectively boosted: 70.40 vs. 69.81 and 70.40 vs. 70.05.
Such observation suggests that both strategies encourage the estimated IS-Sampling and CS-Sampling to stay in the same representation space, which is indeed beneficial in improving the performances.
5. Conclusion
In this paper, we exploited causal inference to analyze why the attention mechanism is easily misled by the dataset bias and then attends to unsuitable regions.
We discovered that the attention mechanism is an improper approximation of the front-door adjustment and thus fails to capture the true causal effect between the input and target.
Then a novel attention mechanism, Causal Attention (CATT), was proposed based on the front-door adjustment, which can improve the quality of the attention mechanism by alleviating the ever-elusive confounding effect. Specifically, CATT contains In-Sample and Cross-Sample attentions to estimate the In-Sample and Cross-Sample samplings in the front-door adjustment, and both attention networks abide by the Q-K-V operations.
We implemented CATT into various popular attention-based vision-language models and the experimental results demonstrate that it can improve these models by considerable margins.
For example, C is the cause of both X and Y; thus it is a confounder which will induce a spurious correlation between X and Y and disturb the recognition of the causal effect between them.
In particular, such spurious correlation is brought by the backdoor path created by the confounder.
Formally, a backdoor path between X and Y is defined as any path from X to Y that starts with an arrow pointing into X.
For example, in Figure 8(a), the path X ← C → Y is a backdoor path.
Here we use another two examples to help understand this concept: in Figure 8(b), X ← C → Y ← Z and Z ← X ← C → Y are two backdoor paths, between X and Z and between Z and Y, respectively.
In an SCM, if we want to deconfound two variables X and Y to calculate the true causal effect, we should block every backdoor path between them [47].
For example, in Figure 8(a), we should block X ← C → Y to get the causal effect between X and Y .
6.2. Blocking Three Junctions
In an SCM, there are three elemental "junctions" which construct the whole graph, and we have some basic rules to block them.
This is called confounding junction which induces spurious correlation between X and Y , as shown in Figure 8(a).
In this junction, once we know what the value of C is or directly intervene it to a specific value, there is no spurious correlation between X and Y and thus we block this junction.
Once we know what the value of Y is, Z and C are correlated.
However, if we do not know what Y is or do not intervene it, Z and C are independent and this junction is naturally blocked.
To sum up, if we want to block a path between two variables, we should intervene the middle variables in the chain and confounding junctions and should not intervene in the collider junction.
To block a long path, we only need to block one junction of it; e.g., for X ← C → Y ← Z in Figure 8(b), we can block X ← C → Y by intervening on C, or block C → Y ← Z by not intervening on Y.
6.3. The Backdoor Adjustment
The backdoor adjustment is the simplest formula to eliminate the spurious correlation by approximating the "physical intervention".
Formally, it calculates the average causal effect of one variable on another at each stratum of the confounder.
For example, in Figure 8(a), we can calculate the causal effect of X on Y as P(Y|do(X)):
P(Y|do(X)) = \sum_{c} P(Y|X, C=c)\, P(C=c),   (9)
where do(·) signifies that we are dealing with an active intervention rather than a passive observation. The role of Eq. (9) is to guarantee that in each stratum c, X is not affected by C, and thus the causal effect can be estimated stratum by stratum from the data.
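For completeness, a toy NumPy computation of the backdoor adjustment in Eq. (9) is shown below; the confounder strata and the probability tables are made up purely for illustration.

```python
import numpy as np

P_c = np.array([0.6, 0.4])                 # P(C=c): two strata of the confounder
P_y_given_xc = np.array([[[0.9, 0.1],      # P(Y=y|X=x, C=c), indexed [x, c, y]
                          [0.5, 0.5]],
                         [[0.7, 0.3],
                          [0.2, 0.8]]])

def backdoor(x):
    """Eq. (9): P(Y|do(X=x)) = sum_c P(Y|X=x, C=c) P(C=c)."""
    return sum(P_c[c] * P_y_given_xc[x, c] for c in range(2))

print(backdoor(0), backdoor(1))            # interventional distributions over Y
```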
6.4. The Front-door Adjustment
From Eq. (9), we find that to use the backdoor adjustment, we need to know the details of the confounder for splitting it into various strata.
However, in our case, we have no idea about what constructs the hidden confounders in the dataset, thus we are unable to deploy the backdoor adjustment.
Fortunately, the front-door adjustment [43] does not require any knowledge on the confounder and can also calculate the causal effect between X and Y in a front-door SCM as in Figure 8(b).
In Section 3.1 of the submitted manuscript, we have shown the derivation of the front-door adjustment from the attention mechanism perspective.
Here we demonstrate a more formal derivation.
The front-door adjustment calculates P(Y|do(X)) in the front-door graph X → Z → Y by chaining together two partial causal effects P(Z|do(X)) and P(Y|do(Z)):
P(Y|do(X)) = \sum_{z} P(Z=z|do(X))\, P(Y|do(Z=z)).   (10)
Figure 8. (a) Backdoor model. (b) Front-door model.
To calculate P(Z=z|do(X)), we should block the backdoor path X ← C → Y ← Z between X and Z.
As we discussed in Section 6.2, a collider junction is naturally blocked, and here C → Y ← Z is a collider; thus this path is already blocked and we have:
P(Z=z|do(X)) = P(Z=z|X).   (11)
For P(Y|do(Z)), we need to block the backdoor path Z ← X ← C → Y between Z and Y.
Since we do not know the details of the confounder C, we cannot use Eq. (9) to deconfound C. Thus we have to block this path by intervening on X:
P(Y|do(Z=z)) = \sum_{x} P(Y|Z=z, X=x)\, P(X=x).   (12)
At last, by bringing Eq. (11) and Eq. (12) into Eq. (10), we have:
P(Y|do(X)) = \sum_{z} P(Z=z|X) \sum_{x} P(X=x)\, [P(Y|Z=z, X=x)],   (13)
which is the front-door adjustment given in Eq. (3) of the submitted manuscript.
7. Formula Derivations
Here we show how to use the Normalized Weighted Geometric Mean (NWGM) approximation [71, 58] to absorb the sampling into the network for deriving Eq. (5) of the submitted manuscript. Before introducing NWGM, we first revisit the calculation of a function y(x)'s expectation according to the distribution P(x):
E_x[y(x)] = \sum_{x} y(x)\, P(x),   (14)
which is the weighted arithmetic mean of y(x) with P(x) as the weights. In contrast, the weighted geometric mean (WGM) is:
WGM(y(x)) = \prod_{x} y(x)^{P(x)},   (15)
where the weights P(x) are put into the exponential terms. If y(x) is an exponential function such that y(x) = exp[g(x)], we have:
WGM(y(x)) = \prod_{x} exp[g(x)]^{P(x)} = exp[\sum_{x} g(x)\, P(x)] = exp{E_x[g(x)]},   (16)
where the expectation E_x is absorbed into the exponential term. Based on this observation, researchers approximate the expectation of a function as the WGM of that function in a deep network whose last layer is a Softmax layer [71, 58]:
E_x[y(x)] ≈ WGM(y(x)).   (17)
In our case, we treat P(Y|X, Z) (in Eq. (3) of the submitted manuscript) as the predictive function and parameterize it by a network with a Softmax layer as the last layer:
P(Y|X, Z) = Softmax[g(X, Z)] ∝ exp[g(X, Z)].   (18)
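The absorption of the expectation into the exponent, WGM(exp[g(x)]) = exp{E_x[g(x)]}, can be verified numerically with the small NumPy sketch below; the distribution and the g values are arbitrary.

```python
import numpy as np

P = np.array([0.2, 0.5, 0.3])   # an arbitrary distribution P(x)
g = np.array([1.0, -0.4, 2.3])  # arbitrary values of g(x)
y = np.exp(g)                   # y(x) = exp[g(x)]

wgm = np.prod(y ** P)                            # WGM(y) = prod_x y(x)^{P(x)}
assert np.isclose(wgm, np.exp(np.sum(g * P)))    # equals exp(E_x[g(x)])
```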
Following Eq. (3) of the manuscript and Eq. (17), we have:
P(Y|do(X)) = \sum_{z} \sum_{x} P(Z=z|X)\, P(X=x)\, [P(Y|Z=z, X=x)] = E_{[Z|X]} E_{[X]}[P(Y|Z, X)] ≈ WGM(P(Y|Z, X)) ≈ exp{g(E_{[Z|X]}[Z], E_{[X]}[X])}.   (19)
Note that since P(Y|Z, X) in Eq. (18) is only proportional to exp[g(Z, X)], rather than strictly equal to it, we only have WGM(P(Y|Z, X)) ≈ exp{g(E_{[Z|X]}[Z], E_{[X]}[X])} in Eq. (19) instead of an equality.
Furthermore, to guarantee that P(Y|do(X)) sums to 1, we use a Softmax layer to normalize these exponential units:
P(Y|do(X)) ≈ Softmax(g(E_{[Z|X]}[Z], E_{[X]}[X])),   (20)
where the first part E_{[Z|X]}[Z] is the In-Sample Sampling (IS-Sampling) estimation and the second part E_{[X]}[X] is the Cross-Sample Sampling (CS-Sampling) estimation.
Since the Softmax layer normalizes these exponential terms, this is called the normalized weighted geometric mean (NWGM) approximation.
In a network, the variables X and Z are represented by the embedding vectors and thus we use x and z to denote them.
Following the convention in attention research, where the attended vectors are usually represented in matrix form, we also pack the estimated IS-Sampling and CS-Sampling vectors into ˆZ and ˆX, respectively.
In this way, we have:
P(Y|do(X)) ≈ Softmax[g(\hat{Z}, \hat{X})],   (21)
which is given in Eq. (5) of the submitted manuscript.
To estimate ˆZ, researchers usually calculate a query set from X: QI = h(X) and use it in the Q-K-V operation.
Similarly, to estimate ˆX, we can also calculate a query set as: QC = f (X) and use it in the Q-K-V operation.
In this way, we have Eq. (5) of the submitted manuscript:
P(Y|do(X)) ≈ Softmax[g(\hat{Z}, \hat{X})],
IS-Sampling: \hat{Z} = \sum_{z} P(Z=z|h(X))\, z,   CS-Sampling: \hat{X} = \sum_{x} P(X=x|f(X))\, x.   (22)
Note that although P(X) in CS-Sampling does not condition on any variable, we still require a query in the Q-K-V operation, since without a query the estimated result would degrade into a fixed single vector for every different input X: \hat{x} = \sum_{x} P(x)\, x, where P(x) is the prior probability.
We can also treat this as a strategy to increase the representation power of the whole model.
The first two rows show six examples of image captioning and the last two rows show examples of VQA; red and blue index the incorrect and correct generated captions and answers, respectively. For example, in the left example of the first row, after incorporating the CATT module, BUTD [3] correctly generates the gender of the person without using the spurious correlation between "woman" and "kitchen" in the dataset.
In BUTD+CATT, each IS-ATT and CS-ATT uses a single-head additive attention:
a_i ∝ exp(w^T tanh(W_k k_i + W_q q)),  \hat{v} = \sum_{i} a_i v_i,   (23)
where w is a trainable vector and W_k, W_q are two trainable matrices.
VI and KI were both set to the RoI feature set of the current image, and qI was the embedding of the sentence context, e.g., the partially generated caption or the question for IC or VQA, respectively.
This dictionary was initialized by applying K-means over all the RoI features in the training set to get 1000 cluster centres and was updated during the end-to-end training.
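A minimal sketch of this dictionary initialization, using scikit-learn's KMeans over pre-extracted RoI features; the file name and array shapes are hypothetical placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

# All RoI features from the training set, shape (num_regions, feat_dim); path is hypothetical.
roi_feats = np.load("train_roi_features.npy")

kmeans = KMeans(n_clusters=1000, n_init=10, random_state=0).fit(roi_feats)
global_image_dict = kmeans.cluster_centers_   # (1000, feat_dim); later fine-tuned end-to-end
```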
For the VQA model, we followed [63, 3] to use the binary cross-entropy loss and applied the AdaDelta optimizer [82], which does not require a fixed learning rate, to train it for 30 epochs.
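A sketch of this training setup, assuming a toy stand-in model and random tensors in place of the real VQA data loader:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 1000)                         # toy answer scorer: fused feature -> answer logits
criterion = nn.BCEWithLogitsLoss()                   # binary cross-entropy over soft answer labels
optimizer = torch.optim.Adadelta(model.parameters()) # AdaDelta needs no hand-tuned learning rate

for epoch in range(30):                              # 30 epochs, as described above
    fused_feat = torch.randn(64, 512)                # placeholder for the attended multimodal feature
    soft_labels = torch.rand(64, 1000)               # placeholder soft answer scores in [0, 1]
    optimizer.zero_grad()
    loss = criterion(model(fused_feat), soft_labels)
    loss.backward()
    optimizer.step()
```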
In Transformer+CATT, each attention module follows the multi-head Q-K-V operation:
Input: Q, K, V,
Prob: A_i = Softmax( Q W_i^Q (K W_i^K)^T / \sqrt{d} ),
Single-Head: H_i = A_i V W_i^V,
Output: \hat{V} = Embed([H_1, ..., H_8] W^H),   (24)
where W_i^* and W^H are all trainable matrices; A_i is the soft attention matrix for the i-th head; [·] denotes the concatenation operation; and Embed(·) means the feed-forward network and the residual operation as in [65].
We shared the parameters between IS-ATT and CS-ATT in each CATT to keep the outputs staying in the same feature space.
Then compared with the original Transformer, the increments of the trainable parameters only come from the global image and word embedding dictionaries, which were initialized by applying K-means over the RoI and word embeddings of the training set.
We set the sizes of both dictionaries to 500 and the hidden size of all the attention modules to 512.
The RoI object features were the same as in BUTD+CATT.
For IC, the training included two processes: we first used the cross-entropy loss and then the self-critical reward to train the captioner 15 and 35 epochs, respectively.
We followed [79] to set the learning rate to min(2.5t × 10−5, 1 × 10−4), where t is the training epoch; after 10 epochs, the learning rate decayed by 0.2 every 2 epochs.
The batch size was set to 64.
バッチサイズは64に設定された。
0.71
References
[1] Ehsan Abbasnejad, Damien Teney, Amin Parvaneh, Javen Shi, and Anton van den Hengel. Counterfactual vision and language learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10044–10054, 2020.
[2] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, 2016.
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[6] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
[7] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[8] Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. MUTAN: Multimodal Tucker fusion for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2612–2620, 2017.
[9] Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912, 2019.
[10] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357, 2016.
[11] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872, 2020.
[12] Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[13] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[14] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, 2020.
[24] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
[25] Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. arXiv preprint arXiv:1610.04325, 2016.
[26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[27] Murat Kocaoglu, Christopher Snyder, Alexandros G Dimakis, and Sriram Vishwanath. CausalGAN: Learning causal implicit generative models with adversarial training. arXiv preprint arXiv:1709.02023, 2017.
[28] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[29] Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. Measuring bias in contextualized word representations. arXiv preprint arXiv:1906.07337, 2019.
[30] Chenliang Li, Ming Yan, Haiyang Xu, Fuli Luo, Wei Wang, Bin Bi, and Songfang Huang.
[32] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020.
[33] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004.
[39] Ruotian Luo, Brian Price, Scott Cohen, and Gregory Shakhnarovich. Discriminability objective for training descriptive captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6964–6974, 2018.
[40] Moin Nadeem, Anna Bethke, and Siva Reddy. StereoSet: Measuring stereotypical bias in pretrained language models.
[79] Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6281–6290, 2019.
[80] Zhongqi Yue, Tan Wang, Hanwang Zhang, Qianru Sun, and Xian-Sheng Hua.