As deep networks begin to be deployed as autonomous agents, the issue of how
they can communicate with each other becomes important. Here, we train two deep
nets from scratch to perform realistic referent identification through
unsupervised emergent communication. We show that the largely interpretable
emergent protocol allows the nets to successfully communicate even about object
types they did not see at training time. The visual representations induced as
a by-product of our training regime, moreover, show comparable quality, when
re-used as generic visual features, to a recent self-supervised learning model.
Our results provide concrete evidence of the viability of (interpretable)
emergent deep net communication in a more realistic scenario than previously
considered, as well as establishing an intriguing link between this field and
self-supervised visual learning.
1 Introduction
As deep networks become more effective at solving specialized tasks, there has been interest in letting them develop a language-like communication protocol so that they can flexibly interact to address joint tasks.
Such an ability would, for example, allow deep-net-controlled agents, such as self-driving cars, to inform each other about the presence and nature of potentially dangerous objects, besides being a basic requirement to support more advanced capabilities (e.g., denoting relations between objects).
We exploit this insight to develop a robust end-to-end variant of a communication game.
Our experiments confirm that, in our setup: i) the nets develop a set of discrete symbols allowing them to successfully discriminate objects in natural images, including novel ones that were not shown during training; ii) these symbols denote interpretable categories, so that their emergence constitutes a form of fully unsupervised image annotation; iii) the visual features induced as a by-product can be used as high-quality general-purpose representations, whose quality is comparable to that of features induced by a recent self-supervised learning model.
The typical setup is that of a referential, or discriminative, communication game.
In the simplest scenario, which we adopt here, one agent, the Sender, sees an input (the target) and sends a discrete symbol to another agent, the Receiver, which sees an array of items including the target and has to point to the latter for communication to be deemed successful.
In one of the earliest papers in this line of research, Lazaridou et al. [9] used images from ImageNet [17] as input to the discrimination game; Havrylov and Titov [11] used MSCOCO [18]; and Evtimova et al. [12] used animal images from Flickr.
While they used natural images, all these studies were limited to small sets of carefully selected object categories.
Lazaridou et al. and Choi et al. dispensed with pre-trained CNNs, but they used synthetically generated geometric shapes as inputs.
Results on the interpretability of symbols in games with realistic inputs have generally been mixed.
Indeed, Bouchacourt and Baroni [19] showed that, after training Lazaridou et al.'s [9] networks on real pictures, the networks could use the learned protocol to successfully communicate about blobs of Gaussian noise, suggesting that their code (also) denoted low-level image features, unlike the general semantic categories that words in human language refer to.
In part for this reason, recent work tends to focus on controlled symbolic inputs, where it is easier to detect degenerate communication strategies, rather than attempting to learn communication "in the wild" [e.g., 10, 15, 16].
The main conceptual differences are that there is no discrete bottleneck imposed on “communication” between the networks, and there is no asymmetry, so that both networks act simultaneously as Sender and Receiver (both networks produce a continuous “message” that must be as discriminative for the other network as possible).
First, we show that agents trained from scratch can successfully play the game with realistic images while evolving a more semantically interpretable protocol.
Second, we evaluate the discrimination game as a self-supervised feature extraction method.
We find that the visual features induced by the CNNs embedded in our agents are virtually as good as those induced by SimCLR, while the emergent protocol is better for communication than the one obtained by adapting SimCLR to the discrete communication setup.
Sender reads the target image through a convolutional module, followed by a one-layer network mapping the output of the CNN onto |V| dimensions and applying batch normalization [27], to obtain vector v. Following common practice when optimizing through discrete bottlenecks, we then compute the Gumbel-Softmax continuous relaxation [28, 29], which was shown to also be effective in the emergent communication setup.
At train time, Sender produces an approximation to a one-hot symbol vector, with each component given by
$$m_i = \frac{\exp\left[(s_i + v_i)/\tau\right]}{\sum_j \exp\left[(s_j + v_j)/\tau\right]},$$
where $s_i$ is a random sample from Gumbel(0,1) and $v_i$ is a dimension of v. The approximation is controlled by the temperature parameter $\tau$: as $\tau$ approaches 0, the approximation approaches a one-hot vector; as $\tau$ approaches $+\infty$, the relaxation becomes closer to uniform.
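For concreteness, here is a minimal PyTorch sketch of this relaxation (the function name and the straight-through option are ours; torch.nn.functional.gumbel_softmax provides an equivalent built-in):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_message(v: torch.Tensor, tau: float = 1.0, hard: bool = False) -> torch.Tensor:
    """Relaxed one-hot message m from Sender scores v (batch x |V|)."""
    # Gumbel(0, 1) samples: -log(-log(U)), U ~ Uniform(0, 1); clamps avoid log(0).
    u = torch.rand_like(v).clamp_min(1e-20)
    s = -torch.log((-torch.log(u)).clamp_min(1e-20))
    m = F.softmax((s + v) / tau, dim=-1)  # m_i = exp[(s_i + v_i)/tau] / sum_j exp[(s_j + v_j)/tau]
    if hard:
        # Straight-through variant: one-hot on the forward pass, soft gradients backward.
        one_hot = F.one_hot(m.argmax(dim=-1), num_classes=m.size(-1)).to(m.dtype)
        m = (one_hot - m).detach() + m
    return m
```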
As they are different agents that could (in future experiments) have very different architectures and interact with further agents, the most natural assumption is that each of them does visual processing with its own CNN.
Optimization. Optimization is performed end-to-end, and the error signal, backpropagated through Receiver and Sender, is computed with the cross-entropy cost function by comparing the Receiver's output with a one-hot vector representing the position of the target in the image list.
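In code, this amounts to a standard cross-entropy over the Receiver's per-candidate similarity scores; a minimal sketch (tensor names and shapes are our assumptions):

```python
import torch
import torch.nn.functional as F

def game_loss(symbol_emb: torch.Tensor, image_reprs: torch.Tensor,
              target_pos: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between Receiver's candidate scores and the target position.

    symbol_emb:  batch x d      Receiver's embedding of Sender's symbol
    image_reprs: batch x k x d  Receiver's representations of the k candidate images
    target_pos:  batch          index of the target among the candidates
    """
    scores = torch.einsum("bd,bkd->bk", symbol_emb, image_reprs)  # similarity per candidate
    return F.cross_entropy(scores, target_pos)
```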
SimCLR as a comparison model. Given the similarity between the referential communication game and contrastive self-supervised learning in SimCLR, we use the latter as a comparison point for our approach.
iii) Instead of (a probability distribution over) symbols, the exchanged information takes the form of continuous vectors (s in the figure).
iv) The loss is based on directly comparing embeddings of these continuous vectors (z in the figure), maximizing the similarity between pairs representing the same images (positive examples in contrastive-loss terminology) and minimizing that of pairs representing different images (negative examples).
This differs from our loss, which maximizes the similarity of the Receiver embedding of the Sender-produced discrete symbol with its own representation of the target image, while minimizing the similarity of the symbol embedding with its representation of the distractors.
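For comparison, a sketch of the SimCLR-style contrastive (NT-Xent) loss over continuous embeddings, where z1 and z2 hold the two views of each image in the batch (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent: each embedding must identify the other view of its own image
    among all 2N - 1 other embeddings in the batch."""
    z = F.normalize(torch.cat([z1, z2]), dim=-1)                    # 2N x d
    sim = z @ z.t() / temperature                                   # pairwise cosine similarities
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))                       # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```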
In standard contrastive learning frameworks, where all the weights are shared and there is no communication bottleneck, it is necessary to create these different views, or else the system would trivially succeed at the pretext contrastive task without any actual feature learning.
We conjecture that data augmentation, while not strictly needed, might also be beneficial in the communication game setup: presenting different views of the target to Sender and Receiver should make it harder for them to adopt degenerate strategies based on low-level image information.2 We follow the same data augmentation pipeline as SimCLR, stochastically applying crop-and-resize, color perturbation, and random Gaussian blurring to every image.
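A sketch of such a pipeline with torchvision (the crop size and perturbation strengths are illustrative, not our exact settings):

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),                                    # crop-and-resize
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),   # color perturbation
    T.RandomApply([T.GaussianBlur(kernel_size=23)], p=0.5),      # random Gaussian blur
    T.ToTensor(),
])
# Sender and Receiver each receive an independently augmented view of the target.
```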
Implementation details. All hidden and output layers are set to dimensionality 2048.3 Note that this implies |V| = 2048, more than double the number of categories in the dataset we use to train the model (see Section 3.2 below), to avoid implicit supervision on the optimal symbol count.
Rather than sampling distractors from the entire dataset, we take them from the current batch (hence, with a per-GPU batch size of 128, each target competes with 127 distractors).
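Putting these pieces together, a minimal sketch of the Sender module (the ResNet-50 backbone is our assumption, consistent with the 2048-dimensional features; its scores v feed the Gumbel-Softmax relaxation above):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Sender(nn.Module):
    """CNN -> linear map onto |V| dimensions -> batch norm, producing scores v."""
    def __init__(self, vocab_size: int = 2048):
        super().__init__()
        resnet = models.resnet50(weights=None)                    # trained from scratch
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])   # drop the classifier head
        self.to_scores = nn.Linear(2048, vocab_size)
        self.bn = nn.BatchNorm1d(vocab_size)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        h = self.cnn(image).flatten(1)      # batch x 2048
        return self.bn(self.to_scores(h))   # v, fed to the Gumbel-Softmax relaxation
```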
2Lazaridou and colleagues also considered a variant of the game in which the agents see different pictures of the same category (e.g., the shared target is dog, but the agents get different dog pictures).
This version of the game is however severely limited by the requirement of manual category annotation.
Lazaridou et al. [13] also provide different images to Sender and Receiver, by feeding them different viewpoints of the same synthetically generated objects: again, a strategy that will not scale up to natural images.
3This is the same size used in the original SimCLR paper, except for the nonlinear projection head.
For the latter, a number of sizes were tested, and the authors report that they do not impact final performance.
In particular, we randomly picked (and manually sanity-checked) 80 categories that were neither in ILSVRC-2012 nor hypernyms or hyponyms of ILSVRC-2012 categories (e.g., since hamster is in ILSVRC-2012, we avoided both rodent and golden hamster).
Examples of included categories are eucalyptus, amoeba, and drawer.4
Linear evaluation of visual features on downstream tasks. Following standard practice in self-supervised learning [e.g., 5, 20, 41], we evaluate the visual features induced by the CNN components of our models by training a linear object classifier on top of them.
We use four common datasets: ILSVRC-2012, Places205, iNaturalist2018 and VOC07.5 Evaluation is carried out with the VISSL toolkit,6 adopting the hyperparameters in its configuration files without changes.
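Concretely, linear evaluation freezes the trained encoder and fits only a linear classifier on its features; a minimal sketch (names are placeholders; the actual hyperparameters come from the VISSL configuration files):

```python
import torch.nn as nn

def linear_probe(encoder: nn.Module, num_classes: int) -> nn.Module:
    """Freeze the visual encoder; only the returned linear layer is trained."""
    for p in encoder.parameters():
        p.requires_grad = False
    return nn.Linear(2048, num_classes)  # 2048-d features, per the implementation details
```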
4 Experiments
4.1 Referential communication accuracy
We start by analyzing how well our models learn to refer to object-depicting images through a learned protocol.7 While some models use data augmentation at training time, we do not apply this transformation when testing the learned communication pipeline.
As a strong baseline, we let the trained SimCLR model play the referential game by argmax-ing its s layer (see Fig. 2 above) into a discrete "symbol" (SimCLRdisc).8 Accuracy is given by the proportion of times in which a system assigns the largest symbol-embedding/image-representation similarity to the target compared to 127 distractors (chance ≈ 0.8%).
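In code, accuracy reduces to an argmax over per-candidate similarities; a sketch (sims holds one row of 128 scores per game):

```python
import torch

def referential_accuracy(sims: torch.Tensor, target_pos: torch.Tensor) -> float:
    """sims: batch x 128 symbol-embedding/image-representation similarities
    (target + 127 distractors); chance level is 1/128, roughly 0.8%."""
    return (sims.argmax(dim=-1) == target_pos).float().mean().item()
```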
In summary, neural networks trained from scratch are able to communicate quite accurately through a discrete channel even in the challenging setup in which the target referent is mixed with more than one hundred distractors, and when it belongs to new categories not seen at training time.
Looking at model variants, sharing CNN weights or not makes little difference (an encouraging first step towards communication between widely differing agents, which will obviously not be able to share weights).
On the other hand, data augmentations harm performance.
However, it turns out that the better performance of the non-augmented models is due to an opaque communication strategy in which the agents are evidently referring to low-level aspects of images (perhaps, specific pixel intensity levels?).
WNSim is more nuanced than nMI, as it penalizes a Sender less for using the same symbol for similar categories, such as cats and dogs, than for using the same symbol for dissimilar ones, such as cats and skyscrapers.
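To illustrate the intuition, a WordNet-based similarity between two category names can be computed with NLTK (path_similarity is a stand-in here; the exact WNSim definition may differ):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def category_similarity(name_a: str, name_b: str) -> float:
    """Path similarity between the first noun synsets of two category names."""
    a = wn.synsets(name_a, pos=wn.NOUN)[0]
    b = wn.synsets(name_b, pos=wn.NOUN)[0]
    return a.path_similarity(b)

# e.g., category_similarity("cat", "dog") should exceed
#       category_similarity("cat", "skyscraper")
```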
Even when there is no data augmentation during training, using different visual modules (-augmentations -shared) leads to some protocol interpretability, consistently with the fact that this configuration was less able than its +shared counterpart to communicate about noise.
We were moreover surprised to find that simply argmaxing the SimCLR visual feature layer produces meaningful "symbols", which suggests that information might be more sparsely encoded by this model than one would naively assume.
Recall that, unlike our models, whose protocol emerges independently during discriminative game training, SimCLRkmeans runs a clustering algorithm on top of the representations produced by the SimCLR visual encoder with the express goal of discretizing them into coherent sets, thus constituting a hard competitor to beat.
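A sketch of this baseline with scikit-learn (the file name is a placeholder, and setting the cluster count to match |V| is our assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

features = np.load("simclr_features.npy")       # placeholder: N x d frozen SimCLR features
kmeans = KMeans(n_clusters=2048).fit(features)  # number of clusters assumed to match |V|
symbols = kmeans.labels_                        # one discrete "symbol" per image
```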
Fig. 3 shows a random set of such images for the 9 symbols most frequently produced by the +augmentations +shared Sender in ILSVRC-val, without any hand-picking.10 Some symbols denote intuitive categories, although, interestingly, ones that do not correspond to specific English words (birds on branches, dogs indoors...). Other sets are harder to characterize, but they still share a clear high-level "family resemblance" (Symbol 2: objects that glow in the dark; Symbol 3: human artifacts with simple flat shapes, etc.).
Frequency imbalance in input categories, together with the fact that the agents are allowed to use a large number of symbols, leads to partially overlapping categories (Symbol 9 might denote living things in the grass, whereas Symbol 4 seems to specifically refer to mammals in the grass).
We focus on Sender because the features produced by the two agent networks are always highly correlated.11 We further exclude the -augmentations models given that, having learned a degenerate strategy, they show extremely poor performance on ILSVRC-val (below 5% accuracy).
11Across setups and datasets, the Sender/Receiver correlation between all pairwise visual representation similarities was never below 0.96.
12It is difficult to compare our SimCLR ILSVRC-val performance precisely to that reported in the original paper since, consistently with the communication game setup, we use a per-GPU batch size of 128 without sharing negatives across GPUs.
Many ideas from the self-supervised literature (e.g., new data augmentation pipelines, the use of memory banks for distractor sampling, or variants of the similarity-based pretext task) could straightforwardly be integrated into our setup, hopefully leading to the emergence of even better visual features and, perhaps, an even more transparent protocol.
We showed instead that deep agents can learn to refer to a high number of categories depicted in large-scale image datasets, while communicating through a discrete channel and developing their visual processing modules from scratch.
Performance on referential games with two distinct test sets (one with categories not presented at training), along with protocol analysis, shows that the agents’ protocol is effective and interpretable.
Further integration with self-supervised learning methods should be explored in the future.
In our experiment, agents communicate through a single symbol, but the true expressive power of human language comes from the infinite combinatorial possibilities offered by composing sequences of discrete units.
Additionally, while in our experiments distractors are selected at random, this is obviously not the case in real-life referential settings (dogs will tend to occur near other dogs or humans, rather than between a whale and a space shuttle).
Dealing with realistic category co-occurrence is thus another important future direction.
Finally, although we provided quantitative and qualitative evidence that the agents’ protocol is reasonably transparent, the extent to which the achieved degree of interpretability is good enough for human-in-the-loop scenarios remains to be experimentally investigated.
Acknowledgments. We would like to thank Gemma Boleda, Rahma Chaabouni, Emmanuel Chemla, Simone Conia and Lucas Weber for feedback on an earlier version of this manuscript, and Priya Goyal for technical support.
References
[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of CVPR, pages 248–255, Miami Beach, FL, 2009.
[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
[19] Diane Bouchacourt and Marco Baroni. How agents see things: On visual representations in an emergent language game. In Proceedings of EMNLP, pages 981–985, Brussels, Belgium, 2018.
[20] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.
Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin.
Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018.
Xinlei Chen and Kaiming He. Exploring simple Siamese representation learning. CoRR, abs/2011.10566, 2020.
[26] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 297–304, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.
[27] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.
[28] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In Proceedings of ICLR Conference Track, Toulon, France, 2017. Published online: https://openreview.net/group?id=ICLR.cc/2017/conference.
[29] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
[30] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of AISTATS, pages 315–323, Fort Lauderdale, FL, 2011.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
Philip Bachman, Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Proceedings of NeurIPS, Vancouver, Canada, 2019. Published online: https://papers.nips.cc/paper/2019.
Mang Ye, Xu Zhang, Pong Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of CVPR, pages 6210–6219, Long Beach, CA, 2019.
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu.
The downstream object classification experiments take up to about 16 hours on 8 GPUs.
A.2 Impact of random seeds on +augmentation -shared model performance
To gauge the robustness of our results to model initialization variance, we repeated all experiments after training our most representative model (+augmentation -shared) with 5 different random seeds (including the randomly picked seed consistently used for the results reported in the main text).
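For reproducibility, each run fixes all relevant random number generators before training; a minimal sketch:

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix Python, NumPy and (CUDA) PyTorch RNGs for one training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```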