PatchGame: Learning to Signal Mid-level Patches in Referential Games
- URL: http://arxiv.org/abs/2111.01785v1
- Date: Tue, 2 Nov 2021 17:59:00 GMT
- Title: PatchGame: Learning to Signal Mid-level Patches in Referential Games
- Authors: Kamal Gupta, Gowthami Somepalli, Anubhav Gupta, Vinoj Jayasundara,
Matthias Zwicker, Abhinav Shrivastava
- Abstract summary: We study a referential game where two agents communicate with each other via a discrete bottleneck to achieve a common goal.
In our referential game, the speaker's goal is to compose a message, a symbolic representation of "important" image patches, while the listener's task is to match the speaker's message to a different view of the same image.
We show that it is indeed possible for the two agents to develop a communication protocol without explicit or implicit supervision.
- Score: 38.79852742348459
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study a referential game (a type of signaling game) where two agents
communicate with each other via a discrete bottleneck to achieve a common goal.
In our referential game, the goal of the speaker is to compose a message or a
symbolic representation of "important" image patches, while the task for the
listener is to match the speaker's message to a different view of the same
image. We show that it is indeed possible for the two agents to develop a
communication protocol without explicit or implicit supervision. We further
investigate the developed protocol and demonstrate its applications in speeding up
recent Vision Transformers by using only the important patches, and as pre-training
for downstream recognition tasks (e.g., classification). Code available at
https://github.com/kampta/PatchGame.
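The speaker/listener setup above can be illustrated with a toy sketch: the speaker picks the most "important" patches and quantizes each to a symbol from a shared codebook (the discrete bottleneck), and the listener decodes the symbols and matches them against candidate views. This is a minimal, hypothetical stand-in, not the paper's architecture; the codebook, importance scores, and nearest-neighbour quantization here are illustrative assumptions (PatchGame learns these components end to end).

```python
# Toy referential game with a discrete bottleneck (illustrative only;
# the codebook, patch scores, and matching rule are hypothetical
# stand-ins for the learned components in PatchGame).
import math

CODEBOOK = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]  # shared discrete vocabulary


def dist(a, b):
    """Euclidean distance between two patch feature vectors."""
    return math.dist(a, b)


def speaker(patches, scores, k=2):
    """Select the k most 'important' patches and emit discrete symbols."""
    top = sorted(range(len(patches)), key=lambda i: -scores[i])[:k]
    message = []
    for i in top:
        # discrete bottleneck: quantize the patch to its nearest codebook entry
        sym = min(range(len(CODEBOOK)),
                  key=lambda j: dist(patches[i], CODEBOOK[j]))
        message.append(sym)
    return message


def listener(message, candidate_views):
    """Decode the message and return the index of the best-matching view."""
    decoded = [CODEBOOK[s] for s in message]

    def score(view):
        # sum of best-match similarities between decoded symbols and view patches
        return sum(max(-dist(d, p) for p in view) for d in decoded)

    return max(range(len(candidate_views)),
               key=lambda i: score(candidate_views[i]))
```

With two patches and two candidate views, the speaker emits symbols for the high-score patches and the listener recovers the matching view purely from those discrete symbols; in the real game, both agents are trained so this matching succeeds without supervision.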
Related papers
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
arXiv Detail & Related papers (2024-06-09T03:38:21Z)
- With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training.
arXiv Detail & Related papers (2023-08-23T18:53:00Z)
- Listener Model for the PhotoBook Referential Game with CLIPScores as Implicit Reference Chain [0.9558392439655015]
PhotoBook is a collaborative dialogue game where two players receive private, partially-overlapping sets of images and resolve which images they have in common.
We propose a reference chain-free listener model that directly addresses the game's predictive task, i.e., deciding whether an image is shared with the partner.
Our DeBERTa-based listener model reads the full dialogue, and utilizes CLIPScore features to assess utterance-image relevance.
arXiv Detail & Related papers (2023-06-16T03:41:14Z)
- Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
MUTR, the first unified framework for this task, adopts a DETR-style transformer and can segment video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z)
- Learning Multi-Object Positional Relationships via Emergent Communication [16.26264889682904]
We train agents in a referential game where observations contain two objects, and find that generalization is the major problem when the positional relationship is involved.
We find that the learned language can generalize well in a new multi-step MDP task where the positional relationship describes the goal, and performs better than raw-pixel images as well as pre-trained image features.
We also show that language transfer from the referential game performs better in the new task than learning language directly in this task, implying the potential benefits of pre-training in referential games.
arXiv Detail & Related papers (2023-02-16T04:44:53Z)
- Learning to Communicate with Intent: An Introduction [2.007345596217044]
We propose a novel framework to learn how to transmit messages over a wireless communication channel based on the end-goal of the communication.
This stands in stark contrast to classical communication systems, where the objective is to reproduce at the receiver side, either exactly or approximately, the message sent by the transmitter.
arXiv Detail & Related papers (2022-11-17T16:02:13Z)
- Unsupervised Visual Representation Learning by Tracking Patches in Video [88.56860674483752]
We propose to use tracking as a proxy task for a computer vision system to learn visual representations.
Modelled on the catch game played by children, we design a Catch-the-Patch (CtP) game for a 3D-CNN model to learn visual representations.
arXiv Detail & Related papers (2021-05-06T09:46:42Z)
- The emergence of visual semantics through communication games [0.0]
Communication systems which capture visual semantics can be learned in a completely self-supervised manner by playing the right types of game.
Our work bridges a gap between emergent communication research and self-supervised feature learning.
arXiv Detail & Related papers (2021-01-25T17:43:37Z)
- A Framework for Generative and Contrastive Learning of Audio Representations [2.8935588665357077]
We present a self-supervised framework for contrastive learning of audio representations without access to ground-truth labels.
We also explore generative models based on state-of-the-art transformer-based architectures for learning latent spaces for audio signals.
Our system achieves performance comparable to a fully supervised method that has access to ground-truth labels for training the neural network model.
arXiv Detail & Related papers (2020-10-22T05:52:32Z)
- Seeing wake words: Audio-visual Keyword Spotting [103.12655603634337]
KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection.
We show that our method generalises to other languages, specifically French and German, and achieves performance comparable to English with less language-specific data.
arXiv Detail & Related papers (2020-09-02T17:57:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.