Listener Model for the PhotoBook Referential Game with CLIPScores as
Implicit Reference Chain
- URL: http://arxiv.org/abs/2306.09607v1
- Date: Fri, 16 Jun 2023 03:41:14 GMT
- Title: Listener Model for the PhotoBook Referential Game with CLIPScores as
Implicit Reference Chain
- Authors: Shih-Lun Wu, Yi-Hui Chou, and Liangze Li
- Abstract summary: PhotoBook is a collaborative dialogue game where two players receive private, partially-overlapping sets of images and resolve which images they have in common.
We propose a reference chain-free listener model that directly addresses the game's predictive task, i.e., deciding whether an image is shared with the partner.
Our DeBERTa-based listener model reads the full dialogue, and utilizes CLIPScore features to assess utterance-image relevance.
- Score: 0.9558392439655015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: PhotoBook is a collaborative dialogue game where two players receive private,
partially-overlapping sets of images and resolve which images they have in
common. It presents machines with a great challenge to learn how people build
common ground around multimodal context to communicate effectively. Methods
developed in the literature, however, cannot be deployed to real gameplay since
they only tackle some subtasks of the game, and they require additional
reference chain inputs, whose extraction process is imperfect. Therefore, we
propose a reference chain-free listener model that directly addresses the
game's predictive task, i.e., deciding whether an image is shared with the partner.
Our DeBERTa-based listener model reads the full dialogue, and utilizes
CLIPScore features to assess utterance-image relevance. We achieve >77%
accuracy on unseen sets of images/game themes, outperforming the baseline by >17
points.
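The abstract does not spell out how the CLIPScore features are computed or how the listener consumes them. As a rough illustration only, the sketch below uses the commonly cited CLIPScore definition, w * max(cos(text, image), 0) with w = 2.5, and applies it to every utterance-image pair. The function names (cosine, clip_score, score_matrix) and the toy 3-dimensional vectors are hypothetical; in practice the embeddings would come from a CLIP text and image encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def clip_score(text_emb, image_emb, w=2.5):
    """CLIPScore under the assumed definition: w * max(cosine, 0)."""
    return w * max(cosine(text_emb, image_emb), 0.0)


def score_matrix(utterance_embs, image_embs):
    """One row per utterance, one column per image."""
    return [[clip_score(u, v) for v in image_embs] for u in utterance_embs]


# Toy stand-ins for CLIP embeddings (assumed values, not real model output).
utterances = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
images = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [1.0, 1.0, 0.0]]

scores = score_matrix(utterances, images)
for row in scores:
    print([round(s, 3) for s in row])
```

Each row of the resulting matrix indicates how relevant one utterance is to each candidate image. A listener could use such scores as a soft signal of which image an utterance refers to, without an explicitly extracted reference chain; how the paper's model actually integrates them is not described in this summary.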
Related papers
- TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST).
In particular, we pre-extracted the topic information of stories from both visual and linguistic perspectives.
(2024-03-18T08:01:23Z)
- Collecting Visually-Grounded Dialogue with A Game Of Sorts [5.478764356647438]
We introduce a collaborative image ranking task, a grounded agreement game we call "A Game Of Sorts".
In our game, players are tasked with reaching agreement on how to rank a set of images given some sorting criterion through a largely unrestricted, role-symmetric dialogue.
We describe results of a small-scale data collection experiment with the proposed task.
(2023-09-10T23:00:35Z)
- Towards Better Multi-modal Keyphrase Generation via Visual Entity
Enhancement and Multi-granularity Image Noise Filtering [79.44443231700201]
Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair.
The input text and image are often not perfectly matched, and thus the image may introduce noise into the model.
We propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise.
(2023-09-09T09:41:36Z)
- Visually grounded few-shot word learning in low-resource settings [23.826000011632917]
We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs.
Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images.
With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark.
(2023-06-20T08:27:42Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for
Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
(2022-07-15T17:50:51Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
(2022-05-25T10:12:17Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
(2022-05-24T00:52:40Z)
- PatchGame: Learning to Signal Mid-level Patches in Referential Games [38.79852742348459]
We study a referential game where two agents communicate with each other via a discrete bottleneck to achieve a common goal.
In our referential game, the goal of the speaker is to compose a message or a symbolic representation of "important" image patches, while the listener is to match the speaker's message to a different view of the same image.
We show that it is indeed possible for the two agents to develop a communication protocol without explicit or implicit supervision.
(2021-11-02T17:59:00Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
(2020-12-09T12:40:13Z)
- Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.
We show that previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history.
We present methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters.
(2020-01-17T14:57:12Z)