Context-Aware Group Captioning via Self-Attention and Contrastive
Features
- URL: http://arxiv.org/abs/2004.03708v1
- Date: Tue, 7 Apr 2020 20:59:53 GMT
- Title: Context-Aware Group Captioning via Self-Attention and Contrastive
Features
- Authors: Zhuowan Li, Quan Tran, Long Mai, Zhe Lin, Alan Yuille
- Abstract summary: We introduce a new task, context-aware group captioning, which aims to describe a group of target images in the context of another group of related reference images.
To solve this problem, we propose a framework combining a self-attention mechanism with contrastive feature construction.
Our datasets are constructed on top of the public Conceptual Captions dataset and our new Stock Captions dataset.
- Score: 31.94715153491951
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While image captioning has progressed rapidly, existing works focus mainly on
describing single images. In this paper, we introduce a new task, context-aware
group captioning, which aims to describe a group of target images in the
context of another group of related reference images. Context-aware group
captioning requires not only summarizing information from both the target and
reference image groups but also contrasting between them. To solve this problem,
we propose a framework combining a self-attention mechanism with contrastive
feature construction to effectively summarize common information from each
image group while capturing discriminative information between them. To build
the dataset for this task, we propose to group the images and generate the
group captions based on single image captions using scene graph matching. Our
datasets are constructed on top of the public Conceptual Captions dataset and
our new Stock Captions dataset. Experiments on the two datasets show the
effectiveness of our method on this new task. Related datasets and code are
released at https://lizw14.github.io/project/groupcap .
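The abstract describes the approach only at a high level. Below is a rough, hypothetical sketch of the core idea, not the authors' released code (see the project page above): each image group is summarized with self-attention over pre-extracted image features, and a contrastive feature is formed by subtracting the reference-group summary from the target-group summary before predicting caption words. The layer sizes, the mean-pooling step, and the single-step word scorer are assumptions made for illustration.

```python
# Minimal sketch of the idea in the abstract: summarize each image group with
# self-attention, then contrast the target summary against the reference
# summary before decoding a caption. Illustration only; dimensions, pooling,
# and the decoder stand-in are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class GroupContrastCaptioner(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8, vocab_size=10000):
        super().__init__()
        # Self-attention over the images of one group (order-invariant summary).
        self.group_encoder = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        # Fuse the target summary with the contrastive (target - reference) feature.
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)
        # Stand-in decoder: a single step scoring vocabulary words; a real
        # system would generate a full sentence with an LSTM/Transformer decoder.
        self.word_scores = nn.Linear(feat_dim, vocab_size)

    def summarize(self, group_feats):
        # group_feats: (batch, num_images, feat_dim) pre-extracted image features.
        attended = self.group_encoder(group_feats)
        return attended.mean(dim=1)  # (batch, feat_dim) group summary

    def forward(self, target_feats, reference_feats):
        target_summary = self.summarize(target_feats)
        reference_summary = self.summarize(reference_feats)
        # Contrastive feature: what the target group has beyond the references.
        contrast = target_summary - reference_summary
        fused = torch.relu(self.fuse(torch.cat([target_summary, contrast], dim=-1)))
        return self.word_scores(fused)  # logits over caption words


if __name__ == "__main__":
    model = GroupContrastCaptioner()
    targets = torch.randn(2, 5, 512)      # 2 examples, 5 target images each
    references = torch.randn(2, 20, 512)  # 20 reference images each
    print(model(targets, references).shape)  # torch.Size([2, 10000])
```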
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z) - "Let's not Quote out of Context": Unified Vision-Language Pretraining
for Context Assisted Image Captioning [40.01197694624958]
We propose a new unified Vision-Language (VL) model based on the One For All (OFA) model.
Our approach aims to overcome the context-independent (image and text are treated independently) nature of the existing approaches.
Our system achieves state-of-the-art results with an improvement of up to 8.34 CIDEr score on the benchmark news image captioning datasets.
arXiv Detail & Related papers (2023-06-01T17:34:25Z) - iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z) - Semi-Supervised Image Captioning by Adversarially Propagating Labeled
Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We present extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis on the scarcely-paired dataset.
arXiv Detail & Related papers (2023-01-26T15:25:43Z) - NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z) - CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity.
arXiv Detail & Related papers (2022-04-27T14:40:31Z) - Guiding Attention using Partial-Order Relationships for Image Captioning [2.620091916172863]
A guided attention network mechanism exploits the relationship between the visual scene and text descriptions.
A pairwise ranking objective is used to train this embedding space, which places similar images, topics and captions close together in the shared semantic space.
Experimental results on the MSCOCO dataset show the competitiveness of our approach.
arXiv Detail & Related papers (2022-04-15T14:22:09Z) - Partially-supervised novel object captioning leveraging context from
paired data [11.215352918313577]
We create synthetic paired captioning data for novel objects by leveraging context from existing image-caption pairs.
We further re-use these partially paired images with novel objects to create pseudo-label captions.
Our approach achieves state-of-the-art results on the held-out MS COCO out-of-domain test split.
arXiv Detail & Related papers (2021-09-10T21:31:42Z) - Who's Waldo? Linking People Across Text and Images [56.40556801773923]
We present a task and benchmark dataset for person-centric visual grounding.
Our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues.
We propose a Transformer-based method that outperforms several strong baselines on this task.
arXiv Detail & Related papers (2021-08-16T17:36:49Z) - Diverse Image Captioning with Context-Object Split Latent Spaces [22.95979735707003]
We introduce a novel factorization of the latent space, termed context-object split, to model diversity in contextual descriptions across images and texts.
Our framework not only enables diverse captioning through context-based pseudo supervision, but also extends this to images with novel objects and without paired captions in the training data.
arXiv Detail & Related papers (2020-11-02T13:33:20Z) - Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense relational captioning, a novel task which aims to generate captions with respect to the relational information between objects in a visual scene.
This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.