Image Captioning with Visual Object Representations Grounded in the
Textual Modality
- URL: http://arxiv.org/abs/2010.09413v2
- Date: Tue, 20 Oct 2020 12:24:39 GMT
- Title: Image Captioning with Visual Object Representations Grounded in the
Textual Modality
- Authors: Dušan Variš, Katsuhito Sudoh, and Satoshi Nakamura
- Abstract summary: We explore the possibilities of a shared embedding space between textual and visual modality.
We propose an approach opposite to the current trend: grounding the representations in the word embedding space of the captioning system.
- Score: 14.797241131469486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present our work in progress exploring the possibilities of a shared
embedding space between the textual and visual modalities. Leveraging the textual
nature of object detection labels and the hypothetical expressiveness of
extracted visual object representations, we propose an approach opposite to the
current trend: grounding the representations in the word embedding space of
the captioning system instead of grounding words or sentences in their
associated images. Building on previous work, we apply additional grounding
losses to the image captioning training objective, aiming to force visual object
representations to create more heterogeneous clusters based on their class
label and to copy the semantic structure of the word embedding space. In addition,
we provide an analysis of the learned object vector space projection and its
impact on the IC system performance. With only a slight change in performance,
the grounded models reach the stopping criterion during training faster than the
unconstrained model, needing about two to three times fewer training updates.
Additionally, an improvement in structural correlation between the word
embeddings and both the original and projected object vectors suggests that the
grounding is actually mutual.
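The abstract does not spell out the grounding losses themselves; as a minimal illustrative sketch (the linear projection, the cosine formulation, and all names here are assumptions for illustration, not the paper's exact objective), one such loss could pull each projected visual object vector toward the word embedding of its detected class label:

```python
import math
import random

# Illustrative sketch only: a grounding loss of the kind the abstract
# describes, computed as the mean cosine distance between linearly
# projected visual object vectors and the word embeddings of their
# detected class labels. Lower values mean the object space sits
# closer to the semantic structure of the word embedding space.

def matvec(M, x):
    """Multiply matrix M (rows x cols) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def grounding_loss(object_feats, class_ids, word_emb, proj):
    """Mean cosine distance between projected object vectors and the
    word embeddings of their class labels."""
    total = 0.0
    for feats, cid in zip(object_feats, class_ids):
        projected = matvec(proj, feats)  # map visual space -> word space
        total += cosine_distance(projected, word_emb[cid])
    return total / len(object_feats)

# Toy example: 2 detected objects with 3-dim visual features and a
# vocabulary of 4 class-label embeddings in a 2-dim word space.
random.seed(0)
proj = [[random.gauss(0, 1) for _ in range(3)] for _ in range(2)]
objects = [[0.5, -0.2, 0.1], [1.0, 0.3, -0.4]]
word_emb = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
labels = [0, 2]
print(grounding_loss(objects, labels, word_emb, proj))
```

Added as an extra term on the captioning objective, a loss of this shape would penalize object representations whose projections drift away from their label's embedding, which is one way to encourage the class-based clustering the abstract describes.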
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
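The hierarchical behaviour that hyperbolic vision-language models exploit comes from the geometry of the embedding space itself. As a sketch of the standard Poincaré-ball distance (textbook hyperbolic geometry, not this paper's training code; the example points are invented):

```python
import math

# Distance on the Poincaré ball: distances grow rapidly near the unit
# boundary, so general concepts can sit near the origin while their
# specializations sit farther out, giving embeddings an innate
# hierarchical structure.

def poincare_distance(u, v, eps=1e-9):
    uu = sum(a * a for a in u)            # squared norm of u (< 1)
    vv = sum(b * b for b in v)            # squared norm of v (< 1)
    duv = sum((a - b) ** 2 for a, b in zip(u, v))
    x = 1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv) + eps)
    return math.acosh(x)

general = [0.1, 0.0]    # a broad concept placed near the origin
specific = [0.8, 0.1]   # a specific embedding placed near the boundary
print(poincare_distance(general, specific))
```

Entailment-style objectives built on this geometry can then score whether one embedding (e.g. a caption) subsumes another (e.g. an image region) by their relative positions in the ball.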
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Top-Down Framework for Weakly-supervised Grounded Image Captioning [19.00510117145054]
Weakly-supervised grounded image captioning aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision.
We propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level.
arXiv Detail & Related papers (2023-06-13T01:42:18Z)
- Hyperbolic Contrastive Learning for Visual Representations beyond Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Consensus Graph Representation Learning for Better Grounded Image Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z)
- Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning.
arXiv Detail & Related papers (2021-07-30T19:24:07Z)
- Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos [96.85840365678649]
We tackle the problem of referring expression comprehension in videos with an elegant one-stage framework.
We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency.
Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset.
arXiv Detail & Related papers (2021-03-23T06:42:49Z)
- Incorporating Visual Semantics into Sentence Representations within a Grounded Space [20.784771968813747]
We propose to transfer visual information to textual representations by learning an intermediate representation space: the grounded space.
We show that this model outperforms the previous state-of-the-art on classification and semantic relatedness tasks.
arXiv Detail & Related papers (2020-02-07T12:26:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.