Dense Relational Image Captioning via Multi-task Triple-Stream Networks
- URL: http://arxiv.org/abs/2010.03855v3
- Date: Mon, 11 Oct 2021 08:49:57 GMT
- Title: Dense Relational Image Captioning via Multi-task Triple-Stream Networks
- Authors: Dong-Jin Kim, Tae-Hyun Oh, Jinsoo Choi, In So Kweon
- Abstract summary: We introduce dense relational captioning, a novel task which aims to generate captions with respect to relational information between objects in a visual scene.
This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding.
- Score: 95.0476489266988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce dense relational captioning, a novel image captioning task which
aims to generate multiple captions with respect to relational information
between objects in a visual scene. Relational captioning provides explicit
descriptions for each relationship between object combinations. This framework
is advantageous in both diversity and amount of information, leading to a
comprehensive image understanding based on relationships, e.g., relational
proposal generation. For relational understanding between objects, the
part-of-speech (POS; i.e., subject-object-predicate categories) can serve as
valuable prior information to guide the causal sequence of words in a caption.
We therefore train our framework not only to generate captions but also to
predict the POS of each word. To this end, we propose the multi-task
triple-stream network (MTTSNet), which consists of three recurrent units, one
responsible for each POS category, trained jointly to predict the correct
caption and the POS of each word. In addition, we found that the performance of
MTTSNet can be improved by modulating the object embeddings with an explicit
relational module. Extensive experimental analysis on large-scale datasets
with several metrics demonstrates that our proposed model generates more
diverse and richer captions. We then present applications of our framework to
holistic image captioning, scene graph generation, and retrieval tasks.
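The abstract names the components of MTTSNet but not their exact wiring, so the following PyTorch sketch is one plausible reading rather than the authors' implementation: it assumes ground-truth POS labels (the paper's coarse subject/predicate/object categories, not full POS tags) route each word to one of three LSTMCell streams during teacher forcing, and it omits the relational module that modulates object embeddings. All identifiers (TripleStreamDecoder, joint_loss, pos_weight) and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TripleStreamDecoder(nn.Module):
    """One LSTM stream per POS category (subject / predicate / object).

    At each step every stream advances; the hidden state of the stream
    matching the token's POS label feeds two heads, word prediction and
    POS prediction, giving the multi-task objective.
    """

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, num_pos=3):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Three recurrent units, one per POS category, as in the abstract.
        self.streams = nn.ModuleList(
            nn.LSTMCell(embed_dim, hidden_dim) for _ in range(num_pos)
        )
        self.word_head = nn.Linear(hidden_dim, vocab_size)  # caption task
        self.pos_head = nn.Linear(hidden_dim, num_pos)      # POS task

    def forward(self, tokens, pos_labels):
        # tokens, pos_labels: (batch, seq_len); POS labels in {0, 1, 2}.
        batch, seq_len = tokens.shape
        emb = self.embed(tokens)
        h = [emb.new_zeros(batch, self.hidden_dim) for _ in self.streams]
        c = [emb.new_zeros(batch, self.hidden_dim) for _ in self.streams]
        word_logits, pos_logits = [], []
        for t in range(seq_len):
            # Advance every stream, then pick the one matching this word's POS.
            for k, cell in enumerate(self.streams):
                h[k], c[k] = cell(emb[:, t], (h[k], c[k]))
            stacked = torch.stack(h, dim=1)  # (batch, num_pos, hidden)
            idx = pos_labels[:, t].view(batch, 1, 1).expand(-1, 1, self.hidden_dim)
            h_t = stacked.gather(1, idx).squeeze(1)  # stream chosen by POS
            word_logits.append(self.word_head(h_t))
            pos_logits.append(self.pos_head(h_t))
        return torch.stack(word_logits, 1), torch.stack(pos_logits, 1)

def joint_loss(word_logits, pos_logits, next_tokens, pos_labels, pos_weight=0.5):
    """Multi-task objective: caption cross-entropy plus weighted POS loss."""
    ce = nn.CrossEntropyLoss()
    caption_loss = ce(word_logits.flatten(0, 1), next_tokens.flatten())
    pos_loss = ce(pos_logits.flatten(0, 1), pos_labels.flatten())
    return caption_loss + pos_weight * pos_loss
```

Advancing all three streams each step and selecting by POS keeps the computation batch-parallel; at inference time, where ground-truth POS is unavailable, the prediction from pos_head could drive the selection instead.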
Related papers
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: they should be informationally sufficient, minimally redundant, and readily comprehensible by humans.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
- Panoptic Perception: A Novel Task and Fine-grained Dataset for Universal Remote Sensing Image Interpretation [19.987706084203523]
We propose Panoptic Perception, a novel task and a new fine-grained dataset (FineGrip) to achieve a more thorough and universal interpretation for RSIs.
The new task integrates pixel-level, instance-level, and image-level information for universal image perception.
FineGrip dataset includes 2,649 remote sensing images, 12,054 fine-grained instance segmentation masks belonging to 20 foreground things categories, 7,599 background semantic masks for 5 stuff classes and 13,245 captioning sentences.
arXiv Detail & Related papers (2024-04-06T12:27:21Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture comprehensive multimodal features, we construct feature graphs for the image and text modalities, respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are refined through three-level similarity functions to achieve hierarchical alignment.
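The summary names "three-level similarity functions" without defining them, so the snippet below is only a generic illustration of hierarchical alignment, assuming cosine similarity at node, subgraph, and global granularity; the feature names and the max-then-mean matching rule are our assumptions, not HGAN's actual design.

```python
import torch
import torch.nn.functional as F

def hierarchical_score(img_nodes, txt_nodes, img_sub, txt_sub, img_glob, txt_glob):
    """Average a matching score over three granularities.

    img_nodes: (Ni, d), txt_nodes: (Nt, d)  -- local node features
    img_sub:   (Ks, d), txt_sub:   (Kt, d)  -- subgraph-level features
    img_glob:  (d,),    txt_glob:  (d,)     -- one global feature each
    """
    # Level 1: match each image node to its best text node, then average.
    node_sim = F.cosine_similarity(
        img_nodes.unsqueeze(1), txt_nodes.unsqueeze(0), dim=-1)  # (Ni, Nt)
    level1 = node_sim.max(dim=1).values.mean()
    # Level 2: same max-then-mean matching over subgraph features.
    sub_sim = F.cosine_similarity(
        img_sub.unsqueeze(1), txt_sub.unsqueeze(0), dim=-1)      # (Ks, Kt)
    level2 = sub_sim.max(dim=1).values.mean()
    # Level 3: a single global similarity for the pair.
    level3 = F.cosine_similarity(img_glob, txt_glob, dim=-1)
    return (level1 + level2 + level3) / 3.0
```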
arXiv Detail & Related papers (2022-12-16T05:08:52Z)
- Guiding Attention using Partial-Order Relationships for Image Captioning [2.620091916172863]
A guided attention network mechanism exploits the relationship between the visual scene and text descriptions.
A pairwise ranking objective is used to train this embedding space, placing similar images, topics, and captions close together in the shared semantic space.
Experimental results on the MSCOCO dataset show the competitiveness of our approach.
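The pairwise ranking objective mentioned here is not spelled out in the summary; a common instantiation for a shared image-text embedding space is the max-margin triplet ranking loss sketched below (the margin value and in-batch negative sampling are assumptions, not necessarily this paper's exact formulation).

```python
import torch

def pairwise_ranking_loss(img_emb, cap_emb, margin=0.2):
    """img_emb, cap_emb: (batch, dim), L2-normalised; row i of each is a
    matching image-caption pair, other rows serve as in-batch negatives."""
    scores = img_emb @ cap_emb.t()             # (batch, batch) similarities
    pos = scores.diag().unsqueeze(1)           # matching-pair scores
    # Hinge: negatives should score at least `margin` below positives.
    cost_cap = (margin + scores - pos).clamp(min=0)      # image -> wrong caption
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # caption -> wrong image
    eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(eye, 0)    # ignore the positive pairs
    cost_img = cost_img.masked_fill(eye, 0)
    return cost_cap.mean() + cost_img.mean()
```

Each row of the score matrix treats the diagonal entry as the positive pair and every other entry as a negative, penalizing any negative that comes within `margin` of the positive.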
arXiv Detail & Related papers (2022-04-15T14:22:09Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims to segment the foreground masks of the entities that match the description given in a natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture that better exploits the semantics available in captions and leverages them to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
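The summary credits weakly supervised multi-instance learning (MIL) for building caption-guided relationship graphs, without giving the loss. A generic max-pooling MIL loss under that weak-supervision assumption is sketched below; the pair scorer's shape and the multi-hot label source are hypothetical.

```python
import torch
import torch.nn.functional as F

def mil_relation_loss(pair_scores, caption_labels):
    """pair_scores: (num_pairs, num_predicates) logits for every ordered
    region pair; caption_labels: (num_predicates,) multi-hot float vector
    of predicates mentioned in the image's captions."""
    # MIL assumption: if a predicate is mentioned in a caption, at least
    # one region pair expresses it, so supervise only the best-scoring
    # pair per predicate via max pooling over the bag of pairs.
    image_logits = pair_scores.max(dim=0).values  # (num_predicates,)
    return F.binary_cross_entropy_with_logits(image_logits, caption_labels)
```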
arXiv Detail & Related papers (2020-06-21T14:10:47Z)