Related papers: aiTPR: Attribute Interaction-Tensor Product Representation for Image Caption

URL: http://arxiv.org/abs/2001.09545v1
Date: Mon, 27 Jan 2020 00:19:41 GMT
Title: aiTPR: Attribute Interaction-Tensor Product Representation for Image Caption
Authors: Chiranjib Sur
Abstract summary: Region visual features enhance the generative capability of the machines based on features; however, they lack proper interaction attentional perceptions. In this work, we propose Attribute Interaction-Tensor Product Representation (aiTPR), which is a convenient way of gathering more information.
Score: 9.89901717499058
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Region visual features enhance the generative capability of the machines based on features; however, they lack proper interaction attentional perceptions and thus end up with biased or uncorrelated sentences or pieces of misinformation. In this work, we propose Attribute Interaction-Tensor Product Representation (aiTPR), which is a convenient way of gathering more information through orthogonal combination, learning the interactions as physical entities (tensors), and improving the captions. Compared to previous works, where features are added up to undefined feature spaces, TPR helps in maintaining sanity in combinations and orthogonality helps in defining familiar spaces. We have introduced a new concept layer that defines the objects and also their interactions, which can play a crucial role in the determination of different descriptions. The interaction portions have contributed heavily to better caption quality and have outperformed different previous works in this domain on the MSCOCO dataset. We introduced, for the first time, the notion of combining regional image features and abstracted interaction likelihood embedding for image captioning.

Related papers

Text-driven Multiplanar Visual Interaction for Semi-supervised Medical Image Segmentation [48.76848912120607]
Semi-supervised medical image segmentation is a crucial technique for alleviating the high cost of data annotation. We propose a novel text-driven multiplanar visual interaction framework for semi-supervised medical image segmentation (termed Text-SemiSeg). Our framework consists of three main modules: Text-enhanced Multiplanar Representation (TMR), Category-aware Semantic Alignment (CSA), and Dynamic Cognitive Augmentation (DCA).
arXiv Detail & Related papers (2025-07-16T16:29:30Z)
Dual Relation Mining Network for Zero-Shot Learning [48.89161627050706]
We propose a Dual Relation Mining Network (DRMN) to enable effective visual-semantic interactions and learn semantic relationship among attributes for knowledge transfer. Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion. For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images.
arXiv Detail & Related papers (2024-05-06T16:31:19Z)
Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings. We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features. Our approach achieves favorable performance against existing methods in literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions. We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
Stacked Cross-modal Feature Consolidation Attention Networks for Image Captioning [1.4337588659482516]
This paper exploits a feature-compounding approach to bring together high-level semantic concepts and visual information. We propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning in which we simultaneously consolidate cross-modal features. Our proposed SCFC can outperform various state-of-the-art image captioning benchmarks in terms of popular metrics on the MSCOCO and Flickr30K datasets.
arXiv Detail & Related papers (2023-02-08T09:15:09Z)
Guiding Attention using Partial-Order Relationships for Image Captioning [2.620091916172863]
A guided attention network mechanism exploits the relationship between the visual scene and text descriptions. A pairwise ranking objective is used for training this embedding space which allows similar images, topics and captions in the shared semantic space. The experimental results based on the MSCOCO dataset show the competitiveness of our approach.
arXiv Detail & Related papers (2022-04-15T14:22:09Z)
Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features [11.48760300147023]
This paper introduces a novel method, called Multi-modal Text Recognition Network (MATRN). MATRN identifies visual and semantic feature pairs and encodes spatial information into semantic features. Our experiments demonstrate that MATRN achieves state-of-the-art performance on seven benchmarks by large margins.
arXiv Detail & Related papers (2021-11-30T10:22:11Z)
Enhancing Social Relation Inference with Concise Interaction Graph and Discriminative Scene Representation [56.25878966006678]
We propose an approach of PRactical Inference in Social rElation (PRISE). It concisely learns interactive features of persons and discriminative features of holistic scenes. PRISE achieves 6.8% improvement for domain classification in PIPA dataset.
arXiv Detail & Related papers (2021-07-30T04:20:13Z)
Dual-Level Collaborative Transformer for Image Captioning [126.59298716978577]
We introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features. In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noises caused by the direct fusion of these two features.
arXiv Detail & Related papers (2021-01-16T15:43:17Z)
Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense captioning, a novel task which aims to generate captions with respect to information between objects in a visual scene. This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.