Structured Multi-modal Feature Embedding and Alignment for
  Image-Sentence Retrieval
- URL: http://arxiv.org/abs/2108.02417v1
- Date: Thu, 5 Aug 2021 07:24:54 GMT
- Title: Structured Multi-modal Feature Embedding and Alignment for
  Image-Sentence Retrieval
- Authors: Xuri Ge, Fuhai Chen, Joemon M. Jose, Zhilong Ji, Zhongqin Wu, Xiao Liu
- Abstract summary: The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fragments.
We propose a novel Structured Multi-modal Feature Embedding and Alignment model for image-sentence retrieval.
In particular, the relations of the visual and textual fragments are modeled by constructing Visual Context-aware Structured Tree encoder (VCS-Tree) and Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels.
- Score: 12.050958976545914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   The current state-of-the-art image-sentence retrieval methods implicitly
align the visual-textual fragments, like regions in images and words in
sentences, and adopt attention modules to highlight the relevance of
cross-modal semantic correspondences. However, the retrieval performance
remains unsatisfactory due to a lack of consistent representation in both
semantics and structural spaces. In this work, we propose to address the above
issue from two aspects: (i) constructing intrinsic structure (along with
relations) among the fragments of respective modalities, e.g., "dog $\to$ play
$\to$ ball" in semantic structure for an image, and (ii) seeking explicit
inter-modal structural and semantic correspondence between the visual and
textual modalities. In this paper, we propose a novel Structured Multi-modal
Feature Embedding and Alignment (SMFEA) model for image-sentence retrieval. In
order to jointly and explicitly learn the visual-textual embedding and the
cross-modal alignment, SMFEA creates a novel multi-modal structured module with
a shared context-aware referral tree. In particular, the relations of the
visual and textual fragments are modeled by constructing Visual Context-aware
Structured Tree encoder (VCS-Tree) and Textual Context-aware Structured Tree
encoder (TCS-Tree) with shared labels, from which visual and textual features
can be jointly learned and optimized. We utilize the multi-modal tree structure
to explicitly align the heterogeneous image-sentence data by maximizing the
semantic and structural similarity between corresponding inter-modal tree
nodes. Extensive experiments on Microsoft COCO and Flickr30K benchmarks
demonstrate the superiority of the proposed model in comparison to the
state-of-the-art methods.
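
The explicit alignment step described in the abstract (scoring an image-sentence pair by the similarity of corresponding tree nodes) can be illustrated with a minimal sketch. This is not the SMFEA implementation: the three node labels, the toy 2-D embeddings, the use of mean cosine similarity, and all function names below are assumptions made for illustration only. The abstract does not specify the tree layout or the similarity function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical node labels shared by the visual and textual trees,
# loosely following the "dog -> play -> ball" example in the abstract.
NODE_LABELS = ("subject", "relation", "object")

def tree_alignment_score(visual_tree, textual_tree):
    """Mean cosine similarity over nodes that carry the same label."""
    sims = [cosine(visual_tree[k], textual_tree[k]) for k in NODE_LABELS]
    return sum(sims) / len(sims)

def rank_sentences(visual_tree, textual_trees):
    """Return sentence indices sorted from best to worst alignment."""
    scores = [tree_alignment_score(visual_tree, t) for t in textual_trees]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy 2-D node embeddings (made up; real features would be learned).
image = {"subject": [1.0, 0.0], "relation": [0.0, 1.0], "object": [1.0, 1.0]}
matching = {"subject": [2.0, 0.0], "relation": [0.0, 3.0], "object": [2.0, 2.0]}
mismatch = {"subject": [0.0, 1.0], "relation": [1.0, 0.0], "object": [1.0, -1.0]}

# The sentence whose nodes point in the same directions as the image's
# nodes is ranked first.
ranking = rank_sentences(image, [mismatch, matching])
print(ranking)
```

In this sketch the score depends only on node-wise agreement under shared labels, which is one simple way to read "semantic and structural similarity between corresponding inter-modal tree nodes"; a trained model would typically learn the node embeddings and optimize such a score with a ranking objective.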
 
      
Related papers
- Visual Semantic Description Generation with MLLMs for Image-Text Matching [7.246705430021142]
 We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) as visual semantics.
Our approach combines: (1) Instance-level alignment by fusing visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) Prototype-level alignment through VSD clustering to ensure category-level consistency.
 arXiv  Detail & Related papers  (2025-07-11T13:38:01Z)
- Embedding and Enriching Explicit Semantics for Visible-Infrared Person Re-Identification [31.011118085494942]
 Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities.
Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics.
We propose an Embedding and Enriching Explicit Semantics framework to learn semantically rich cross-modality pedestrian representations.
 arXiv  Detail & Related papers  (2024-12-11T14:27:30Z)
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
 We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network lying on top of existing visual and language encoders using a small dataset.
 arXiv  Detail & Related papers  (2024-09-12T16:46:41Z)
- Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching [7.7559623054251]
 Image-text matching (ITM) is a fundamental problem in computer vision.
We propose a Hybrid-modal Interaction with multiple Relational Enhancements (termed Hire) for image-text matching.
In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation of visual objects.
 arXiv  Detail & Related papers  (2024-06-05T13:10:55Z)
- Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval [32.793170116202475]
 We show the discrepancy between image-to-text association and text-to-image association.
We propose CADA: Cross-Modal Adaptive Dual Association that finely builds bidirectional image-text detailed associations.
 arXiv  Detail & Related papers  (2023-12-04T09:10:24Z)
- Progressive Tree-Structured Prototype Network for End-to-End Image
  Captioning [74.8547752611337]
 We propose a novel Progressive Tree-Structured prototype Network (dubbed PTSN).
PTSN is the first attempt to narrow down the scope of prediction words with appropriate semantics by modeling the hierarchical textual semantics.
Our method achieves a new state-of-the-art performance with 144.2% (single model) and 146.5% (ensemble of 4 models) CIDEr scores on the Karpathy split and 141.4% (c5) and 143.9% (c40) CIDEr scores on the official online test server.
 arXiv  Detail & Related papers  (2022-11-17T11:04:00Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
 Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architecture and diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
 arXiv  Detail & Related papers  (2022-11-14T11:41:44Z)
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid
  Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
 Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) with Hybrid Counterfactual Training framework.
 arXiv  Detail & Related papers  (2022-07-09T07:14:44Z)
- Finding Structural Knowledge in Multimodal-BERT [18.469318775468754]
 We make the inherent structure of language and visuals explicit by a dependency parse of the sentences that describe the image.
We call this explicit visual structure the scene tree, which is based on the dependency tree of the language description.
 arXiv  Detail & Related papers  (2022-03-17T13:20:01Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image
  Paragraph Captioning [50.08729005865331]
 This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
 arXiv  Detail & Related papers  (2021-05-10T06:55:39Z)
- Linguistic Structure Guided Context Modeling for Referring Image
  Segmentation [61.701577239317785]
 We propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction.
Our LSCM module builds a Dependency Parsing Tree Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence.
 arXiv  Detail & Related papers  (2020-10-01T16:03:51Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
 We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
 arXiv  Detail & Related papers  (2020-06-21T14:10:47Z)