Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval
- URL: http://arxiv.org/abs/2109.05523v1
- Date: Sun, 12 Sep 2021 14:21:15 GMT
- Title: Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval
- Authors: Zhihao Fan, Zhongyu Wei, Zejun Li, Siyuan Wang, Haijun Shan, Xuanjing Huang, Jianqing Fan
- Abstract summary: We introduce additional phrase-level supervision to better identify mismatched units in the text.
We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels.
For the training, we propose multi-scale matching losses from both global and local perspectives.
- Score: 48.20798265640068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image. However, semantic mismatch between an image and its sentences usually occurs at a finer granularity, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision to better identify mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. To integrate both sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside the SSAMT, we utilize different kinds of attention mechanisms to enforce interactions among multi-grained semantic units on both the vision and language sides. For training, we propose multi-scale matching losses from both global and local perspectives, and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared to state-of-the-art models.
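To make the multi-grained training signal concrete, here is a minimal PyTorch sketch of a global (sentence-level) contrastive loss combined with a local (phrase-level) loss that penalizes mismatched phrases. The function names, temperature, and 0.5 weighting are illustrative assumptions, not the exact SSAMT formulation.

```python
import torch
import torch.nn.functional as F

def global_matching_loss(img_emb, sent_emb, temperature=0.07):
    # Sentence-level loss: symmetric InfoNCE over the batch, where the
    # i-th image and i-th sentence form the only positive pair.
    img_emb = F.normalize(img_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    logits = img_emb @ sent_emb.t() / temperature      # (B, B) similarities
    targets = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def local_matching_loss(img_emb, phrase_emb, phrase_labels):
    # Phrase-level loss: score every candidate phrase against the image and
    # penalize mismatched phrases with binary cross-entropy.
    # phrase_labels[i, j] = 1.0 if phrase j matches image i, else 0.0.
    img_emb = F.normalize(img_emb, dim=-1)
    phrase_emb = F.normalize(phrase_emb, dim=-1)
    scores = torch.einsum('bd,bpd->bp', img_emb, phrase_emb)
    return F.binary_cross_entropy_with_logits(scores, phrase_labels)

# Toy usage: a batch of 4 images, 6 candidate phrases each, 256-d embeddings.
B, P, D = 4, 6, 256
img, sent, phrases = torch.randn(B, D), torch.randn(B, D), torch.randn(B, P, D)
labels = torch.randint(0, 2, (B, P)).float()
total = global_matching_loss(img, sent) + 0.5 * local_matching_loss(img, phrases, labels)
print(float(total))
```

In this sketch the phrase labels would come from the scene-graph entities and triples described above, with phrases from non-matching captions serving as negatives.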
Related papers
- Text-Region Matching for Multi-Label Image Recognition with Missing Labels [5.095488730708477]
TRM-ML is a novel method for enhancing meaningful cross-modal matching.
We propose category prototypes that leverage intra- and inter-category semantic relationships to estimate unknown labels.
Our proposed framework outperforms the state-of-the-art methods by a significant margin.
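As a rough illustration of the prototype idea, the sketch below fills in unknown labels using category prototypes built from known positives; the mean pooling and cosine threshold are assumptions for illustration, not TRM-ML's actual design.

```python
import torch
import torch.nn.functional as F

def estimate_missing_labels(features, labels, threshold=0.6):
    # features: (N, D) image embeddings.
    # labels:   (N, C) with 1 = positive, 0 = negative, -1 = unknown.
    feats = F.normalize(features, dim=-1)
    filled = labels.clone()
    for c in range(labels.size(1)):
        positives = feats[labels[:, c] == 1]
        if positives.numel() == 0:
            continue                                # no known positives yet
        prototype = F.normalize(positives.mean(dim=0), dim=0)
        sims = feats @ prototype                    # cosine similarity to prototype
        unknown = labels[:, c] == -1
        filled[unknown, c] = (sims[unknown] > threshold).float()
    return filled

# Toy usage: 5 samples, 3 categories, some labels unknown (-1).
feats = torch.randn(5, 16)
labels = torch.tensor([[1., 0., -1.], [1., -1., 0.], [0., 1., -1.],
                       [-1., 1., 0.], [0., -1., 1.]])
print(estimate_missing_labels(feats, labels))
```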
arXiv Detail & Related papers (2024-07-26T05:29:24Z)
- Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
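A minimal sketch of combining the two granularities, assuming precomputed global and local embeddings, is below; the fusion weight and the max-over-regions alignment are common heuristics rather than this paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

def pair_similarity(img_global, txt_global, img_regions, txt_tokens, alpha=0.5):
    # Coarse score: cosine similarity between global image and text embeddings.
    coarse = F.cosine_similarity(img_global, txt_global, dim=-1)
    # Fine score: each text token is matched to its best image region,
    # then token scores are averaged.
    regions = F.normalize(img_regions, dim=-1)      # (R, D)
    tokens = F.normalize(txt_tokens, dim=-1)        # (T, D)
    fine = (tokens @ regions.t()).max(dim=-1).values.mean()
    return alpha * coarse + (1 - alpha) * fine

# Toy usage: 36 regions, 12 tokens, 512-d embeddings.
score = pair_similarity(torch.randn(512), torch.randn(512),
                        torch.randn(36, 512), torch.randn(12, 512))
print(float(score))
```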
arXiv Detail & Related papers (2023-06-15T00:19:13Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation with gold labels.
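As a hedged sketch of the word-alignment variant, the snippet below scores per-token differences from contextual embeddings of two related texts; treating one minus the best cosine match as the difference score is an illustrative simplification, not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def token_difference_scores(emb_a, emb_b):
    # emb_a: (Ta, D), emb_b: (Tb, D) contextual token embeddings (e.g. from a
    # masked language model) of two related documents.
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    sim = a @ b.t()                       # (Ta, Tb) cosine similarities
    # A token that aligns well to some token on the other side scores near 0;
    # a token with no good counterpart scores higher.
    return 1.0 - sim.max(dim=-1).values

# Toy usage: 7 tokens vs. 9 tokens, 768-d embeddings.
print(token_difference_scores(torch.randn(7, 768), torch.randn(9, 768)))
```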
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by significant BLEU margins on the task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- Flat Multi-modal Interaction Transformer for Named Entity Recognition [1.7605709999848573]
Multi-modal named entity recognition (MNER) aims at identifying entity spans and recognizing their categories in social media posts with the aid of images.
We propose a Flat Multi-modal Interaction Transformer (FMIT) for MNER.
We transform the fine-grained semantic representations of vision and text into a unified lattice structure and design a novel relative position encoding to match the different modalities in the Transformer.
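A loose sketch of flat single-sequence interaction is below: visual and textual features are projected into one token sequence with modality embeddings and passed through a standard Transformer encoder. FMIT's lattice structure and relative position encoding are not reproduced here; this only illustrates the flat fusion idea.

```python
import torch
import torch.nn as nn

class FlatFusion(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, d_model=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        self.modality = nn.Embedding(2, d_model)   # 0 = vision, 1 = text
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, V, vis_dim); txt_feats: (B, T, txt_dim).
        v = self.vis_proj(vis_feats) + self.modality.weight[0]
        t = self.txt_proj(txt_feats) + self.modality.weight[1]
        return self.encoder(torch.cat([v, t], dim=1))   # (B, V+T, d_model)

# Toy usage: 2 posts, 10 image regions and 16 text tokens each.
model = FlatFusion()
out = model(torch.randn(2, 10, 2048), torch.randn(2, 16, 768))
print(out.shape)
```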
arXiv Detail & Related papers (2022-08-23T15:25:44Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
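Of the two tasks, pseudo-labeled keyword prediction is easy to illustrate: a multi-label head over a keyword vocabulary is trained against pseudo-labels mined from captions. The vocabulary size, pooling, and labels below are toy assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, batch = 1000, 256, 8
keyword_head = nn.Linear(d_model, vocab_size)    # multi-label keyword classifier

pooled = torch.randn(batch, d_model)             # fused multimodal representation
pseudo_labels = torch.zeros(batch, vocab_size)   # toy pseudo-labels: one keyword each
pseudo_labels[torch.arange(batch), torch.randint(0, vocab_size, (batch,))] = 1.0

loss = F.binary_cross_entropy_with_logits(keyword_head(pooled), pseudo_labels)
print(float(loss))
```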
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experimental results show that our proposed method aligns well with the scores generated by other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- Step-Wise Hierarchical Alignment Network for Image-Text Matching [29.07229472373576]
We propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process.
Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and global-to-global alignment at the context level.
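The sketch below mirrors the three alignment levels with simple cosine scores over precomputed fragment and context embeddings; SHAN's actual step-wise attention-based reasoning is richer than this illustration, and the equal weighting is an assumption.

```python
import torch
import torch.nn.functional as F

def hierarchical_score(regions, words, img_global, txt_global):
    r = F.normalize(regions, dim=-1)               # (R, D) image fragments
    w = F.normalize(words, dim=-1)                 # (T, D) text fragments
    ig = F.normalize(img_global, dim=-1)           # (D,) image context
    tg = F.normalize(txt_global, dim=-1)           # (D,) text context
    # 1. Local-to-local: each word vs. its best-matching region.
    l2l = (w @ r.t()).max(dim=-1).values.mean()
    # 2. Global-to-local: global image vs. words, global text vs. regions.
    g2l = 0.5 * ((w @ ig).mean() + (r @ tg).mean())
    # 3. Global-to-global: cosine between the two context embeddings.
    g2g = ig @ tg
    return (l2l + g2l + g2g) / 3.0

print(float(hierarchical_score(torch.randn(36, 512), torch.randn(12, 512),
                               torch.randn(512), torch.randn(512))))
```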
arXiv Detail & Related papers (2021-06-11T17:05:56Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)