Towards Better Multi-modal Keyphrase Generation via Visual Entity
Enhancement and Multi-granularity Image Noise Filtering
- URL: http://arxiv.org/abs/2309.04734v1
- Date: Sat, 9 Sep 2023 09:41:36 GMT
- Title: Towards Better Multi-modal Keyphrase Generation via Visual Entity
Enhancement and Multi-granularity Image Noise Filtering
- Authors: Yifan Dong, Suhang Wu, Fandong Meng, Jie Zhou, Xiaoli Wang, Jianxin
Lin, and Jinsong Su
- Abstract summary: Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair.
The input text and image are often not perfectly matched, and thus the image may introduce noise into the model.
We propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise.
- Score: 79.44443231700201
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal keyphrase generation aims to produce a set of keyphrases that
represent the core points of the input text-image pair. In this regard,
dominant methods mainly focus on multi-modal fusion for keyphrase generation.
Nevertheless, two main drawbacks remain: 1) Only a limited number of sources,
such as image captions, can be utilized to provide auxiliary information, and
these may not be sufficient for the subsequent keyphrase generation. 2) The
input text and image are often not perfectly matched, and thus the image may
introduce noise into the model. To address these
limitations, in this paper, we propose a novel multi-modal keyphrase generation
model, which not only enriches the model input with external knowledge, but
also effectively filters image noise. First, we introduce external visual
entities of the image as the supplementary input to the model, which benefits
the cross-modal semantic alignment for keyphrase generation. Second, we
simultaneously calculate an image-text matching score and image region-text
correlation scores to perform multi-granularity image noise filtering.
In particular, we introduce correlation scores between image regions and
ground-truth keyphrases to refine the calculation of the aforementioned
correlation scores. To demonstrate the effectiveness of our model, we conduct
several groups of experiments on the benchmark dataset.
Experimental results and in-depth analyses show that our model achieves
state-of-the-art performance. Our code is available at
https://github.com/DeepLearnXMU/MM-MKP.
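The noise-filtering idea above can be summarized with a small PyTorch sketch: an image-level matching score gates the whole image, while region-level correlation scores re-weight individual region features before cross-modal fusion. The module names, dimensions, and scoring functions below are illustrative assumptions rather than the authors' released implementation (see the linked repository for that), and the training-time refinement with ground-truth keyphrases is omitted.

```python
import torch
import torch.nn as nn


class MultiGranularityFilter(nn.Module):
    """Hypothetical sketch of coarse- and fine-grained image noise filtering."""

    def __init__(self, text_dim=768, region_dim=2048, hidden=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)
        # Image-level matching head: compares pooled text with mean-pooled regions.
        self.match_head = nn.Linear(2 * hidden, 1)

    def forward(self, text_repr, region_feats):
        # text_repr:    (batch, text_dim)            pooled text representation
        # region_feats: (batch, regions, region_dim) detected image-region features
        t = self.text_proj(text_repr)       # (B, H)
        r = self.region_proj(region_feats)  # (B, R, H)

        # Coarse granularity: one image-text matching score per pair.
        global_img = r.mean(dim=1)
        match_score = torch.sigmoid(
            self.match_head(torch.cat([t, global_img], dim=-1)))  # (B, 1)

        # Fine granularity: a correlation score for every image region.
        region_scores = torch.softmax(
            torch.bmm(r, t.unsqueeze(-1)).squeeze(-1) / r.size(-1) ** 0.5,
            dim=-1)  # (B, R)

        # Down-weight noisy regions, then scale by the image-level matching score.
        filtered = r * region_scores.unsqueeze(-1) * match_score.unsqueeze(-1)
        return filtered, match_score, region_scores
```

In such a setup, the filtered region features would then be fused with the text representation and the external visual entities before being passed to the keyphrase decoder.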
Related papers
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- Sieve: Multimodal Dataset Pruning Using Image Captioning Models [11.362835828985494]
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets.
We argue that the common CLIPScore-based filtering of such data suffers from multiple limitations, including false positives and negatives due to CLIP's pretraining on noisy labels.
We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs.
arXiv Detail & Related papers (2023-10-03T14:53:53Z)
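As a rough illustration of such a captioning-based pruning signal, the sketch below captions each image with an off-the-shelf model and scores its agreement with the original alt-text using a sentence encoder; the model names and the 0.3 threshold are placeholders, not the paper's exact pipeline.

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# Placeholder models for illustration only.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def caption_alignment(image_path: str, alt_text: str) -> float:
    """Similarity between a synthetic caption for the image and its web alt-text."""
    caption = captioner(image_path)[0]["generated_text"]
    embs = encoder.encode([caption, alt_text], convert_to_tensor=True)
    return util.cos_sim(embs[0], embs[1]).item()


def keep_pair(image_path: str, alt_text: str, threshold: float = 0.3) -> bool:
    # Pairs whose alt-text disagrees with the synthetic caption get pruned.
    return caption_alignment(image_path, alt_text) >= threshold
```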
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Rethinking Benchmarks for Cross-modal Image-text Retrieval [44.31783230767321]
Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.
arXiv Detail & Related papers (2023-04-21T09:07:57Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Multimodal Neural Machine Translation with Search Engine Based Image Retrieval [4.662583832063716]
We propose an open-vocabulary image retrieval method to collect descriptive images for a bilingual parallel corpus.
Our proposed method achieves significant improvements over strong baselines.
arXiv Detail & Related papers (2022-07-26T08:42:06Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Multi-Image Summarization: Textual Summary from a Set of Cohesive Images [17.688344968462275]
This paper proposes the new task of multi-image summarization.
It aims to generate a concise and descriptive textual summary given a coherent set of input images.
A dense average image feature aggregation network allows the model to focus on a coherent subset of attributes.
arXiv Detail & Related papers (2020-06-15T18:45:35Z)
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) that processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves state-of-the-art performance on the Flickr30K dataset and competitive performance on the MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)