Visual Information Guided Zero-Shot Paraphrase Generation
- URL: http://arxiv.org/abs/2201.09107v1
- Date: Sat, 22 Jan 2022 18:10:39 GMT
- Title: Visual Information Guided Zero-Shot Paraphrase Generation
- Authors: Zhe Lin and Xiaojun Wan
- Abstract summary: We propose visual information guided zero-shot paraphrase generation (ViPG) based only on paired image-caption data.
It jointly trains an image captioning model and a paraphrasing model and leverages the image captioning model to guide the training of the paraphrasing model.
Both automatic evaluation and human evaluation show that our model can generate paraphrases with good relevancy, fluency and diversity.
- Score: 71.33405403748237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot paraphrase generation has drawn much attention because
large-scale, high-quality paraphrase corpora are scarce. Back-translation, also
known as the pivot-based method, is a typical approach to this end. Several works
leverage different kinds of information as the "pivot", such as language or
semantic representation. In this paper, we explore using visual information such
as images as the "pivot" of back-translation. Unlike the pipeline
back-translation method, we propose visual information guided zero-shot
paraphrase generation (ViPG) based only on paired image-caption data. It jointly
trains an image captioning model and a paraphrasing model and leverages the image
captioning model to guide the training of the paraphrasing model. Both automatic
evaluation and human evaluation show that our model can generate paraphrases with
good relevancy, fluency and diversity, and that images are a promising kind of
pivot for zero-shot paraphrase generation.
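The pipeline (pivot-based) back-translation that the abstract contrasts ViPG with can be illustrated as the composition of two mappings through a pivot. The following is a minimal sketch, assuming toy lookup tables in place of trained models; the names `paraphrase_via_pivot`, `EN_TO_PIVOT`, and `PIVOT_TO_EN` are hypothetical and do not come from the paper. It is not the ViPG method, which according to the abstract trains the captioning and paraphrasing models jointly rather than chaining them.

```python
from typing import Callable

def paraphrase_via_pivot(
    sentence: str,
    to_pivot: Callable[[str], str],
    from_pivot: Callable[[str], str],
) -> str:
    """Pipeline back-translation: source -> pivot -> source-language text."""
    pivot = to_pivot(sentence)   # e.g. translate into another language
    return from_pivot(pivot)     # map the pivot back into the source language

# Toy stand-ins for trained models (assumption: real systems would use
# translation models, or an image generator plus a captioner, here).
EN_TO_PIVOT = {"the movie was great": "le film était génial"}
PIVOT_TO_EN = {"le film était génial": "the film was wonderful"}

source = "the movie was great"
result = paraphrase_via_pivot(source, EN_TO_PIVOT.__getitem__, PIVOT_TO_EN.__getitem__)
print(result)
```

In this pipeline form, errors made in the first step propagate to the second, which is one common motivation for avoiding the chained setup.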
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
(2024-05-21T18:02:07Z)
- Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering [79.44443231700201]
Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair.
The input text and image are often not perfectly matched, and thus the image may introduce noise into the model.
We propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise.
(2023-09-09T09:41:36Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
(2023-08-16T12:39:39Z)
- Natural Scene Image Annotation Using Local Semantic Concepts and Spatial Bag of Visual Words [0.0]
This paper introduces a framework for automatically annotating natural scene images with local semantic labels from a predefined vocabulary.
The framework is based on a hypothesis that assumes that, in natural scenes, intermediate semantic concepts are correlated with the local keypoints.
Based on this hypothesis, image regions can be efficiently represented by a BOW model, and a machine learning approach, such as an SVM, can be used to label image regions with semantic annotations.
(2022-10-17T12:57:51Z)
- Vision Transformer Based Model for Describing a Set of Images as a Story [26.717033245063092]
We propose a novel Vision Transformer Based Model for describing a set of images as a story.
The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT).
The performance of our proposed model is evaluated using the Visual Story-Telling dataset (VIST).
(2022-10-06T09:01:50Z)
- Zero-Shot Video Captioning with Evolving Pseudo-Tokens [79.16706829968673]
We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model.
The matching score is used to steer the language model toward generating a sentence that has a high average matching score to a subset of the video frames.
Our experiments show that the generated captions are coherent and display a broad range of real-world knowledge.
(2022-07-22T14:19:31Z)
- Towards Multimodal Vision-Language Models Generating Non-Generic Text [2.102846336724103]
Vision-language models can assess visual context in an image and generate descriptive text.
Recent work has used optical character recognition to supplement visual information with text extracted from an image.
In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image but is not used by current models.
(2022-07-09T01:56:35Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
(2021-11-29T11:01:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.