L2C: Describing Visual Differences Needs Semantic Understanding of
Individuals
- URL: http://arxiv.org/abs/2102.01860v1
- Date: Wed, 3 Feb 2021 03:44:42 GMT
- Title: L2C: Describing Visual Differences Needs Semantic Understanding of
Individuals
- Authors: An Yan, Xin Eric Wang, Tsu-Jui Fu, William Yang Wang
- Abstract summary: We introduce a Learning-to-Compare model, which learns to understand the semantic structures of two images and compare them while learning to describe each one.
We demonstrate that L2C benefits from a comparison between explicit semantic representations and single-image captions, and generalizes better to new test image pairs.
- Score: 65.87728481187625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in language and vision push forward the research of
captioning a single image to describing visual differences between image pairs.
Suppose there are two images, I_1 and I_2, and the task is to generate a
description W_{1,2} comparing them. Existing methods directly model the
{ I_1, I_2 } -> W_{1,2} mapping without a semantic understanding of the
individuals. In this
paper, we introduce a Learning-to-Compare (L2C) model, which learns to
understand the semantic structures of these two images and compare them while
learning to describe each one. We demonstrate that L2C benefits from a
comparison between explicit semantic representations and single-image captions,
and generalizes better to new test image pairs. It outperforms the
baseline in both automatic and human evaluation on the
Birds-to-Words dataset.
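For intuition, here is a minimal PyTorch-style sketch of the training signal the abstract describes: a comparison caption is predicted from both images while auxiliary single-image heads force the model to describe each image on its own. The module names, single-token decoding, and loss weighting are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class L2CSketch(nn.Module):
    """Illustrative joint model: describe each image and compare the pair.

    A simplified stand-in for the Learning-to-Compare idea, not the
    authors' implementation (sequence decoding is reduced to one step).
    """

    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden)               # shared image encoder
        self.single_head = nn.Linear(hidden, vocab_size)        # single-image captioning
        self.compare_head = nn.Linear(2 * hidden, vocab_size)   # pair comparison

    def forward(self, img1_feat, img2_feat):
        h1, h2 = self.encode(img1_feat), self.encode(img2_feat)
        return (
            self.single_head(h1),                               # logits for caption of I_1
            self.single_head(h2),                               # logits for caption of I_2
            self.compare_head(torch.cat([h1, h2], dim=-1)),     # logits for W_{1,2}
        )

def joint_loss(logits, targets, alpha=0.5):
    """Comparison loss plus weighted auxiliary single-image captioning losses."""
    ce = nn.CrossEntropyLoss()
    l1, l2, lc = (ce(lg, tg) for lg, tg in zip(logits, targets))
    return lc + alpha * (l1 + l2)

# Toy usage with random features and single-token targets.
model = L2CSketch()
f1, f2 = torch.randn(4, 2048), torch.randn(4, 2048)
targets = [torch.randint(0, 10000, (4,)) for _ in range(3)]  # [caption_1, caption_2, comparison]
joint_loss(model(f1, f2), targets).backward()
```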
Related papers
- Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions.
We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption.
We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs.
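As a rough illustration of the "images sharing the same caption are positive pairs" objective above, here is a generic supervised-contrastive-style loss keyed by caption ID; it is a sketch under that assumption, not SynCLR's actual objective or code.

```python
import torch
import torch.nn.functional as F

def same_caption_contrastive(embeddings, caption_ids, temperature=0.1):
    """Contrastive loss treating images generated from the same caption as positives.

    embeddings:  (N, D) image embeddings from the vision encoder
    caption_ids: (N,) integer ID of the synthetic caption each image came from
    """
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                                   # pairwise similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))                 # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)      # log-softmax over the others
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()                         # skip anchors with no positive

# e.g. 8 synthetic images drawn from 4 captions, 2 images per caption
z = torch.randn(8, 256)
cap_ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(same_caption_contrastive(z, cap_ids))
```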
arXiv Detail & Related papers (2023-12-28T18:59:55Z)
- Describing Differences in Image Sets with Natural Language [101.80939666230168]
Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets.
We introduce VisDiff, which first captions the images and then prompts a language model to propose difference descriptions.
We are able to find interesting and previously unknown differences in datasets and models, demonstrating VisDiff's utility in revealing nuanced insights.
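The caption-then-propose pipeline described above can be sketched as follows; `caption_image` and `propose_differences` are hypothetical placeholders for an off-the-shelf captioner and a prompted language model, and any subsequent ranking or validation of the proposals is omitted.

```python
from typing import Callable, List

def describe_set_differences(
    set_a: List[str],
    set_b: List[str],
    caption_image: Callable[[str], str],              # placeholder: image path -> caption
    propose_differences: Callable[[str], List[str]],  # placeholder: prompt -> proposed differences
) -> List[str]:
    """Caption both image sets, then ask a language model for candidate differences."""
    captions_a = [caption_image(p) for p in set_a]
    captions_b = [caption_image(p) for p in set_b]
    prompt = (
        "Set A captions:\n" + "\n".join(captions_a)
        + "\n\nSet B captions:\n" + "\n".join(captions_b)
        + "\n\nList ways images in Set A differ from images in Set B."
    )
    return propose_differences(prompt)
```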
arXiv Detail & Related papers (2023-12-05T18:59:16Z)
- Semantically-Prompted Language Models Improve Visual Descriptions [12.267513953980092]
We propose V-GLOSS: Visual Glosses, a novel method for generating expressive visual descriptions.
We show that V-GLOSS improves visual descriptions and achieves strong results in the zero-shot setting on general and fine-grained image-classification datasets.
arXiv Detail & Related papers (2023-06-05T17:22:54Z)
- DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models [53.29993651680099]
We show that DALLE-2 does not follow the constraint that each word has a single role in the interpretation, and that it depicts both senses of a noun with multiple senses at once.
arXiv Detail & Related papers (2022-10-19T14:52:40Z)
- Image Difference Captioning with Pre-training and Contrastive Learning [45.59621065755761]
The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language.
The major challenges of this task lie in two aspects: 1) fine-grained visual differences that require a stronger vision-language association, and 2) the high cost of manual annotations.
We propose a new modeling framework following the pre-training-finetuning paradigm to address these challenges.
arXiv Detail & Related papers (2022-02-09T06:14:22Z)
- Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy that expands the views generated by a single image to cross-sample and multi-level representations.
Our method, termed CsMl, integrates multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
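For context on the instance-pair matching setup this last entry builds on, here is a standard hinge-based visual-semantic embedding loss over matched image-text pairs; it sketches the common baseline objective, not the consensus-aware model itself.

```python
import torch
import torch.nn.functional as F

def vse_hinge_loss(img_emb, txt_emb, margin=0.2):
    """Triplet hinge loss for image-text matching; row i of each input is a matched pair."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    scores = img @ txt.t()                                 # (N, N) cosine similarities
    pos = scores.diag().unsqueeze(1)                       # matched-pair scores

    cost_txt = (margin + scores - pos).clamp(min=0)        # image anchor vs. negative texts
    cost_img = (margin + scores - pos.t()).clamp(min=0)    # text anchor vs. negative images
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()
```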
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.