PhraseCut: Language-based Image Segmentation in the Wild
- URL: http://arxiv.org/abs/2008.01187v1
- Date: Mon, 3 Aug 2020 20:58:53 GMT
- Title: PhraseCut: Language-based Image Segmentation in the Wild
- Authors: Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, Subhransu Maji
- Abstract summary: We consider the problem of segmenting image regions given a natural language phrase.
Our dataset is collected on top of the Visual Genome dataset.
Our experiments show that the scale and diversity of concepts in our dataset pose significant challenges to the existing state of the art.
- Score: 62.643450401286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of segmenting image regions given a natural language
phrase, and study it on a novel dataset of 77,262 images and 345,486
phrase-region pairs. Our dataset is collected on top of the Visual Genome
dataset and uses the existing annotations to generate a challenging set of
referring phrases for which the corresponding regions are manually annotated.
Phrases in our dataset correspond to multiple regions and describe a large
number of object and stuff categories as well as their attributes such as
color, shape, parts, and relationships with other entities in the image. Our
experiments show that the scale and diversity of concepts in our dataset pose
significant challenges to the existing state of the art. We systematically
handle the long-tail nature of these concepts and present a modular approach to
combine category, attribute, and relationship cues that outperforms existing
approaches.
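The modular approach can be pictured as separate scoring heads, one per cue type, whose per-pixel maps are fused into a single segmentation heatmap. Below is a minimal NumPy sketch of that fusion step; the multiplicative combination and the `fuse_cue_maps` helper are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def fuse_cue_maps(category_map, attribute_maps, relation_map=None, eps=1e-6):
    """Fuse per-pixel score maps from separate cue modules into one heatmap.

    Each map has shape (H, W) with values in [0, 1]. Combining the cues
    multiplicatively means a pixel must satisfy every cue in the phrase.
    """
    fused = np.clip(category_map, eps, 1.0)
    for amap in attribute_maps:                 # e.g. maps for "red", "striped"
        fused = fused * np.clip(amap, eps, 1.0)
    if relation_map is not None:                # e.g. "next to the window"
        fused = fused * np.clip(relation_map, eps, 1.0)
    return fused

# Toy example for the phrase "red umbrella": random placeholder score maps.
h = w = 4
category = np.random.rand(h, w)    # output of a hypothetical "umbrella" module
attribute = np.random.rand(h, w)   # output of a hypothetical "red" module
mask = fuse_cue_maps(category, [attribute]) > 0.25   # threshold into a mask
print(mask.astype(int))
```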
Related papers
- Vocabulary-free Image Classification and Semantic Segmentation [71.78089106671581]
We introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained, language-induced semantic space to an input image without needing a known vocabulary.
VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories.
We propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database.
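Read literally, the training-free recipe is: retrieve the captions nearest to the image in a shared embedding space, pool candidate category words from them, and score the candidates against the image. A minimal sketch under those assumptions; the random-vector encoders and the naive word pooling are placeholders, not the actual CaSED implementation.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def encode_image(image):
    # Placeholder for a pre-trained CLIP-style image encoder.
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def encode_text(text):
    # Placeholder for the matching text encoder.
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def cased_classify(image, db_captions, db_embeddings, k=10, n_candidates=20):
    """Training-free, vocabulary-free classification in the spirit of CaSED:
    retrieve the k captions nearest to the image, pool candidate category
    words from them, then pick the candidate scoring highest vs. the image."""
    img = encode_image(image)
    top = np.argsort(-(db_embeddings @ img))[:k]        # nearest captions
    words = Counter(w for i in top for w in db_captions[i].lower().split())
    candidates = [w for w, _ in words.most_common(n_candidates)]
    return max(candidates, key=lambda c: float(encode_text(c) @ img))

# Toy external caption database with placeholder embeddings.
captions = ["a tabby cat on a sofa", "a golden retriever in a park"]
embs = np.stack([encode_text(c) for c in captions])
print(cased_classify("query.jpg", captions, embs))
```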
arXiv Detail & Related papers (2024-04-16T19:27:21Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any dense annotation effort.
Our method can directly segment objects of arbitrary categories and, on three benchmark datasets, outperforms zero-shot segmentation methods that require data labeling.
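At inference time, a pipeline like this typically assigns each pixel the category whose text embedding best matches that pixel's visual embedding. A small sketch of that assignment step with placeholder embeddings; the paper's contrasting-and-clustering training procedure is not shown.

```python
import numpy as np

def segment_by_text_similarity(pixel_embs, text_embs):
    """Assign each pixel the class whose text embedding is most similar.

    pixel_embs: (H, W, D) L2-normalized per-pixel visual embeddings
    text_embs:  (C, D)    L2-normalized text embeddings, one per class name
    Returns an (H, W) array of class indices.
    """
    sims = pixel_embs @ text_embs.T          # (H, W, C) cosine similarities
    return sims.argmax(axis=-1)

# Toy example with random placeholder embeddings for 3 class names.
rng = np.random.default_rng(0)
pix = rng.standard_normal((8, 8, 64))
pix /= np.linalg.norm(pix, axis=-1, keepdims=True)
txt = rng.standard_normal((3, 64))         # e.g. "sky", "tree", "road"
txt /= np.linalg.norm(txt, axis=-1, keepdims=True)
print(segment_by_text_similarity(pix, txt))
```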
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
- Guiding Attention using Partial-Order Relationships for Image Captioning [2.620091916172863]
A guided attention network exploits the relationship between the visual scene and its text descriptions.
A pairwise ranking objective trains this embedding space, drawing similar images, topics, and captions close together in the shared semantic space.
Experimental results on the MSCOCO dataset show the competitiveness of our approach.
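A pairwise ranking objective of this kind is commonly implemented as a margin-based hinge loss over matched and mismatched image-caption pairs; the sketch below follows that standard formulation, which may differ in detail from the paper's loss.

```python
import numpy as np

def pairwise_ranking_loss(img_embs, cap_embs, margin=0.2):
    """Margin-based pairwise ranking loss for a shared image-caption space.

    img_embs, cap_embs: (N, D) L2-normalized embeddings where row i of each
    matrix forms a matched pair; mismatched rows serve as negatives.
    """
    scores = img_embs @ cap_embs.T            # (N, N) similarity matrix
    pos = np.diag(scores)                     # matched-pair similarities
    # Hinge terms: each negative should score at least `margin` below the
    # positive for its image (rows) and for its caption (columns).
    cost_cap = np.maximum(0.0, margin + scores - pos[:, None])
    cost_img = np.maximum(0.0, margin + scores - pos[None, :])
    mask = 1.0 - np.eye(scores.shape[0])      # zero out the positive pairs
    return float(((cost_cap + cost_img) * mask).sum() / scores.shape[0])

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 32)); x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.standard_normal((5, 32)); y /= np.linalg.norm(y, axis=1, keepdims=True)
print(pairwise_ranking_loss(x, y))
```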
arXiv Detail & Related papers (2022-04-15T14:22:09Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Diverse Image Captioning with Context-Object Split Latent Spaces [22.95979735707003]
We introduce a novel factorization of the latent space, termed context-object split, to model diversity in contextual descriptions across images and texts.
Our framework not only enables diverse captioning through context-based pseudo supervision, but also extends this to images with novel objects and without paired captions in the training data.
arXiv Detail & Related papers (2020-11-02T13:33:20Z)
- Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents [17.672677325827454]
Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations.
We present a simple unsupervised clustering-based method that increases precision and recall beyond object detection and image tagging baselines.
The proposed method is particularly effective at capturing local contextual meanings of a word, for example associating "granite" with countertops in a real estate dataset and with rocky landscapes in a Wikipedia dataset.
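One way to realize this kind of clustering-based grounding: cluster the images, then score word-cluster associations (here with pointwise mutual information, an illustrative stand-in for the paper's scoring) so that "granite" attaches to countertop-like clusters in one corpus and rocky-landscape clusters in another. A minimal sketch under those assumptions:

```python
import numpy as np

def word_cluster_pmi(docs, cluster_ids, n_clusters, word):
    """Pointwise mutual information between a word and each image cluster.

    docs:        list of token lists, one per image's accompanying text
    cluster_ids: cluster index assigned to each image (same length as docs)
    A high PMI for cluster c means `word` co-occurs disproportionately with
    images in c, grounding the word's local contextual meaning.
    """
    n = len(docs)
    p_word = sum(word in d for d in docs) / n
    pmi = np.zeros(n_clusters)
    for c in range(n_clusters):
        in_c = [d for d, cid in zip(docs, cluster_ids) if cid == c]
        p_c = len(in_c) / n
        p_joint = sum(word in d for d in in_c) / n
        pmi[c] = np.log((p_joint + 1e-9) / (p_word * p_c + 1e-9))
    return pmi

docs = [["granite", "countertop"], ["granite", "cliff"], ["oak", "floor"]]
print(word_cluster_pmi(docs, cluster_ids=[0, 1, 0], n_clusters=2, word="granite"))
```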
arXiv Detail & Related papers (2020-10-30T16:39:49Z)
- Learning to Represent Image and Text with Denotation Graph [32.417311523031195]
We propose learning representations from a set of implied, visually grounded expressions between image and text.
We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations.
arXiv Detail & Related papers (2020-10-06T18:00:58Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level [6.47137925955334]
We propose to utilize spatial attention networks for image-level visual-textual fusion.
We refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query.
On the ReferIt referring expression dataset, our Multi-region Attention-assisted Grounding network (MAGNet) achieves over a 12% improvement over the state of the art.
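Image-level visual-textual fusion with spatial attention can be sketched as weighting each cell of the visual feature map by its affinity to the phrase embedding, then pooling. The snippet below is one plausible reading, not the exact MAGNet architecture.

```python
import numpy as np

def spatial_attention_fusion(feat_map, phrase_emb):
    """Weight each spatial location of a visual feature map by its affinity
    with a phrase embedding, then pool into a phrase-conditioned descriptor.

    feat_map:   (H, W, D) visual features
    phrase_emb: (D,)      phrase embedding in the same space
    """
    logits = feat_map @ phrase_emb                   # (H, W) affinity scores
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()                               # softmax over all H*W cells
    fused = (feat_map * attn[..., None]).sum(axis=(0, 1))   # (D,) descriptor
    return attn, fused

rng = np.random.default_rng(0)
attn, fused = spatial_attention_fusion(rng.standard_normal((7, 7, 128)),
                                       rng.standard_normal(128))
print(attn.shape, fused.shape)   # (7, 7) (128,)
```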
arXiv Detail & Related papers (2020-06-06T04:14:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.