Two-stage Visual Cues Enhancement Network for Referring Image
Segmentation
- URL: http://arxiv.org/abs/2110.04435v1
- Date: Sat, 9 Oct 2021 02:53:39 GMT
- Title: Two-stage Visual Cues Enhancement Network for Referring Image
Segmentation
- Authors: Yang Jiao, Zequn Jie, Weixin Luo, Jingjing Chen, Yu-Gang Jiang,
Xiaolin Wei, Lin Ma
- Abstract summary: Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
- Score: 89.49412325699537
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Image Segmentation (RIS) aims at segmenting the target object from
an image referred to by a given natural language expression. The diverse and
flexible expressions, as well as the complex visual contents of the images, place
higher demands on RIS models for investigating fine-grained matching behaviors
between words in expressions and objects presented in images. However, such
matching behaviors are hard to learn and capture when the visual cues of
referents (i.e. referred objects) are insufficient, as referents with weak
visual cues tend to be easily confused with the cluttered background at object
boundaries or even overwhelmed by salient objects in the image. Moreover, this
insufficient-visual-cues issue cannot be handled by the cross-modal fusion
mechanisms used in previous work. In this paper, we tackle the problem from the
novel perspective of enhancing the visual information of the referents by
devising a Two-stage Visual cues enhancement Network (TV-Net), in which a novel
Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature
Fusion (AMF) module are proposed. Through the two-stage enhancement, our
proposed TV-Net learns fine-grained matching behaviors between the natural
language expression and the image more effectively, especially when the visual
information of the referent is inadequate, and thus produces better segmentation
results. Extensive experiments validate the effectiveness of the proposed
method on the RIS task, with TV-Net surpassing state-of-the-art approaches on
four benchmark datasets.
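To make the two-stage idea above concrete, here is a minimal PyTorch sketch, not the authors' implementation: an expression-guided enrichment step stands in for the Retrieval and Enrichment Scheme (RES), and an adaptively weighted fusion of multi-resolution feature maps stands in for the AMF module. All module names, dimensions, and the attention formulation are assumptions made for illustration.

```python
# Minimal sketch of a two-stage visual cue enhancement pipeline (assumed form,
# not the TV-Net reference code): stage 1 re-weights and enriches visual
# features using the language embedding; stage 2 adaptively fuses
# multi-resolution feature maps for mask prediction.
from typing import List

import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpressionGuidedEnrichment(nn.Module):
    """Stage 1 (standing in for RES): amplify referent cues by attending
    spatial visual features with the sentence embedding, then adding an
    enriched residual."""

    def __init__(self, vis_dim: int, lang_dim: int):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, vis_dim)
        self.enrich = nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1)

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = vis_feat.shape                             # (B, C, H, W)
        query = self.lang_proj(lang_feat)                       # (B, C)
        sim = torch.einsum("bchw,bc->bhw", vis_feat, query)     # expression-region similarity
        attn = torch.softmax(sim.flatten(1), dim=1).view(b, 1, h, w)
        return vis_feat + self.enrich(vis_feat * attn)          # residual enrichment


class AdaptiveMultiResolutionFusion(nn.Module):
    """Stage 2 (standing in for AMF): upsample per-level features to a common
    resolution and blend them with spatially varying, learned weights."""

    def __init__(self, vis_dim: int, num_levels: int):
        super().__init__()
        self.gate = nn.Conv2d(vis_dim * num_levels, num_levels, kernel_size=1)

    def forward(self, feats: List[torch.Tensor]) -> torch.Tensor:
        size = feats[0].shape[-2:]
        ups = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
               for f in feats]
        weights = torch.softmax(self.gate(torch.cat(ups, dim=1)), dim=1)
        return sum(weights[:, i:i + 1] * ups[i] for i in range(len(ups)))


if __name__ == "__main__":
    # Toy inputs: two pyramid levels of visual features and one sentence embedding.
    levels = [torch.randn(2, 256, 32, 32), torch.randn(2, 256, 16, 16)]
    sentence = torch.randn(2, 768)
    res = ExpressionGuidedEnrichment(vis_dim=256, lang_dim=768)
    amf = AdaptiveMultiResolutionFusion(vis_dim=256, num_levels=2)
    fused = amf([res(f, sentence) for f in levels])
    print(fused.shape)  # torch.Size([2, 256, 32, 32])
```

One plausible reading of the design, under the assumptions above: the spatially varying fusion weights let coarse, semantically strong levels dominate where the referent's cues are weak, while finer levels sharpen its boundary.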
Related papers
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence [48.659338080020746]
Referring image segmentation (RIS) aims to segment objects in an image conditioned on free-form text descriptions.
We present a novel RIS approach, which substantially improves the generalization ability by addressing the two dilemmas mentioned above.
Specifically, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context.
arXiv Detail & Related papers (2023-12-01T09:31:24Z)
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
- LatteGAN: Visually Guided Language Attention for Multi-Turn Text-Conditioned Image Manipulation [0.0]
We present a novel architecture called the Visually Guided Language Attention GAN (LatteGAN).
LatteGAN extracts fine-grained text representations for the generator, and discriminates both the global and local representations of fake or real images.
Experiments on two distinct MTIM datasets, CoDraw and i-CLEVR, demonstrate the state-of-the-art performance of the proposed model.
arXiv Detail & Related papers (2021-12-28T03:50:03Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.