Dual Convolutional LSTM Network for Referring Image Segmentation
- URL: http://arxiv.org/abs/2001.11561v1
- Date: Thu, 30 Jan 2020 20:40:18 GMT
- Title: Dual Convolutional LSTM Network for Referring Image Segmentation
- Authors: Linwei Ye, Zhi Liu, Yang Wang
- Abstract summary: Referring image segmentation is a problem at the intersection of computer vision and natural language understanding.
We propose a dual convolutional LSTM (ConvLSTM) network to tackle this problem.
- Score: 18.181286443737417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider referring image segmentation. It is a problem at the intersection
of computer vision and natural language understanding. Given an input image and
a referring expression in the form of a natural language sentence, the goal is
to segment the object of interest in the image referred to by the linguistic
query. To this end, we propose a dual convolutional LSTM (ConvLSTM) network to
tackle this problem. Our model consists of an encoder network and a decoder
network, where ConvLSTM is used in both encoder and decoder networks to capture
spatial and sequential information. The encoder network extracts visual and
linguistic features for each word in the expression sentence, and adopts an
attention mechanism to focus on words that are more informative in the
multimodal interaction. The decoder network integrates the features generated
by the encoder network at multiple levels as its input and produces the final
precise segmentation mask. Experimental results on four challenging datasets
demonstrate that the proposed network achieves superior segmentation
performance compared with other state-of-the-art methods.
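For readers who want a concrete picture of the architecture, below is a minimal PyTorch sketch of a ConvLSTM cell and a word-attentive multimodal encoder in the spirit of the abstract; the module names, dimensions, and the simplified per-word attention are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLSTMCell(nn.Module):
    """An LSTM cell whose gates are 2-D convolutions, so the hidden state
    keeps a spatial layout."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One conv produces all four gates (input, forget, cell, output).
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], 1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class WordAttentiveEncoder(nn.Module):
    """Runs a ConvLSTM over the words of the expression and pools the
    per-word hidden maps with learned attention weights (a simplified
    stand-in for the paper's multimodal attention)."""
    def __init__(self, vis_ch, word_dim, hid_ch):
        super().__init__()
        self.cell = ConvLSTMCell(vis_ch + word_dim, hid_ch)
        self.score = nn.Linear(word_dim, 1)  # per-word attention logit

    def forward(self, vis, words):
        # vis: (B, vis_ch, H, W); words: (B, T, word_dim)
        B, _, H, W = vis.shape
        h = vis.new_zeros(B, self.cell.hid_ch, H, W)
        c = torch.zeros_like(h)
        hiddens = []
        for t in range(words.size(1)):
            # Tile the word embedding over the spatial grid and fuse.
            w = words[:, t, :, None, None].expand(-1, -1, H, W)
            h, c = self.cell(torch.cat([vis, w], 1), (h, c))
            hiddens.append(h)
        att = F.softmax(self.score(words), dim=1)        # (B, T, 1)
        stack = torch.stack(hiddens, 1)                  # (B, T, hid, H, W)
        return (att[..., None, None] * stack).sum(1)     # attended features
```

On the decoder side, the abstract indicates a second ConvLSTM that sequentially consumes the multi-level encoder features before predicting the mask; the same cell above could be reused for that pass.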
Related papers
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
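As a rough, hedged illustration of token-level masking with cross-modal reconstruction (one direction of the bidirectional scheme), the sketch below uses assumed shapes and a generic cross-attention reconstructor rather than BTMAE's actual modules.

```python
import torch
import torch.nn as nn

def mask_tokens(tokens, ratio=0.5):
    """Zero out a random fraction of tokens; returns masked tokens and the
    keep-mask. A toy stand-in for MAE-style masking."""
    B, N, _ = tokens.shape
    keep = torch.rand(B, N, device=tokens.device) > ratio
    return tokens * keep.unsqueeze(-1), keep

class CrossReconstructor(nn.Module):
    """Reconstructs masked tokens of one modality by attending to the other
    modality's tokens."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, masked, context):
        out, _ = self.attn(masked, context, context)  # query = masked side
        return self.proj(out)

# Toy usage: recover masked image tokens from language tokens.
img = torch.randn(2, 49, 256)   # assumed 7x7 patch tokens
txt = torch.randn(2, 12, 256)   # assumed 12 word tokens
masked_img, keep = mask_tokens(img)
recon = CrossReconstructor(256)(masked_img, txt)
loss = ((recon - img) ** 2)[~keep].mean()  # penalize only masked tokens
```

The language-to-image direction would mirror this with the two modalities' roles swapped.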
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
- Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using the Faster R-CNN with ResNet-101 as a backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
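A generic sketch of one attention-based decoding step over detector region features is given below; all names and dimensions are assumptions, and the reflective attention module is omitted for brevity.

```python
import torch
import torch.nn as nn

class AttentiveCaptionStep(nn.Module):
    """One GRU decoding step that attends over region features (e.g. from
    a Faster R-CNN detector) before predicting the next word."""
    def __init__(self, feat_dim, vocab, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(feat_dim, dim)   # map regions into model dim
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.gru = nn.GRUCell(2 * dim, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, word, h, regions):
        # word: (B,) token ids; h: (B, dim); regions: (B, R, feat_dim)
        r = self.proj(regions)
        ctx, _ = self.attn(h.unsqueeze(1), r, r)        # (B, 1, dim)
        h = self.gru(torch.cat([self.embed(word), ctx.squeeze(1)], -1), h)
        return self.out(h), h   # next-word logits and new hidden state
```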
arXiv Detail & Related papers (2022-06-16T07:56:28Z)
- LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [80.54244087314025]
We show that better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in a vision Transformer encoder network.
Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
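The sketch below conveys the general idea of early fusion, mixing linguistic context into the vision tokens inside an encoder stage rather than after the encoder; it is a simplified stand-in, not LAVT's exact fusion modules.

```python
import torch
import torch.nn as nn

class EarlyFusionStage(nn.Module):
    """One encoder stage that injects language into the vision features
    *before* the next vision block."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, vis_tokens, lang_tokens):
        # Vision tokens query the sentence; a learned gate controls how
        # much linguistic signal is mixed in at this stage.
        lang_ctx, _ = self.cross(vis_tokens, lang_tokens, lang_tokens)
        vis_tokens = vis_tokens + self.gate(lang_ctx) * lang_ctx
        return self.block(vis_tokens)
```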
arXiv Detail & Related papers (2021-12-04T04:53:35Z)
- UFO: A UniFied TransfOrmer for Vision-Language Representation Learning [54.82482779792115]
We propose a single UniFied transfOrmer (UFO) capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question) for vision-language (VL) representation learning.
Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks.
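A minimal sketch of the single-network idea, assuming pre-embedded tokens of a fixed dimension; the actual UFO embedding and training scheme differ.

```python
import torch
import torch.nn as nn

class UnifiedTransformer(nn.Module):
    """One transformer that accepts image tokens, text tokens, or their
    concatenation, so unimodal and multimodal inputs share all weights."""
    def __init__(self, dim=256, heads=4, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, image_tokens=None, text_tokens=None):
        parts = [t for t in (image_tokens, text_tokens) if t is not None]
        return self.encoder(torch.cat(parts, dim=1))

model = UnifiedTransformer()
img, txt = torch.randn(2, 49, 256), torch.randn(2, 12, 256)
uni = model(image_tokens=img)                     # unimodal pass
multi = model(image_tokens=img, text_tokens=txt)  # fused pass, same weights
```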
arXiv Detail & Related papers (2021-11-19T03:23:10Z)
- Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation [98.05643473345474]
We propose a novel decoder, termed dynamic neural representational decoder (NRD).
As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
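To illustrate what "representing label patches with compact neural networks" can mean in code, here is a hedged sketch in which every encoder location generates the weights of a tiny coordinate-to-logits MLP for its patch; the patch size, hidden width, and weight generator are assumptions, not NRD's exact design.

```python
import torch
import torch.nn as nn

class DynamicPatchDecoder(nn.Module):
    """Each encoder location emits the weights of a tiny MLP that maps the
    patch's relative (x, y) coordinates to class logits, so the label patch
    is represented by a compact per-location network."""
    def __init__(self, enc_ch, n_classes, patch=8, hidden=16):
        super().__init__()
        self.patch, self.hidden, self.n_classes = patch, hidden, n_classes
        n_params = 2 * hidden + hidden + hidden * n_classes + n_classes
        self.hyper = nn.Conv2d(enc_ch, n_params, 1)   # weight generator
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, patch),
                                torch.linspace(-1, 1, patch), indexing="ij")
        self.register_buffer("coords", torch.stack([xs, ys], -1).view(-1, 2))

    def forward(self, feat):                          # feat: (B, C, H, W)
        B, _, H, W = feat.shape
        p = self.hyper(feat).permute(0, 2, 3, 1).reshape(B * H * W, -1)
        h, c = self.hidden, self.n_classes
        w1, b1, w2, b2 = p.split([2 * h, h, h * c, c], dim=1)
        # Tiny per-location MLP: coords (P, 2) -> logits (P, c).
        z = torch.relu(self.coords @ w1.view(-1, 2, h) + b1.unsqueeze(1))
        logits = z @ w2.view(-1, h, c) + b2.unsqueeze(1)   # (B*H*W, P, c)
        # Reassemble the per-patch logits into a full-resolution map.
        logits = logits.view(B, H, W, self.patch, self.patch, c)
        return logits.permute(0, 5, 1, 3, 2, 4).reshape(
            B, c, H * self.patch, W * self.patch)
```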
arXiv Detail & Related papers (2021-07-30T04:50:56Z)
- Empirical Analysis of Image Caption Generation using Deep Learning [0.0]
We have implemented and experimented with various flavors of multi-modal image captioning networks.
The goal is to analyze the performance of each approach using various evaluation metrics.
arXiv Detail & Related papers (2021-05-14T05:38:13Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
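A generic co-attention block that updates the two modalities in parallel could look like the sketch below; it is a simplification for illustration, not EFN's exact formulation.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Each modality attends to the other, and both are refined from the
    pre-update states, so the two streams are updated in parallel."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.l2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2l = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lang):
        vis_new, _ = self.l2v(vis, lang, lang)   # vision queries language
        lang_new, _ = self.v2l(lang, vis, vis)   # language queries vision
        return vis + vis_new, lang + lang_new    # parallel residual update
```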
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images [6.171417925832851]
We introduce the Swin Transformer as the backbone to fully extract the context information.
We also design a novel decoder named densely connected feature aggregation module (DCFAM) to restore the resolution and generate the segmentation map.
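The sketch below shows one generic way to aggregate multi-scale encoder features into a segmentation map by upsampling every scale and fusing them in a single densely connected head; DCFAM's actual wiring differs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseAggregationDecoder(nn.Module):
    """Reduces each encoder scale, upsamples everything to the finest
    resolution, and predicts the segmentation map from the concatenation."""
    def __init__(self, chans, n_classes):
        super().__init__()
        self.reduce = nn.ModuleList([nn.Conv2d(c, 64, 1) for c in chans])
        self.head = nn.Conv2d(64 * len(chans), n_classes, 3, padding=1)

    def forward(self, feats):            # feats: coarse-to-fine maps
        size = feats[-1].shape[-2:]      # finest spatial resolution
        ups = [F.interpolate(r(f), size=size, mode="bilinear",
                             align_corners=False)
               for r, f in zip(self.reduce, feats)]
        return self.head(torch.cat(ups, dim=1))
```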
arXiv Detail & Related papers (2021-04-25T11:34:22Z)
- Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and per-layer efficiency.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
In order to further improve the architecture we introduce a weight function which aims to re-balance classes to increase the attention of the networks to under-represented objects.
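One common way to realize such a re-balancing weight function is inverse-frequency weighting of the cross-entropy loss, sketched below; the paper's exact weight function may differ.

```python
import torch
import torch.nn as nn

def class_weights_from_frequency(labels, n_classes, eps=1.0):
    """Inverse-frequency weights so under-represented classes contribute
    more to the loss."""
    counts = torch.bincount(labels.flatten(), minlength=n_classes).float()
    weights = 1.0 / (counts + eps)                # rare class -> big weight
    return weights * n_classes / weights.sum()    # normalize to mean ~1

labels = torch.randint(0, 5, (2, 64, 64))         # toy segmentation targets
w = class_weights_from_frequency(labels, n_classes=5)
criterion = nn.CrossEntropyLoss(weight=w)         # weighted per-class loss
loss = criterion(torch.randn(2, 5, 64, 64), labels)
```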
arXiv Detail & Related papers (2020-07-19T18:44:34Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
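Schematically, a shared encoder feeding two target-side decoders can be sketched as below; the linear heads are simplified stand-ins for full sequence decoders, and all names are assumptions.

```python
import torch
import torch.nn as nn

class BiDecoderNet(nn.Module):
    """One shared encoder with two decoders, one per target end; training
    both ends pushes the encoder toward a language-independent space."""
    def __init__(self, vocab_src, vocab_a, vocab_b, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_src, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.dec_a = nn.Linear(dim, vocab_a)   # stand-in decoder head A
        self.dec_b = nn.Linear(dim, vocab_b)   # stand-in decoder head B

    def forward(self, src):                    # src: (B, T) token ids
        enc, _ = self.encoder(self.embed(src)) # shared semantic space
        return self.dec_a(enc), self.dec_b(enc)
```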
arXiv Detail & Related papers (2020-01-14T02:05:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.