Linguistic Structure Guided Context Modeling for Referring Image Segmentation
- URL: http://arxiv.org/abs/2010.00515v3
- Date: Mon, 5 Oct 2020 08:49:43 GMT
- Title: Linguistic Structure Guided Context Modeling for Referring Image Segmentation
- Authors: Tianrui Hui, Si Liu, Shaofei Huang, Guanbin Li, Sansi Yu, Faxi Zhang,
Jizhong Han
- Abstract summary: We propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction.
Our LSCM module builds a Dependency Parsing Tree suppressed Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence.
- Score: 61.701577239317785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring image segmentation aims to predict the foreground mask of the
object referred by a natural language sentence. Multimodal context of the
sentence is crucial to distinguish the referent from the background. Existing
methods either insufficiently or redundantly model the multimodal context. To
tackle this problem, we propose a "gather-propagate-distribute" scheme to model
multimodal context by cross-modal interaction and implement this scheme as a
novel Linguistic Structure guided Context Modeling (LSCM) module. Our LSCM
module builds a Dependency Parsing Tree suppressed Word Graph (DPT-WG) which
guides all the words to include valid multimodal context of the sentence while
excluding disturbing ones through three steps over the multimodal feature,
i.e., gathering, constrained propagation and distributing. Extensive
experiments on four benchmarks demonstrate that our method outperforms all the
previous state-of-the-arts.
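The abstract's three steps over the multimodal feature can be sketched in code. Below is a minimal numpy illustration of one gather-propagate-distribute pass, not the paper's actual LSCM implementation: the attention pooling, the normalized message passing, and the residual add are all simplifying assumptions made for clarity.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def lscm_step(visual_feats, word_feats, adjacency):
    """One gather-propagate-distribute pass (simplified sketch).

    visual_feats: (HW, C) flattened spatial multimodal features
    word_feats:   (T, C)  word embeddings for the T sentence words
    adjacency:    (T, T)  word-graph edges, e.g. from a dependency parse
    """
    # Gather: attention-pool spatial features into one node per word
    attn = softmax(word_feats @ visual_feats.T, axis=-1)   # (T, HW)
    nodes = attn @ visual_feats                            # (T, C)
    # Propagate: one step of degree-normalized message passing on the
    # word graph, so each word only mixes context from its graph neighbors
    deg = adjacency.sum(axis=1, keepdims=True) + 1e-6
    nodes = (adjacency / deg) @ nodes                      # (T, C)
    # Distribute: scatter word-node context back to spatial positions
    context = attn.T @ nodes                               # (HW, C)
    return visual_feats + context
```

With an all-zero adjacency no context propagates, so the pass reduces to the identity, which makes the role of the word graph explicit: it alone decides which words exchange multimodal context.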
Related papers
- FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
arXiv Detail & Related papers (2024-08-09T02:16:37Z)
- Shapley Value-based Contrastive Alignment for Multimodal Information Extraction [17.04865437165252]
We introduce a new paradigm of Image-Context-Text interaction.
We propose a novel Shapley Value-based Contrastive Alignment (Shap-CA) method.
Our method significantly outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2024-07-25T08:15:43Z)
- A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
We conduct extensive experiments on two corresponding data sets and experimental results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z)
- Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment [80.18786847090522]
We propose a Semantics-Consistent Cross-domain Summarization model based on optimal transport alignment with visual and textual segmentation.
We evaluate our method on three recent multimodal datasets and demonstrate its effectiveness in producing high-quality multimodal summaries.
arXiv Detail & Related papers (2022-10-10T14:27:10Z)
- Support-set based Multi-modal Representation Enhancement for Video Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
arXiv Detail & Related papers (2022-05-19T03:40:29Z)
- CMF: Cascaded Multi-model Fusion for Referring Image Segmentation [24.942658173937563]
We address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression.
We propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel.
Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods.
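The CMF summary above hinges on running atrous (dilated) convolutions with different rates in parallel so each position sees context at several receptive-field sizes. A minimal 1-D numpy sketch of that parallel-dilation idea follows; it is not the paper's actual 2-D module, and the zero padding, kernel values, and sum-fusion are illustrative assumptions.

```python
import numpy as np

def atrous_conv1d(x, kernel, dilation):
    """'Same'-length 1-D dilated (atrous) convolution with zero padding.

    x:        (L,) input signal
    kernel:   (K,) filter taps
    dilation: spacing between taps; dilation=1 is an ordinary convolution
    """
    K = len(kernel)
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, pad)
    # Each output position mixes K inputs spaced `dilation` apart,
    # so the receptive field grows with the rate at no extra parameters.
    return np.array([
        sum(kernel[k] * xp[i + k * dilation] for k in range(K))
        for i in range(len(x))
    ])

def parallel_atrous_fusion(x, kernels, dilations):
    """Run several atrous branches in parallel and fuse them by summation."""
    return sum(atrous_conv1d(x, k, d) for k, d in zip(kernels, dilations))
```

A branch with the identity kernel `[0, 1, 0]` returns the input unchanged at any dilation rate, which is a handy sanity check that the padding keeps the branches aligned before fusion.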
arXiv Detail & Related papers (2021-06-16T08:18:39Z)
- Comprehensive Multi-Modal Interactions for Referring Image Segmentation [7.064383217512461]
We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the given natural language description.
To solve RIS efficiently, we need to understand each word's relationship with other words, each region in the image to other regions, and cross-modal alignment between linguistic and visual domains.
We propose a Joint Reasoning (JRM) module and a novel Cross-Modal Multi-Level Fusion (CMMLF) module for tackling this task.
arXiv Detail & Related papers (2021-04-21T08:45:09Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.