MUTATT: Visual-Textual Mutual Guidance for Referring Expression Comprehension
- URL: http://arxiv.org/abs/2003.08027v2
- Date: Fri, 20 Mar 2020 05:01:15 GMT
- Title: MUTATT: Visual-Textual Mutual Guidance for Referring Expression Comprehension
- Authors: Shuai Wang, Fan Lyu, Wei Feng, and Song Wang
- Abstract summary: Referring expression comprehension aims to localize a text-related region in a given image by a referring expression in natural language.
We argue that for REC the referring expression and the target region are semantically correlated.
We propose a novel approach called MutAtt to construct mutual guidance between vision and language.
- Score: 16.66775734538439
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring expression comprehension (REC) aims to localize a text-related
region in a given image by a referring expression in natural language. Existing
methods focus on building convincing visual and language representations
independently, which may significantly isolate visual and language information.
In this paper, we argue that for REC the referring expression and the target
region are semantically correlated, and that subject, location and relationship
consistency exists between vision and language. On top of this, we propose a
novel approach called MutAtt to construct mutual guidance between vision and
language, which treats vision and language equally and thus yields compact
information matching. Specifically, for each of the subject, location and
relationship modules, MutAtt builds two kinds of attention-based mutual guidance
strategies. One strategy generates a vision-guided language embedding to match
the relevant visual features; the other reversely generates a language-guided
visual feature to match the relevant language embedding. This mutual guidance
strategy effectively guarantees vision-language consistency across the three
modules. Experiments on three popular REC datasets demonstrate that the proposed
approach outperforms current state-of-the-art methods.
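The two guidance strategies described in the abstract can be pictured as cross-attention applied in both directions over the same pair of features. The sketch below is a minimal, illustrative reading under assumed shapes and layer names, not the authors' released implementation: `vis` holds candidate-region features, `lang` holds word embeddings, and each direction re-weights one modality with the other before a simple cosine matching score is computed.

```python
# Illustrative sketch of bidirectional (mutual) attention guidance.
# Assumed shapes and layer names; not the official MutAtt implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MutualGuidance(nn.Module):
    def __init__(self, vis_dim=1024, lang_dim=512, hid_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)    # project region features
        self.lang_proj = nn.Linear(lang_dim, hid_dim)  # project word embeddings

    def forward(self, vis, lang):
        # vis:  (num_regions, vis_dim)   candidate-region features for one module
        # lang: (num_words, lang_dim)    word embeddings of the expression
        v = self.vis_proj(vis)    # (R, H)
        l = self.lang_proj(lang)  # (W, H)

        # Strategy 1: vision-guided language embedding.
        # Each region attends over the words, producing a language vector
        # tailored to that region, which is then matched back to the region.
        attn_v2l = F.softmax(v @ l.t(), dim=-1)        # (R, W)
        lang_guided_by_vis = attn_v2l @ l              # (R, H)

        # Strategy 2: language-guided visual feature.
        # The expression (mean-pooled here for simplicity) attends over the
        # regions, producing a visual vector matched back to the language.
        q = l.mean(dim=0, keepdim=True)                # (1, H)
        attn_l2v = F.softmax(q @ v.t(), dim=-1)        # (1, R)
        vis_guided_by_lang = attn_l2v @ v              # (1, H)

        # Simple matching scores (cosine similarity) in both directions.
        score_per_region = F.cosine_similarity(v, lang_guided_by_vis, dim=-1)  # (R,)
        score_global = F.cosine_similarity(q, vis_guided_by_lang, dim=-1)      # (1,)
        return score_per_region, score_global


# Toy usage with random features.
regions = torch.randn(5, 1024)   # 5 candidate regions
words = torch.randn(8, 512)      # 8-word referring expression
scores, _ = MutualGuidance()(regions, words)
print(scores.argmax().item())    # index of the best-matching region
```

In the paper this kind of matching would be computed separately in the subject, location and relationship modules and combined to rank candidate regions; the single module and plain cosine score above are simplifications for illustration.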
Related papers
- Context-Aware Integration of Language and Visual References for Natural Language Tracking [27.3884348078998]
Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame.
We propose a joint multi-modal tracking framework with 1) a prompt module to leverage the complementarity between temporal visual templates and language expressions, enabling precise and context-aware appearance and linguistic cues.
This design ensures temporal consistency by leveraging historical visual information and provides an integrated solution that generates predictions in a single step.
arXiv Detail & Related papers (2024-03-29T04:58:33Z) - SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z) - RISAM: Referring Image Segmentation via Mutual-Aware Attention Features [13.64992652002458]
Referring image segmentation (RIS) aims to segment a particular region based on a language expression prompt.
Existing methods incorporate linguistic features into visual features and obtain multi-modal features for mask decoding.
We propose MARIS, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism.
arXiv Detail & Related papers (2023-11-27T11:24:25Z) - VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search [51.9899504535878]
We propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search.
In VGSG, a vision-guided attention is employed to extract visual-related textual features.
With the help of relational knowledge transfer, VGKT is capable of aligning semantic-group textual features with corresponding visual features.
arXiv Detail & Related papers (2023-11-13T17:56:54Z) - Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z) - ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z) - Learning Point-Language Hierarchical Alignment for 3D Visual Grounding [35.17185775314988]
This paper presents a novel hierarchical alignment model (HAM) that learns multi-granularity visual and linguistic representations in an end-to-end manner.
We extract key points and proposal points to model 3D contexts and instances, and propose point-language alignment with context modulation.
To further capture both global and local relationships, we propose a spatially multi-granular modeling scheme.
arXiv Detail & Related papers (2022-10-22T18:02:10Z) - From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z) - Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions.
We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision.
Experiments show that by taking advantage of the relationships we are able to improve over the state of the art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z) - Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
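To make the probing setup in the last entry above concrete, the toy sketch below frames it as retrieval over frozen features: each text embedding should score its matching image-patch embedding above all distractors. The function name, the shared 512-dimensional feature space, and the cosine-similarity scorer are illustrative assumptions, not the paper's actual probing model.

```python
# Toy illustration of a matching-vs-non-matching probe over frozen features.
# Feature arrays are assumed pre-extracted and already projected into a
# shared space; this is not the paper's probe.
import numpy as np


def probe_retrieval_accuracy(text_feats: np.ndarray, patch_feats: np.ndarray) -> float:
    """text_feats[i] is the frozen text embedding whose matching image patch
    is patch_feats[i]; every other patch is a non-matching distractor.
    Returns the fraction of texts whose matching patch is ranked first."""
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    sims = t @ p.T                      # cosine similarities, shape (N, N)
    predicted = sims.argmax(axis=1)     # best-scoring patch per text
    return float((predicted == np.arange(len(t))).mean())


# Random stand-in features; real probes would use model-derived embeddings.
rng = np.random.default_rng(0)
texts = rng.standard_normal((100, 512))
patches = texts + 0.5 * rng.standard_normal((100, 512))  # noisy "matching" patches
print(probe_retrieval_accuracy(texts, patches))
```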