SLAN: Self-Locator Aided Network for Cross-Modal Understanding
- URL: http://arxiv.org/abs/2211.16208v1
- Date: Mon, 28 Nov 2022 11:42:23 GMT
- Title: SLAN: Self-Locator Aided Network for Cross-Modal Understanding
- Authors: Jiang-Tian Zhai, Qi Zhang, Tong Wu, Xing-Yu Chen, Jiang-Jiang Liu, Bo
Ren, Ming-Ming Cheng
- Abstract summary: We propose Self-Locator Aided Network (SLAN) for cross-modal understanding tasks.
SLAN consists of a region filter and a region adaptor to localize regions of interest conditioned on different texts.
It achieves fairly competitive results on five cross-modal understanding tasks.
- Score: 89.20623874655352
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning the fine-grained interplay between vision and language allows a more
accurate understanding of Vision-Language tasks. However, it remains
challenging to extract key image regions according to the texts for semantic
alignment. Most existing works are either limited by the text-agnostic and
redundant regions obtained with frozen detectors, or fail to scale
further due to their heavy reliance on scarce grounding (gold) data to pre-train
detectors. To solve these problems, we propose Self-Locator Aided Network
(SLAN) for cross-modal understanding tasks without any extra gold data. SLAN
consists of a region filter and a region adaptor to localize regions of
interest conditioned on different texts. By aggregating cross-modal
information, the region filter selects key regions and the region adaptor
updates their coordinates with text guidance. With detailed region-word
alignments, SLAN can be easily generalized to many downstream tasks. It
achieves fairly competitive results on five cross-modal understanding tasks
(e.g., 85.7% and 69.2% on COCO image-to-text and text-to-image retrieval,
surpassing previous SOTA methods). SLAN also demonstrates strong zero-shot and
fine-tuned transferability to two localization tasks.
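The abstract sketches two text-conditioned modules: a region filter that aggregates cross-modal information to keep key regions, and a region adaptor that refines their coordinates under text guidance. Below is a minimal PyTorch sketch of how such modules could be wired together; all class names, feature shapes, and the fusion and offset heads are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class RegionFilter(nn.Module):
    """Scores candidate regions against the text and keeps the top-k.

    Hypothetical sketch: the scoring head and the simple multiplicative
    fusion are assumptions, not SLAN's actual design.
    """
    def __init__(self, dim, top_k=16):
        super().__init__()
        self.top_k = top_k
        self.score = nn.Linear(dim, 1)

    def forward(self, region_feats, text_feat):
        # region_feats: (B, N, D) candidate region features
        # text_feat:    (B, D)    pooled text feature
        fused = region_feats * text_feat.unsqueeze(1)         # cross-modal aggregation
        scores = self.score(fused).squeeze(-1)                # (B, N) relevance scores
        idx = scores.topk(self.top_k, dim=1).indices          # indices of key regions
        kept = torch.gather(
            region_feats, 1,
            idx.unsqueeze(-1).expand(-1, -1, region_feats.size(-1)))
        return kept, idx

class RegionAdaptor(nn.Module):
    """Predicts text-guided coordinate offsets for the kept regions."""
    def __init__(self, dim):
        super().__init__()
        self.delta = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, kept_feats, text_feat, boxes):
        # kept_feats: (B, K, D), boxes: (B, K, 4) normalized (cx, cy, w, h)
        text = text_feat.unsqueeze(1).expand_as(kept_feats)
        offsets = self.delta(torch.cat([kept_feats, text], dim=-1))
        return (boxes + offsets).clamp(0.0, 1.0)              # refined coordinates

# Toy usage with random tensors (2 images, 36 candidate regions, 256-dim features)
B, N, D = 2, 36, 256
filt, adapt = RegionFilter(D, top_k=8), RegionAdaptor(D)
regions, text, boxes = torch.randn(B, N, D), torch.randn(B, D), torch.rand(B, N, 4)
kept, idx = filt(regions, text)
kept_boxes = torch.gather(boxes, 1, idx.unsqueeze(-1).expand(-1, -1, 4))
refined = adapt(kept, text, kept_boxes)                       # (2, 8, 4) text-refined boxes
```

The sketch keeps the two responsibilities separate, mirroring the filter-then-adapt description in the abstract; a real system would place such modules between the visual backbone and the cross-modal encoder.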
Related papers
- MENTOR: Multilingual tExt detectioN TOward leaRning by analogy [59.37382045577384]
We propose a framework to detect and identify both seen and unseen language regions inside scene images.
"MENTOR" is the first work to realize a learning strategy between zero-shot learning and few-shot learning for multilingual scene text detection.
arXiv Detail & Related papers (2024-03-12T03:35:17Z) - SDPL: Shifting-Dense Partition Learning for UAV-View Geo-Localization [27.131867916908156]
Cross-view geo-localization aims to match images of the same target from different platforms.
We introduce a part-based representation learning scheme, shifting-dense partition learning (SDPL).
We show that SDPL is robust to position shifting, and performs competitively on two prevailing benchmarks.
arXiv Detail & Related papers (2024-03-07T03:07:54Z) - RegionGPT: Towards Region Understanding Vision Language Model [88.42271128373191]
RegionGPT (short as RGPT) is a novel framework designed for complex region-level captioning and understanding.
We develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions.
We demonstrate that a universal RGPT model can be effectively applied and significantly enhances performance across a range of region-level tasks.
arXiv Detail & Related papers (2024-03-04T18:58:08Z) - Question-Answer Cross Language Image Matching for Weakly Supervised
Semantic Segmentation [37.15828464616587]
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation.
We propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS).
arXiv Detail & Related papers (2024-01-18T10:55:13Z) - CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
arXiv Detail & Related papers (2023-12-18T17:39:47Z) - A Transformer-Based Feature Segmentation and Region Alignment Method For
UAV-View Geo-Localization [0.5257115841810257]
Cross-view geo-localization is the task of matching images of the same geographic target taken from different views.
Existing methods are mainly aimed at digging for more comprehensive fine-grained information.
We introduce a simple and efficient transformer-based structure called Feature Segmentation and Region Alignment (FSRA) to enhance the model's ability to understand contextual information.
arXiv Detail & Related papers (2022-01-23T08:01:42Z) - RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z) - MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language
Queries at Phrase Level [6.47137925955334]
We propose to utilize spatial attention networks for image-level visual-textual fusion.
We refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query.
On the referring expression dataset ReferIt, our Multi-region Attention-assisted Grounding network (MAGNet) achieves over a 12% improvement over the state of the art.
arXiv Detail & Related papers (2020-06-06T04:14:15Z) - ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene
Text Detection [147.10751375922035]
We propose the ContourNet, which effectively handles false positives and large scale variance of scene texts.
Our method effectively suppresses these false positives by only outputting predictions with high response values in both directions.
arXiv Detail & Related papers (2020-04-10T08:15:23Z)