3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting
- URL: http://arxiv.org/abs/2404.17273v1
- Date: Fri, 26 Apr 2024 09:25:18 GMT
- Title: 3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting
- Authors: Xuri Ge, Songpei Xu, Fuhai Chen, Jie Wang, Guoxin Wang, Shan An, Joemon M. Jose
- Abstract summary: We propose a visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency and high-generalization image-sentence retrieval.
3SHNet highlights prominent objects and their spatial locations within the visual modality.
Experiments conducted on MS-COCO and Flickr30K benchmarks substantiate the superior performances, inference efficiency and generalization of the proposed 3SHNet.
- Score: 12.770499009990864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency and high-generalization image-sentence retrieval. 3SHNet highlights prominent objects and their spatial locations within the visual modality, thus allowing the integration of visual semantic-spatial interactions while maintaining independence between the two modalities. This integration effectively combines object regions with the corresponding semantic and position layouts derived from segmentation to enhance the visual representation, while the modality independence guarantees efficiency and generalization. Additionally, 3SHNet utilizes the structured contextual visual scene information from segmentation to provide local (region-based) or global (grid-based) guidance and achieve accurate hybrid-level retrieval. Extensive experiments conducted on the MS-COCO and Flickr30K benchmarks substantiate the superior performance, inference efficiency and generalization of the proposed 3SHNet compared with contemporary state-of-the-art methods. Specifically, on the larger MS-COCO 5K test set, we achieve 16.3%, 24.8%, and 18.3% improvements in rSum score over the state-of-the-art methods using different image representations, while maintaining optimal retrieval efficiency. Moreover, our performance on cross-dataset generalization improves by 18.6%. Data and code are available at https://github.com/XuriGe1995/3SHNet.
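The fusion the abstract describes, combining object-region features with semantic and position layouts derived from segmentation, can be sketched roughly as follows. This is a minimal illustration under assumed shapes and hypothetical weight names, not the released 3SHNet implementation:

```python
import numpy as np

def semantic_spatial_highlight(region_feats, seg_semantic, seg_spatial, w_sem, w_spa):
    """Hypothetical sketch: weight each object region by semantic and
    spatial cues taken from a segmentation map, then pool the re-weighted
    regions into one visual embedding.

    region_feats: (R, D) region embeddings from an object detector
    seg_semantic: (R, C) per-region class distribution from segmentation
    seg_spatial:  (R, 4) per-region normalized position layout (x, y, w, h)
    w_sem: (C,) projection of semantic cues to a scalar score (assumed learned)
    w_spa: (4,) projection of spatial cues to a scalar score (assumed learned)
    """
    # Scalar saliency score per region from semantic + spatial cues
    scores = seg_semantic @ w_sem + seg_spatial @ w_spa   # (R,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax over regions
    # Self-highlight: re-weight regions using only visual-side information,
    # so the text modality stays independent until the final matching step
    highlighted = region_feats * weights[:, None]         # (R, D)
    return highlighted.sum(axis=0)                        # pooled visual embedding
```

Because the highlighting uses only visual-side inputs, image embeddings can be precomputed offline, which is the property the abstract credits for retrieval efficiency.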
Related papers
- Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning [82.39668822222386]
Vision token pruning has proven to be an effective acceleration technique for efficient Vision Language Models (VLMs).
We propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity.
Experiments demonstrate that Nüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
arXiv Detail & Related papers (2026-02-03T00:51:03Z) - A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP [12.96248884328754]
This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition.
It uses EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition.
It achieves state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score.
arXiv Detail & Related papers (2025-10-22T07:54:18Z) - Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively.
It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z) - HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment [16.926158907882012]
We propose a unified Vision-Language framework that integrates domain-invariant text embeddings as object queries in a transformer-based segmentation network.
Our results show that language-guided segmentation bridges the label efficiency gap and enables new levels of fine-grained generalization.
arXiv Detail & Related papers (2025-06-16T19:05:33Z) - Hierarchical Scoring with 3D Gaussian Splatting for Instance Image-Goal Navigation [27.040017548286812]
Instance Image-Goal Navigation (IIN) requires autonomous agents to identify and navigate to a target object or location depicted in a reference image captured from any viewpoint.
We introduce a novel IIN framework with a hierarchical scoring paradigm that estimates optimal viewpoints for target matching.
arXiv Detail & Related papers (2025-06-09T00:58:14Z) - SaliencyI2PLoc: saliency-guided image-point cloud localization using contrastive learning [17.29563451509921]
SaliencyI2PLoc is a contrastive learning architecture that fuses the saliency map into feature aggregation.
Our method achieves a Recall@1 of 78.92% and a Recall@20 of 97.59% on the urban scenario evaluation dataset.
arXiv Detail & Related papers (2024-12-20T05:20:10Z) - Efficient Semantic Splatting for Remote Sensing Multi-view Segmentation [29.621022493810088]
We propose a novel semantic splatting approach based on Gaussian Splatting to achieve efficient, low-latency multi-view segmentation.
Our method projects the RGB attributes and semantic features of point clouds onto the image plane, simultaneously rendering RGB images and semantic segmentation results.
arXiv Detail & Related papers (2024-12-08T15:28:30Z) - EnTri: Ensemble Learning with Tri-level Representations for Explainable Scene Recognition [27.199124692225777]
Deep-learning-based scene recognition has made significant progress, but its performance still has limitations.
We propose EnTri, a framework that employs ensemble learning using a hierarchy of visual features.
EnTri has demonstrated superiority in terms of recognition accuracy, achieving competitive performance compared to state-of-the-art approaches.
arXiv Detail & Related papers (2023-07-23T22:11:23Z) - DCN-T: Dual Context Network with Transformer for Hyperspectral Image Classification [109.09061514799413]
Hyperspectral image (HSI) classification is challenging due to spatial variability caused by complex imaging conditions.
We propose a tri-spectral image generation pipeline that transforms HSI into high-quality tri-spectral images.
Our proposed method outperforms state-of-the-art methods for HSI classification.
arXiv Detail & Related papers (2023-04-19T18:32:52Z) - Domain Adaptive Semantic Segmentation by Optimal Transport [13.133890240271308]
Semantic scene segmentation has received a great deal of attention due to the richness of the semantic information it contains.
Current approaches are mainly based on convolutional neural networks (CNN), but they rely on a large number of labels.
We propose a domain adaptation (DA) framework based on optimal transport (OT) and attention mechanism to address this issue.
arXiv Detail & Related papers (2023-03-29T03:33:54Z) - HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z) - Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z) - Three Ways to Improve Semantic Segmentation with Self-Supervised Depth Estimation [90.87105131054419]
We present a framework for semi-supervised semantic segmentation, which is enhanced by self-supervised monocular depth estimation from unlabeled image sequences.
We validate the proposed model on the Cityscapes dataset, where all three modules demonstrate significant performance gains.
arXiv Detail & Related papers (2020-12-19T21:18:03Z) - Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
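The pixel-level objective this entry describes, a pairwise contrastive loss between corresponding pixels of two views, can be sketched as an InfoNCE loss. This is an illustrative simplification assuming pre-matched pixel pairs and a hypothetical temperature `tau`; DenseCL's actual correspondence extraction and momentum dictionary differ:

```python
import numpy as np

def dense_contrastive_loss(view1, view2, tau=0.1):
    """Illustrative pixel-level InfoNCE loss between two views of an image.

    view1, view2: (P, D) L2-normalized embeddings for P corresponding pixels.
    Each pixel in view1 treats its counterpart in view2 as the positive
    and every other pixel in view2 as a negative.
    """
    sim = view1 @ view2.T / tau            # (P, P) pairwise similarities
    sim -= sim.max(axis=1, keepdims=True)  # numerical stability for the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal (matched pixel pairs)
    return -np.mean(np.diag(log_prob))
```

Correctly matched views score a lower loss than mismatched ones, which is what drives the pixel embeddings toward view-invariant, spatially discriminative features.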
arXiv Detail & Related papers (2020-11-18T08:42:32Z) - GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z) - Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders [14.634046503477979]
We present a novel approach called Transformer Reasoning and Alignment Network (TERAN).
TERAN enforces a fine-grained match between the underlying components of images and sentences.
On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% respectively on the image and the sentence retrieval tasks.
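A fine-grained match between the components of images and sentences is commonly scored by aligning each word to its best region. The sketch below shows one such max-over-regions scheme under assumed normalized embeddings; it is a generic illustration of the idea, not necessarily TERAN's exact formulation:

```python
import numpy as np

def fine_grained_similarity(regions, words):
    """Illustrative region-word alignment score for image-sentence matching.

    regions: (R, D) L2-normalized image-region embeddings
    words:   (W, D) L2-normalized word embeddings
    Each word is aligned to its best-matching region; the sentence-level
    score averages these word-level maxima.
    """
    sim = words @ regions.T        # (W, R) cosine similarities
    return sim.max(axis=1).mean()  # max over regions, mean over words
```

Ranking all images by this score for a given sentence (and vice versa) yields the retrieval lists evaluated by Recall@K and rSum in the papers above.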
arXiv Detail & Related papers (2020-08-12T11:02:40Z) - ExchNet: A Unified Hashing Network for Large-Scale Fine-Grained Image Retrieval [43.41089241581596]
We study the novel fine-grained hashing topic to generate compact binary codes for fine-grained images.
We propose a unified end-to-end trainable network, termed as ExchNet.
Our proposal consistently outperforms state-of-the-art generic hashing methods on five fine-grained datasets.
arXiv Detail & Related papers (2020-08-04T07:01:32Z) - Learning to Predict Context-adaptive Convolution for Semantic Segmentation [66.27139797427147]
Long-range contextual information is essential for achieving high-performance semantic segmentation.
We propose a Context-adaptive Convolution Network (CaC-Net) to predict a spatially-varying feature weighting vector.
Our CaC-Net achieves superior segmentation performance on three public datasets.
arXiv Detail & Related papers (2020-04-17T13:09:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences of its use.