SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection
- URL: http://arxiv.org/abs/2412.02565v2
- Date: Fri, 06 Dec 2024 07:08:56 GMT
- Title: SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection
- Authors: Joongwon Chae, Zhenyu Wang, Peiwu Qin
- Abstract summary: This paper introduces SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection. It presents an approach for integrating segmentation techniques with vision-language models through spatial inference in multimodal space. We demonstrate superior performance across benchmark datasets, achieving IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC.
- Score: 4.930667479611019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite significant advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge in modern artificial intelligence systems. Existing vision-language models, which primarily rely on backbone architectures or CLIP-based embedding learning, demonstrate inherent limitations in fine-grained spatial localization and operational capabilities. This paper introduces SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection, a framework that leverages spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification through natural language instructions. The framework presents an approach for integrating segmentation techniques with vision-language models through spatial inference in multimodal space. By utilizing normalized coordinate detection for bounding boxes and transforming them into actionable segmentation outputs, we establish a connection between spatial and language representations in multimodal architectures. Experimental results demonstrate superior performance across benchmark datasets, achieving IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC. Testing on a single NVIDIA RTX 3090 GPU with 512x512 resolution images yields an average inference time of 7 seconds per image, demonstrating the framework's effectiveness in both accuracy and practical deployability. The project code is available at https://github.com/jw-chae/SJTU
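As a rough illustration of the coordinate-to-mask pipeline the abstract describes, the sketch below parses a normalized bounding box from a model's text output, rescales it to pixel space, and hands it to a box-prompted segmenter. The bracketed output format, the `segment_with_box` callback, and every name here are illustrative assumptions, not the released SJTU code.

```python
import re
from typing import Callable, Tuple

import numpy as np


def parse_normalized_box(text: str) -> Tuple[float, float, float, float]:
    """Extract the first [x1, y1, x2, y2] box with values in [0, 1] from model text.

    The bracketed format is an assumption for illustration; the actual SJTU
    output format may differ.
    """
    match = re.search(
        r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]", text
    )
    if match is None:
        raise ValueError("no bounding box found in model output")
    x1, y1, x2, y2 = (float(g) for g in match.groups())
    return x1, y1, x2, y2


def box_to_mask(
    text: str,
    image_hw: Tuple[int, int],
    segment_with_box: Callable[[np.ndarray], np.ndarray],
) -> np.ndarray:
    """Turn a normalized box in `text` into a binary mask via a box-prompted segmenter.

    `segment_with_box` stands in for any backend that accepts a pixel-space
    [x1, y1, x2, y2] box and returns an HxW boolean mask (e.g., a SAM-style model).
    """
    h, w = image_hw
    x1, y1, x2, y2 = parse_normalized_box(text)
    pixel_box = np.array([x1 * w, y1 * h, x2 * w, y2 * h], dtype=np.float32)
    return segment_with_box(pixel_box)


if __name__ == "__main__":
    # Dummy segmenter that simply fills the box region, to keep the sketch runnable.
    def dummy_segmenter(box: np.ndarray) -> np.ndarray:
        mask = np.zeros((512, 512), dtype=bool)
        x1, y1, x2, y2 = box.astype(int)
        mask[y1:y2, x1:x2] = True
        return mask

    mask = box_to_mask("the dog is at [0.10, 0.25, 0.60, 0.90]", (512, 512), dummy_segmenter)
    print(mask.sum())
```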
Related papers
- Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction [4.692621855184482]
Single-Domain Generalized Object Detection (S-DGOD) aims to train an object detector on a single source domain so that it generalizes to unseen target domains.
Recent S-DGOD approaches exploit pre-trained vision-language knowledge to guide invariant feature learning across visual domains.
We propose a new cross-modal feature learning method, which can capture generalized and discriminative regional features for S-DGOD tasks.
arXiv Detail & Related papers (2025-04-27T02:55:54Z) - Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications [0.0]
We propose a novel Context-Aware Semantic Segmentation framework that integrates Large Language Models (LLMs) with state-of-the-art vision backbones.
A Cross-Attention Mechanism is introduced to align vision and language features, enabling the model to reason about context more effectively.
This work bridges the gap between vision and language, paving the way for more intelligent and context-aware vision systems in applications including autonomous driving, medical imaging, and robotics.
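A minimal sketch of what such a vision-language cross-attention alignment can look like follows; the dimensions, single-layer design, and class name are illustrative assumptions rather than the architecture of the cited paper.

```python
import torch
import torch.nn as nn


class VisionLanguageCrossAttention(nn.Module):
    """Patch features attend to language tokens (illustrative single block)."""

    def __init__(self, vision_dim: int = 256, text_dim: int = 768, heads: int = 8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, vision_dim)
        self.attn = nn.MultiheadAttention(vision_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(vision_dim)

    def forward(self, vision_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (B, N_patches, vision_dim); text_tokens: (B, N_words, text_dim)
        text = self.text_proj(text_tokens)
        attended, _ = self.attn(query=vision_tokens, key=text, value=text)
        return self.norm(vision_tokens + attended)  # residual keeps the visual stream intact


# Usage: fuse 32x32 patch features with a 12-token caption embedding.
fused = VisionLanguageCrossAttention()(torch.randn(2, 1024, 256), torch.randn(2, 12, 768))
print(fused.shape)  # torch.Size([2, 1024, 256])
```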
arXiv Detail & Related papers (2025-03-25T02:12:35Z) - Optimized Unet with Attention Mechanism for Multi-Scale Semantic Segmentation [8.443350618722564]
This paper proposes an improved Unet model combined with an attention mechanism.
It introduces channel attention and spatial attention modules, enhancing the model's ability to focus on important features.
The improved model performs well in terms of mIoU and pixel accuracy (PA), reaching 76.5% and 95.3% respectively.
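For reference, here is a minimal CBAM-style sketch of channel and spatial attention modules of the kind described above; the exact design in the cited paper may differ, and all names and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Reweight channels by globally pooled statistics (squeeze-and-excitation style)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.mlp(x)  # per-channel gates in [0, 1]


class SpatialAttention(nn.Module):
    """Emphasize informative locations from pooled channel maps (CBAM-style sketch)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.amax(dim=1, keepdim=True)
        weights = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * weights  # per-pixel gates in [0, 1]


# Usage: apply both modules to a feature map from a U-Net encoder stage.
feats = torch.randn(1, 64, 128, 128)
out = SpatialAttention()(ChannelAttention(64)(feats))
print(out.shape)  # torch.Size([1, 64, 128, 128])
```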
arXiv Detail & Related papers (2025-02-06T06:51:23Z) - ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships.
Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands.
We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z) - Multi-Level Embedding and Alignment Network with Consistency and Invariance Learning for Cross-View Geo-Localization [2.733505168507872]
Cross-View Geo-Localization (CVGL) involves determining the localization of drone images by retrieving the most similar GPS-tagged satellite images.
Existing methods often overlook the problem of increased computational and storage requirements when improving model performance.
We propose a lightweight enhanced alignment network, called the Multi-Level Embedding and Alignment Network (MEAN).
arXiv Detail & Related papers (2024-12-19T13:10:38Z) - GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse Function to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric encapsulates two forms of terms: diagonal and block-diagonal.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
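As a rough sketch of a distance combining diagonal and block-diagonal terms, in the spirit of the description above, the snippet below parameterizes d(x, y) = (x - y)^T (D + B) (x - y) with a learnable diagonal D and block-diagonal B; the block size, parameterization, and class name are assumptions, not the GSSF formulation itself.

```python
import torch
import torch.nn as nn


class StructuredSparseDistance(nn.Module):
    """Distance with learnable diagonal and block-diagonal terms (illustrative sketch)."""

    def __init__(self, dim: int = 256, block: int = 16):
        super().__init__()
        assert dim % block == 0
        self.block = block
        self.diag = nn.Parameter(torch.ones(dim))                            # D: diagonal weights
        self.blocks = nn.Parameter(torch.zeros(dim // block, block, block))  # B: one matrix per block

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        d = x - y                                           # (batch, dim) embedding difference
        diag_term = (d * d * self.diag).sum(dim=-1)
        db = d.view(d.shape[0], -1, self.block)             # (batch, n_blocks, block)
        sym = 0.5 * (self.blocks + self.blocks.transpose(-1, -2))  # symmetrize each block
        block_term = torch.einsum("bni,nij,bnj->b", db, sym, db)
        return diag_term + block_term


# Usage: score a batch of paired image/text embeddings.
dist = StructuredSparseDistance()(torch.randn(4, 256), torch.randn(4, 256))
print(dist.shape)  # torch.Size([4])
```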
arXiv Detail & Related papers (2024-10-20T03:45:50Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [9.109484087832058]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression.
To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM).
To further foster research on RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities.
With parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z) - Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z) - Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching [60.645802236700035]
Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets.
We introduce GeoText-1652, a new natural language-guided geo-localization benchmark.
This dataset is systematically constructed through an interactive human-computer process.
arXiv Detail & Related papers (2023-11-21T17:52:30Z) - All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment [39.54689489555342]
The current vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model.
We propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone.
arXiv Detail & Related papers (2023-07-07T03:51:21Z) - Framework-agnostic Semantically-aware Global Reasoning for Segmentation [29.69187816377079]
We propose a component that learns to project image features into latent representations and reason between them.
Our design encourages the latent regions to represent semantic concepts by ensuring that the activated regions are spatially disjoint.
Our latent tokens are semantically interpretable and diverse and provide a rich set of features that can be transferred to downstream tasks.
arXiv Detail & Related papers (2022-12-06T21:42:05Z) - Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension [21.000045864213327]
Referring expression comprehension (REC) generally requires a large amount of multi-grained information of visual and linguistic modalities to realize accurate reasoning.
How to aggregate multi-grained information from different modalities and extract abundant knowledge from hard examples is crucial in the REC task.
We propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves the language-to-vision localization ability.
arXiv Detail & Related papers (2022-04-21T08:32:47Z) - Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z) - A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [59.21990697929617]
Humans tend to mine objects by learning from a group of images or several frames of video since we live in a dynamic world.
Previous approaches design different networks on similar tasks separately, and they are difficult to apply to each other.
We introduce a unified framework, termed UFO (Unified Object Framework), to tackle these issues.
arXiv Detail & Related papers (2022-03-09T13:35:19Z) - BURT: BERT-inspired Universal Representation from Learning Meaningful Segment [46.51685959045527]
This work introduces and explores universal representation learning, i.e., embedding different levels of linguistic units in a uniform vector space.
We present a universal representation model, BURT, to encode different levels of linguistic unit into the same vector space.
Specifically, we extract and mask meaningful segments based on point-wise mutual information (PMI) to incorporate different granular objectives into the pre-training stage.
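A toy sketch of PMI-based segment extraction over bigrams, to make the idea concrete; the real BURT pipeline's segment extraction and masking are not reproduced here, and the function name and threshold are illustrative.

```python
import math
from collections import Counter
from typing import List, Tuple


def pmi_segments(tokens: List[str], threshold: float = 3.0) -> List[Tuple[str, str]]:
    """Return adjacent token pairs whose point-wise mutual information exceeds a threshold."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    segments = []
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi >= threshold:
            segments.append((w1, w2))  # candidate "meaningful segment" to mask in pre-training
    return segments


corpus = "new york is a city in new york state".split()
# Includes ('new', 'york'); on a corpus this small most pairs pass, real corpora are more selective.
print(pmi_segments(corpus, threshold=1.5))
```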
arXiv Detail & Related papers (2020-12-28T16:02:28Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims to segment the foreground masks of the entities that match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.