Multi-Modal Prototypes for Open-World Semantic Segmentation
- URL: http://arxiv.org/abs/2307.02003v3
- Date: Thu, 11 Jul 2024 05:14:55 GMT
- Title: Multi-Modal Prototypes for Open-World Semantic Segmentation
- Authors: Yuhuan Yang, Chaofan Ma, Chen Ju, Fei Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang
- Abstract summary: We propose to encompass textual and visual clues as multi-modal prototypes to allow more comprehensive support for semantic segmentation.
We decompose the high-level language information into multi-aspect prototypes and aggregate the low-level visual information into more semantic prototypes.
Based on an elastic mask prediction module, we are able to solve the zero-shot, few-shot and generalized counterpart tasks in one architecture.
- Score: 37.84805778548119
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In semantic segmentation, generalizing a visual system to both seen categories and novel categories at inference time has always been practically valuable yet challenging. To enable such functionality, existing methods mainly rely on either providing several support demonstrations from the visual aspect or characterizing the informative clues from the textual aspect (e.g., the class names). Nevertheless, both lines neglect the complementary nature of low-level visual and high-level language information, and explorations that consider visual and textual modalities as a whole to promote predictions remain limited. To close this gap, we propose to encompass textual and visual clues as multi-modal prototypes to allow more comprehensive support for open-world semantic segmentation, and build a novel prototype-based segmentation framework to realize this promise. To be specific, unlike the straightforward combination of bi-modal clues, we decompose the high-level language information into multi-aspect prototypes and aggregate the low-level visual information into more semantic prototypes, on the basis of which a fine-grained complementary fusion makes the multi-modal prototypes more powerful and accurate for prediction. Based on an elastic mask prediction module that permits any number and form of prototype inputs, we are able to solve the zero-shot, few-shot and generalized counterpart tasks in one architecture. Extensive experiments on both the PASCAL-$5^i$ and COCO-$20^i$ datasets show the consistent superiority of the proposed method over previous state-of-the-art approaches, and a range of ablation studies dissects each component of our framework, both quantitatively and qualitatively, to verify its effectiveness.
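A minimal sketch of the prototype-matching idea described in the abstract is given below: a visual prototype is pooled from masked support features, combined with text-derived prototypes, and every query pixel is scored against the result. This is not the authors' implementation; the averaging fusion, the tensor shapes, and all function names are illustrative assumptions.

```python
# Sketch (not the paper's code): prototype-based mask prediction from
# text prototypes plus a masked-pooled visual prototype.
import torch
import torch.nn.functional as F

def visual_prototype(support_feat, support_mask):
    """Masked average pooling. support_feat: (C, H, W), support_mask: (H, W) -> (C,)."""
    mask = support_mask.float().flatten()                      # (H*W,)
    feat = support_feat.flatten(1)                             # (C, H*W)
    return (feat * mask).sum(-1) / mask.sum().clamp(min=1.0)

def predict_mask(query_feat, prototypes, temperature=0.1):
    """Cosine similarity between each query pixel and each prototype.
    query_feat: (C, H, W), prototypes: (K, C) -> logits (K, H, W)."""
    q = F.normalize(query_feat.flatten(1), dim=0)              # per-pixel unit norm
    p = F.normalize(prototypes, dim=1)
    logits = (p @ q) / temperature                             # (K, H*W)
    return logits.view(len(prototypes), *query_feat.shape[1:])

# Toy example with made-up shapes: three multi-aspect text prototypes plus
# one visual prototype, fused here by simple averaging for illustration.
C, H, W = 64, 32, 32
query_feat = torch.randn(C, H, W)
support_feat, support_mask = torch.randn(C, H, W), torch.randint(0, 2, (H, W))
text_protos = torch.randn(3, C)                                # illustrative text prototypes
vis_proto = visual_prototype(support_feat, support_mask)
fused = torch.cat([text_protos, vis_proto[None]], dim=0).mean(0, keepdim=True)
mask_logits = predict_mask(query_feat, fused)                  # (1, H, W)
```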
Related papers
- Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment [66.80402022104074]
We propose a Text Refinement and Alignment (TRA) framework that effectively utilizes semantically rich textual features from visual descriptions to complement the visual features.
This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA).
arXiv Detail & Related papers (2026-02-01T14:35:46Z) - Enhancing Visual In-Context Learning by Multi-Faceted Fusion [6.852150407828682]
We introduce a novel framework that moves beyond single-prompt fusion towards a multi-combination collaborative fusion.
Our method generates three contextual representation branches, each formed by integrating information from different combinations of top-quality prompts.
Experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight its strong cross-task generalization.
arXiv Detail & Related papers (2026-01-15T06:25:09Z) - Point Cloud Quantization through Multimodal Prompting for 3D Understanding [20.397232104616574]
We propose a simple multimodal prompting-driven quantization framework for point cloud analysis.
Our methodology is built upon two core insights: 1) Text embeddings from pre-trained models inherently encode visual semantics.
We employ Gumbel-Softmax relaxation to achieve differentiable discretization while maintaining quantization sparsity.
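As a rough illustration of the Gumbel-Softmax relaxation named in the summary, the sketch below quantizes per-point features against a learnable codebook while keeping the assignment differentiable. Codebook size, feature dimensions, and function names are assumptions, not details from the paper.

```python
# Illustrative Gumbel-Softmax codebook quantization (all sizes made up).
import torch
import torch.nn.functional as F

def gumbel_quantize(features, codebook, tau=1.0, hard=True):
    """features: (N, C), codebook: (K, C) -> quantized features (N, C).
    hard=True returns one-hot assignments but keeps gradients via the soft path."""
    logits = features @ codebook.t()                        # (N, K) similarity logits
    assign = F.gumbel_softmax(logits, tau=tau, hard=hard)   # (N, K) relaxed assignment
    return assign @ codebook                                # (N, C) selected codewords

points_feat = torch.randn(1024, 256, requires_grad=True)   # per-point features (illustrative)
codebook = torch.randn(128, 256, requires_grad=True)        # learnable codewords
quantized = gumbel_quantize(points_feat, codebook)
quantized.sum().backward()                                   # gradients reach features and codebook
```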
arXiv Detail & Related papers (2025-11-15T07:51:10Z) - Divide, Conquer and Unite: Hierarchical Style-Recalibrated Prototype Alignment for Federated Medical Image Segmentation [66.82598255715696]
Federated learning enables multiple medical institutions to train a global model without sharing data.
Current approaches primarily focus on final-layer features, overlooking critical multi-level cues.
We propose FedBCS to bridge feature representation gaps via domain-invariant contextual prototype alignment.
arXiv Detail & Related papers (2025-11-14T04:15:34Z) - Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline [56.790045049514326]
Two major forms of deception dominate: human-crafted misinformation and AI-generated content.
We propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception.
UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines.
arXiv Detail & Related papers (2025-09-30T09:26:32Z) - Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding [11.244257545057508]
Prototype-Aware Multimodal Learning (PAML) is an innovative framework that addresses imperfect alignment between visual and linguistic modalities, insufficient cross-modal feature fusion, and ineffective utilization of semantic prototype information.
Our framework shows competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes.
arXiv Detail & Related papers (2025-09-08T02:27:10Z) - KptLLM++: Towards Generic Keypoint Comprehension with Large Language Model [31.59640895434506]
Keypoints, as structure-aware, pixel-level, and compact representations of objects, play a crucial role in applications such as fine-grained image analysis, object retrieval, and behavior recognition.
In this paper, we propose KptLLM++, a novel multimodal large language model that is specifically designed for generic keypoint comprehension.
By unifying keypoint detection across varied contexts, KptLLM++ establishes itself as an advanced interface, fostering more effective human-AI collaboration.
arXiv Detail & Related papers (2025-07-15T08:52:28Z) - Dynamic Multimodal Prototype Learning in Vision-Language Models [44.84161970425967]
We introduce ProtoMM, a training-free framework that constructs multimodal prototypes to adapt vision-language models at test time.
By viewing the prototype as a discrete distribution over the textual descriptions and visual particles, ProtoMM has the ability to combine the multimodal features for comprehensive prototype learning.
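The summary's notion of a prototype as a discrete distribution over textual descriptions and visual particles can be sketched loosely as a weighted mixture of embeddings; the code below is not the ProtoMM implementation, and the mixture weights, shapes, and names are illustrative.

```python
# Loose sketch: a class prototype as a weighted mixture over text-description
# embeddings and visual "particles", then cosine-similarity scoring at test time.
import torch
import torch.nn.functional as F

def mixture_prototype(text_embeds, visual_particles, weights):
    """text_embeds: (T, C), visual_particles: (V, C), weights: (T + V,) summing to 1 -> (C,)."""
    elements = torch.cat([text_embeds, visual_particles], dim=0)   # (T+V, C)
    return weights @ F.normalize(elements, dim=-1)

C = 512
text_embeds = F.normalize(torch.randn(4, C), dim=-1)       # e.g. embeddings of 4 class descriptions
visual_particles = F.normalize(torch.randn(8, C), dim=-1)  # features of 8 test-time crops (illustrative)
weights = torch.softmax(torch.randn(12), dim=0)            # mixture weights over all elements
proto = mixture_prototype(text_embeds, visual_particles, weights)

# Score a test image feature against the prototype.
image_feat = F.normalize(torch.randn(C), dim=-1)
score = F.cosine_similarity(image_feat, proto, dim=0)
```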
arXiv Detail & Related papers (2025-07-04T15:31:47Z) - Cross-Modal Prototype Allocation: Unsupervised Slide Representation Learning via Patch-Text Contrast in Computational Pathology [10.811667603360041]
ProAlign is a cross-modal unsupervised slide representation learning framework.
We leverage a large language model (LLM) to generate descriptive text for the prototype types present in a whole slide image.
We propose a parameter-free attention aggregation strategy that utilizes the similarity between patches and these prototypes to form unsupervised slide embeddings.
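A minimal sketch of the parameter-free attention aggregation described above, assuming patch-prototype cosine similarities act as attention weights with no learned parameters (sizes and names are illustrative):

```python
# Parameter-free attention pooling: prototype-patch similarities weight the patches.
import torch
import torch.nn.functional as F

def slide_embedding(patch_feats, prototypes, temperature=0.07):
    """patch_feats: (N, C) patch embeddings, prototypes: (K, C) -> (K, C) pooled embeddings."""
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(prototypes, dim=-1).t()  # (N, K)
    attn = torch.softmax(sim.t() / temperature, dim=-1)        # (K, N) weights over patches
    return attn @ patch_feats                                   # (K, C)

patch_feats = torch.randn(5000, 768)   # patch features of one whole slide (illustrative)
prototypes = torch.randn(16, 768)      # cross-modal prototypes (illustrative)
slide_repr = slide_embedding(patch_feats, prototypes).flatten()  # concatenated slide descriptor
```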
arXiv Detail & Related papers (2025-03-26T03:31:07Z) - The Multi-Faceted Monosemanticity in Multimodal Representations [42.64636740703632]
We leverage recent advancements in feature monosemanticity to extract interpretable features from deep multimodal models.
Our findings reveal that this categorization aligns closely with human cognitive understandings of different modalities.
These results indicate that large-scale multimodal models, equipped with task-agnostic interpretability tools, offer valuable insights into key connections and distinctions between different modalities.
arXiv Detail & Related papers (2025-02-16T14:51:07Z) - Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples.
Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs).
For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z) - One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category.
We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
arXiv Detail & Related papers (2023-11-29T16:23:06Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
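The abstract does not spell out the IMQ design, but the general pattern of learnable queries aggregating global contextual cues within one modality can be sketched with standard cross-attention, as below; the dimensions and the module name are assumptions.

```python
# Generic illustration (not the paper's IMQ module): learnable query tokens
# attend over one modality's tokens to pool global context.
import torch
import torch.nn as nn

class ContextQuery(nn.Module):
    def __init__(self, dim=256, num_queries=4, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):
        """tokens: (B, N, dim) features of one modality -> (B, num_queries, dim)."""
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out

image_tokens = torch.randn(2, 196, 256)   # batch of image patch tokens (illustrative)
pooled = ContextQuery()(image_tokens)     # (2, 4, 256) aggregated contextual cues
```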
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - LPN: Language-guided Prototypical Network for few-shot classification [16.37959398470535]
Few-shot classification aims to adapt to new tasks with limited labeled examples.
Recent methods explore suitable measures for the similarity between the query and support images.
We propose a Language-guided Prototypical Network (LPN) for few-shot classification.
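A compact sketch of the general language-guided prototype idea (not the LPN architecture): visual prototypes from the support set are blended with class-name text embeddings before nearest-prototype classification. The blend weight, shapes, and names are assumed.

```python
# Prototypical classification with prototypes blended from support features
# and class-name text embeddings (all details illustrative).
import torch

def classify(query_feats, support_feats, support_labels, text_embeds, alpha=0.5):
    """query_feats: (Q, C), support_feats: (S, C), support_labels: (S,) in [0, K),
    text_embeds: (K, C) class-name embeddings -> (Q, K) logits."""
    K = text_embeds.size(0)
    visual_protos = torch.stack(
        [support_feats[support_labels == k].mean(0) for k in range(K)])   # (K, C)
    protos = alpha * visual_protos + (1 - alpha) * text_embeds            # language-guided prototypes
    return -torch.cdist(query_feats, protos)                              # negative distance as logits

C, K, shots = 64, 5, 1
support_feats = torch.randn(K * shots, C)
support_labels = torch.arange(K).repeat_interleave(shots)
text_embeds = torch.randn(K, C)        # e.g. class names encoded by a text model (illustrative)
logits = classify(torch.randn(10, C), support_feats, support_labels, text_embeds)
```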
arXiv Detail & Related papers (2023-07-04T06:54:01Z) - Semantic Prompt for Few-Shot Image Recognition [76.68959583129335]
We propose a novel Semantic Prompt (SP) approach for few-shot learning.
The proposed approach achieves promising results, improving the 1-shot learning accuracy by 3.67% on average.
arXiv Detail & Related papers (2023-03-24T16:32:19Z) - Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition [40.329190454146996]
MultimOdal PRototype-ENhanced Network (MORN) uses semantic information of label texts as multimodal information to enhance prototypes.
We conduct extensive experiments on four popular few-shot action recognition datasets.
arXiv Detail & Related papers (2022-12-09T14:24:39Z) - TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained in an end-to-end manner, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z) - Part-aware Prototype Network for Few-shot Semantic Segmentation [50.581647306020095]
We propose a novel few-shot semantic segmentation framework based on the prototype representation.
Our key idea is to decompose the holistic class representation into a set of part-aware prototypes.
We develop a novel graph neural network model to generate and enhance the proposed part-aware prototypes.
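The part-aware decomposition can be illustrated without the paper's graph neural network: the sketch below simply clusters masked support features into part prototypes with k-means and scores query pixels by their best-matching part. All details here are assumptions for illustration.

```python
# Decompose a class into part prototypes by clustering foreground support
# features, then score query pixels by their best-matching part.
import torch
import torch.nn.functional as F

def part_prototypes(fg_feats, num_parts=4, iters=10):
    """fg_feats: (N, C) foreground support features -> (num_parts, C) via naive k-means."""
    centers = fg_feats[torch.randperm(len(fg_feats))[:num_parts]].clone()
    for _ in range(iters):
        assign = torch.cdist(fg_feats, centers).argmin(dim=1)   # nearest part per feature
        for k in range(num_parts):
            members = fg_feats[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(0)
    return centers

def part_aware_score(query_feats, parts):
    """query_feats: (M, C) pixels, parts: (K, C) -> (M,) max similarity over parts."""
    sim = F.normalize(query_feats, dim=-1) @ F.normalize(parts, dim=-1).t()
    return sim.max(dim=1).values

fg_feats = torch.randn(300, 64)        # masked support pixels of one class (illustrative)
parts = part_prototypes(fg_feats)
scores = part_aware_score(torch.randn(1024, 64), parts)   # per-pixel foreground evidence
```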
arXiv Detail & Related papers (2020-07-13T11:03:09Z)