Artificial-Spiking Hierarchical Networks for Vision-Language
Representation Learning
- URL: http://arxiv.org/abs/2308.09455v1
- Date: Fri, 18 Aug 2023 10:40:25 GMT
- Title: Artificial-Spiking Hierarchical Networks for Vision-Language
Representation Learning
- Authors: Yeming Chen, Siyu Zhang, Yaoru Sun, Weijian Liang, Haoran Wang
- Abstract summary: State-of-the-art methods achieve impressive performance by pre-training on large-scale datasets.
We propose an efficient framework for multimodal alignment by introducing a novel visual semantic module.
Experiments show that the proposed ASH-Nets achieve competitive results.
- Score: 16.902924543372713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the success of self-supervised learning, multimodal foundation
models have rapidly adapted to a wide range of downstream tasks driven by
vision and language (VL) pretraining. State-of-the-art methods achieve
impressive performance by pre-training on large-scale datasets. However,
bridging the semantic gap between the two modalities remains a non-negligible
challenge for VL tasks. In this work, we propose an efficient computation
framework for
multimodal alignment by introducing a novel visual semantic module to further
improve the performance of the VL tasks. Specifically, we propose a flexible
model, namely Artificial-Spiking Hierarchical Networks (ASH-Nets), which
combines the complementary advantages of Artificial neural networks (ANNs) and
Spiking neural networks (SNNs) to enrich visual semantic representations. In
particular, a visual concrete encoder and a semantic abstract encoder are
constructed to learn continuous and discrete latent variables to enhance the
flexibility of semantic encoding. Considering the spatio-temporal properties of
SNN modeling, we introduce a contrastive learning method to optimize the
inputs of similar samples. This can improve the computational efficiency of the
hierarchical network, while the augmentation of hard samples is beneficial to
the learning of visual representations. Furthermore, the Spiking to Text
Uni-Alignment Learning (STUA) pre-training method is proposed, which only
relies on text features to enhance the encoding ability of abstract semantics.
We validate the performance on multiple well-established downstream VL tasks.
Experiments show that the proposed ASH-Nets achieve competitive results.
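The STUA pre-training idea above — aligning spike-derived abstract semantic features to text features alone with a contrastive objective — can be illustrated with a minimal sketch. This is not the authors' code: the toy feature vectors, the batch layout, and the InfoNCE-style loss with a temperature are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def uni_alignment_loss(spike_feats, text_feats, tau=0.07):
    """InfoNCE-style uni-alignment loss (hypothetical sketch):
    each spike-derived feature should be closest to its paired
    text feature among all text features in the batch."""
    n = len(spike_feats)
    loss = 0.0
    for i in range(n):
        logits = [cosine(spike_feats[i], t) / tau for t in text_feats]
        m = max(logits)  # subtract max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_sum)  # -log softmax of the matched pair
    return loss / n

# Toy batch: two (spike-feature, text-feature) pairs.
spikes = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
texts  = [[0.9, 0.1, 0.1], [0.1, 0.8, 0.0]]
print(uni_alignment_loss(spikes, texts))
```

Because only the text side supplies the alignment targets, the loss pulls each abstract semantic encoding toward its paired caption and pushes it away from the other captions in the batch, matching the text-only supervision described for STUA.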
Related papers
- Improving vision-language alignment with graph spiking hybrid Networks [6.707524980629404]
This paper proposes a comprehensive visual semantic representation module that uses panoptic segmentation to generate fine-grained semantic features.
We propose a novel Graph Spiking Hybrid Network (GSHN) that integrates the complementary advantages of Spiking Neural Networks (SNNs) and Graph Attention Networks (GATs) to encode visual semantic information.
arXiv Detail & Related papers (2025-01-31T11:55:17Z) - Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision? [62.12375949429938]
Building transferable Graph Neural Networks (GNNs) with CLIP pipeline is challenging because of three fundamental issues.
We leverage multi-modal prompt learning to effectively adapt pre-trained GNN to downstream tasks and data.
Our new paradigm embeds the graphs directly in the same space as the Large Language Models (LLMs) by learning both graph prompts and text prompts simultaneously.
arXiv Detail & Related papers (2024-12-11T08:03:35Z) - MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding [6.538592344967826]
We introduce MUSE-VL, a Unified Vision-Language Model through Semantic Discrete Encoding, for multimodal understanding and generation.
The proposed model significantly surpasses the previous state-of-the-art in various vision-language benchmarks and achieves better performance than dedicated understanding models.
arXiv Detail & Related papers (2024-11-26T03:33:52Z) - Convergence Analysis for Deep Sparse Coding via Convolutional Neural Networks [7.956678963695681]
We explore intersections between sparse coding and deep learning to enhance our understanding of feature extraction capabilities.
We derive convergence rates for convolutional neural networks (CNNs) in their ability to extract sparse features.
Inspired by the strong connection between sparse coding and CNNs, we explore training strategies to encourage neural networks to learn more sparse features.
arXiv Detail & Related papers (2024-08-10T12:43:55Z) - Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z) - AMOSL: Adaptive Modality-wise Structure Learning in Multi-view Graph Neural Networks For Enhanced Unified Representation [22.84527318463151]
Multi-view Graph Neural Networks (MVGNNs) excel at leveraging diverse modalities for learning object representation.
Existing methods assume identical local topology structures across modalities that overlook real-world discrepancies.
We propose adaptive modality-wise structure learning (AMoSL) to address these issues.
arXiv Detail & Related papers (2024-06-04T14:24:30Z) - Continual Learning: Forget-free Winning Subnetworks for Video Representations [75.40220771931132]
A Winning Subnetwork (WSN), selected in terms of task performance, is considered for various continual learning tasks.
It leverages pre-existing weights from dense networks to achieve efficient learning in Task Incremental Learning (TIL) and Task-agnostic Incremental Learning (TaIL) scenarios.
The use of Fourier Subneural Operator (FSO) within WSN is considered for Video Incremental Learning (VIL).
arXiv Detail & Related papers (2023-12-19T09:11:49Z) - SpikeCLIP: A Contrastive Language-Image Pretrained Spiking Neural Network [39.54624592783459]
Spiking Neural Networks (SNNs) have emerged as a promising alternative to conventional Artificial Neural Networks (ANNs).
This paper presents SpikeCLIP, a novel framework designed to bridge the modality gap in spike-based computation.
arXiv Detail & Related papers (2023-10-10T09:57:17Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Learning to Relate Depth and Semantics for Unsupervised Domain
Adaptation [87.1188556802942]
We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting.
We propose a novel Cross-Task Relation Layer (CTRL), which encodes task dependencies between the semantic and depth predictions.
Furthermore, we propose an Iterative Self-Learning (ISL) training scheme, which exploits semantic pseudo-labels to provide extra supervision on the target domain.
arXiv Detail & Related papers (2021-05-17T13:42:09Z) - Dynamic Hierarchical Mimicking Towards Consistent Optimization
Objectives [73.15276998621582]
We propose a generic feature learning mechanism to advance CNN training with enhanced generalization ability.
Partially inspired by DSN, we fork delicately designed side branches from the intermediate layers of a given neural network.
Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method.
arXiv Detail & Related papers (2020-03-24T09:56:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.