Artificial-Spiking Hierarchical Networks for Vision-Language
Representation Learning
- URL: http://arxiv.org/abs/2308.09455v1
- Date: Fri, 18 Aug 2023 10:40:25 GMT
- Title: Artificial-Spiking Hierarchical Networks for Vision-Language
Representation Learning
- Authors: Yeming Chen, Siyu Zhang, Yaoru Sun, Weijian Liang, Haoran Wang
- Abstract summary: State-of-the-art methods achieve impressive performance by pre-training on large-scale datasets.
We propose an efficient framework for multimodal alignment by introducing a novel visual semantic module.
Experiments show that the proposed ASH-Nets achieve competitive results.
- Score: 16.902924543372713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the success of self-supervised learning, multimodal foundation
models have rapidly been adapted to a wide range of downstream tasks driven by
vision and language (VL) pretraining. State-of-the-art methods achieve
impressive performance by pre-training on large-scale datasets. However,
bridging the semantic gap between the two modalities remains a non-negligible
challenge for VL tasks. In this work, we propose a computationally efficient
framework for multimodal alignment by introducing a novel visual semantic
module to further improve performance on VL tasks. Specifically, we propose a flexible
model, namely Artificial-Spiking Hierarchical Networks (ASH-Nets), which
combines the complementary advantages of artificial neural networks (ANNs) and
spiking neural networks (SNNs) to enrich visual semantic representations. In
particular, a visual concrete encoder and a semantic abstract encoder are
constructed to learn continuous and discrete latent variables to enhance the
flexibility of semantic encoding. Considering the spatio-temporal properties of
SNN modeling, we introduce a contrastive learning method to optimize the
inputs of similar samples. This can improve the computational efficiency of the
hierarchical network, while the augmentation of hard samples is beneficial to
the learning of visual representations. Furthermore, the Spiking to Text
Uni-Alignment Learning (STUA) pre-training method is proposed, which only
relies on text features to enhance the encoding ability of abstract semantics.
We validate the performance on multiple well-established downstream VL tasks.
Experiments show that the proposed ASH-Nets achieve competitive results.
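The alignment objectives described above (contrastive optimization over paired samples, and STUA's text-driven alignment) are in the family of InfoNCE-style contrastive losses between visual and text embeddings. As a rough, generic sketch of that family — not the paper's exact formulation; `vis` and `txt` are hypothetical pooled per-sample embeddings:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(vis, txt, temperature=0.1):
    # InfoNCE over a batch of (visual, text) pairs: each visual
    # embedding should match its paired text embedding against
    # all other text embeddings in the batch.
    n = len(vis)
    loss = 0.0
    for i in range(n):
        logits = [cosine(vis[i], t) / temperature for t in txt]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)  # -log softmax at the true pair
    return loss / n

# Toy batch: matched pairs are nearly parallel, so the loss is small.
vis = [[1.0, 0.0], [0.0, 1.0]]
txt = [[0.9, 0.1], [0.1, 0.9]]
print(info_nce(vis, txt))
```

Shuffling `txt` so pairs no longer correspond drives the loss up sharply, which is the signal such an objective uses to pull matched visual and text representations together.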
Related papers
- Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z)
- AMOSL: Adaptive Modality-wise Structure Learning in Multi-view Graph Neural Networks For Enhanced Unified Representation [22.84527318463151]
Multi-view Graph Neural Networks (MVGNNs) excel at leveraging diverse modalities for learning object representation.
Existing methods assume identical local topology structures across modalities, overlooking real-world discrepancies.
We propose adaptive modality-wise structure learning (AMoSL) to address these issues.
arXiv Detail & Related papers (2024-06-04T14:24:30Z)
- Continual Learning: Forget-free Winning Subnetworks for Video Representations [75.40220771931132]
A Winning Subnetwork (WSN), selected for task performance, is considered for various continual learning tasks.
It leverages pre-existing weights from dense networks to achieve efficient learning in Task Incremental Learning (TIL) and Task-agnostic Incremental Learning (TaIL) scenarios.
The use of Fourier Subneural Operator (FSO) within WSN is considered for Video Incremental Learning (VIL).
arXiv Detail & Related papers (2023-12-19T09:11:49Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Learning to Relate Depth and Semantics for Unsupervised Domain Adaptation [87.1188556802942]
We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting.
We propose a novel Cross-Task Relation Layer (CTRL), which encodes task dependencies between the semantic and depth predictions.
Furthermore, we propose an Iterative Self-Learning (ISL) training scheme, which exploits semantic pseudo-labels to provide extra supervision on the target domain.
arXiv Detail & Related papers (2021-05-17T13:42:09Z)
- Active Learning in CNNs via Expected Improvement Maximization [2.0305676256390934]
"Dropout-based IMprOvementS" (DEIMOS) is a flexible and computationally-efficient approach to active learning.
Our results demonstrate that DEIMOS outperforms several existing baselines across multiple regression and classification tasks.
arXiv Detail & Related papers (2020-11-27T22:06:52Z)
- Adaptive Explainable Neural Networks (AxNNs) [8.949704905866888]
We develop a new framework called Adaptive Explainable Neural Networks (AxNN) for achieving the dual goals of good predictive performance and model interpretability.
For predictive performance, we build a structured neural network made up of ensembles of generalized additive model networks and additive index models.
For interpretability, we show how to decompose the results of AxNN into main effects and higher-order interaction effects.
arXiv Detail & Related papers (2020-04-05T23:40:57Z) - Dynamic Hierarchical Mimicking Towards Consistent Optimization
Objectives [73.15276998621582]
We propose a generic feature learning mechanism to advance CNN training with enhanced generalization ability.
Partially inspired by DSN, we fork delicately designed side branches from the intermediate layers of a given neural network.
Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method.
arXiv Detail & Related papers (2020-03-24T09:56:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.