Artificial-Spiking Hierarchical Networks for Vision-Language
Representation Learning
- URL: http://arxiv.org/abs/2308.09455v1
- Date: Fri, 18 Aug 2023 10:40:25 GMT
- Title: Artificial-Spiking Hierarchical Networks for Vision-Language
Representation Learning
- Authors: Yeming Chen, Siyu Zhang, Yaoru Sun, Weijian Liang, Haoran Wang
- Abstract summary: State-of-the-art methods achieve impressive performance by pre-training on large-scale datasets.
We propose an efficient framework for multimodal alignment by introducing a novel visual semantic module.
Experiments show that the proposed ASH-Nets achieve competitive results.
- Score: 16.902924543372713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the success of self-supervised learning, multimodal foundation
models have rapidly been adapted to a wide range of downstream tasks driven by
vision and language (VL) pretraining. State-of-the-art methods achieve
impressive performance by pre-training on large-scale datasets. However,
bridging the semantic gap between the two modalities remains a non-negligible
challenge for VL tasks. In this work, we propose a computationally efficient
framework for multimodal alignment by introducing a novel visual semantic
module to further improve performance on VL tasks. Specifically, we propose a flexible
model, namely Artificial-Spiking Hierarchical Networks (ASH-Nets), which
combines the complementary advantages of artificial neural networks (ANNs) and
spiking neural networks (SNNs) to enrich visual semantic representations. In
particular, a visual concrete encoder and a semantic abstract encoder are
constructed to learn continuous and discrete latent variables to enhance the
flexibility of semantic encoding. Considering the spatio-temporal properties of
SNN modeling, we introduce a contrastive learning method to optimize the
inputs of similar samples. This can improve the computational efficiency of the
hierarchical network, while the augmentation of hard samples is beneficial to
the learning of visual representations. Furthermore, the Spiking to Text
Uni-Alignment Learning (STUA) pre-training method is proposed, which only
relies on text features to enhance the encoding ability of abstract semantics.
We validate the performance on multiple well-established downstream VL tasks.
Experiments show that the proposed ASH-Nets achieve competitive results.
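The alignment objectives described above (contrastive optimization over paired samples, and STUA's text-driven alignment) are in the family of InfoNCE-style contrastive losses between visual and text embeddings. As a rough, generic sketch of that family — not the paper's exact formulation; `vis` and `txt` are hypothetical pooled per-sample embeddings:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(vis, txt, temperature=0.1):
    # InfoNCE over a batch of (visual, text) pairs: each visual
    # embedding should match its paired text embedding against
    # all other text embeddings in the batch.
    n = len(vis)
    loss = 0.0
    for i in range(n):
        logits = [cosine(vis[i], t) / temperature for t in txt]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)  # -log softmax at the true pair
    return loss / n

# Toy batch: matched pairs are nearly parallel, so the loss is small.
vis = [[1.0, 0.0], [0.0, 1.0]]
txt = [[0.9, 0.1], [0.1, 0.9]]
print(info_nce(vis, txt))
```

Shuffling `txt` so pairs no longer correspond drives the loss up sharply, which is the signal such an objective uses to pull matched visual and text representations together.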
Related papers
- Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z)
- AMOSL: Adaptive Modality-wise Structure Learning in Multi-view Graph Neural Networks For Enhanced Unified Representation [22.84527318463151]
Multi-view Graph Neural Networks (MVGNNs) excel at leveraging diverse modalities for learning object representation.
Existing methods assume identical local topology structures across modalities, overlooking real-world discrepancies.
We propose adaptive modality-wise structure learning (AMoSL) to address these issues.
arXiv Detail & Related papers (2024-06-04T14:24:30Z)
- Continual Learning: Forget-free Winning Subnetworks for Video Representations [75.40220771931132]
A Winning Subnetwork (WSN), selected for task performance, is considered for various continual learning tasks.
It leverages pre-existing weights from dense networks to achieve efficient learning in Task Incremental Learning (TIL) and Task-agnostic Incremental Learning (TaIL) scenarios.
The use of Fourier Subneural Operator (FSO) within WSN is considered for Video Incremental Learning (VIL).
arXiv Detail & Related papers (2023-12-19T09:11:49Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Learning to Relate Depth and Semantics for Unsupervised Domain Adaptation [87.1188556802942]
We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting.
We propose a novel Cross-Task Relation Layer (CTRL), which encodes task dependencies between the semantic and depth predictions.
Furthermore, we propose an Iterative Self-Learning (ISL) training scheme, which exploits semantic pseudo-labels to provide extra supervision on the target domain.
arXiv Detail & Related papers (2021-05-17T13:42:09Z)
- Active Learning in CNNs via Expected Improvement Maximization [2.0305676256390934]
"Dropout-based IMprOvementS" (DEIMOS) is a flexible and computationally-efficient approach to active learning.
Our results demonstrate that DEIMOS outperforms several existing baselines across multiple regression and classification tasks.
arXiv Detail & Related papers (2020-11-27T22:06:52Z)
- Adaptive Explainable Neural Networks (AxNNs) [8.949704905866888]
We develop a new framework called Adaptive Explainable Neural Networks (AxNN) for achieving the dual goals of good predictive performance and model interpretability.
For predictive performance, we build a structured neural network made up of ensembles of generalized additive model networks and additive index models.
For interpretability, we show how to decompose the results of AxNN into main effects and higher-order interaction effects.
arXiv Detail & Related papers (2020-04-05T23:40:57Z) - Dynamic Hierarchical Mimicking Towards Consistent Optimization
Objectives [73.15276998621582]
We propose a generic feature learning mechanism to advance CNN training with enhanced generalization ability.
Partially inspired by DSN, we fork delicately designed side branches from the intermediate layers of a given neural network.
Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method.
arXiv Detail & Related papers (2020-03-24T09:56:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.