METEOR: Learning Memory and Time Efficient Representations from
Multi-modal Data Streams
- URL: http://arxiv.org/abs/2007.11847v1
- Date: Thu, 23 Jul 2020 08:18:02 GMT
- Title: METEOR: Learning Memory and Time Efficient Representations from
Multi-modal Data Streams
- Authors: Amila Silva, Shanika Karunasekera, Christopher Leckie, Ling Luo
- Abstract summary: We present METEOR, a novel MEmory and Time Efficient Online Representation learning technique.
We show that METEOR preserves the quality of the representations while reducing memory usage by around 80% compared to the conventional memory-intensive embeddings.
- Score: 19.22829945777267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many learning tasks involve multi-modal data streams, where continuous data
from different modes convey a comprehensive description of objects. A major
challenge in this context is how to efficiently interpret multi-modal
information in complex environments. This has motivated numerous studies on
learning unsupervised representations from multi-modal data streams. These
studies aim to understand higher-level contextual information (e.g., a Twitter
message) by jointly learning embeddings for the lower-level semantic units in
different modalities (e.g., text, user, and location of a Twitter message).
However, these methods directly associate each low-level semantic unit with a
continuous embedding vector, which results in high memory requirements. Hence,
deploying and continuously learning such models on low-memory devices (e.g.,
mobile devices) becomes challenging. To address this problem, we present METEOR,
a novel MEmory and Time Efficient Online Representation learning technique,
which: (1) learns compact representations for multi-modal data by sharing
parameters within semantically meaningful groups and preserves the
domain-agnostic semantics; (2) can be accelerated using parallel processes to
accommodate different stream rates while capturing the temporal changes of the
units; and (3) can be easily extended to capture implicit/explicit external
knowledge related to multi-modal data streams. We evaluate METEOR using two
types of multi-modal data streams (i.e., social media streams and shopping
transaction streams) to demonstrate its ability to adapt to different domains.
Our results show that METEOR preserves the quality of the representations while
reducing memory usage by around 80% compared to the conventional
memory-intensive embeddings.
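
The memory saving described above comes from associating many low-level semantic units with a small pool of shared group vectors instead of one full embedding per unit, with updates applied online as stream batches arrive. Below is a minimal sketch of that general idea only; the fixed random grouping, the skip-gram-style objective, and all names are illustrative assumptions, not METEOR's actual grouping or implementation.

```python
import torch
import torch.nn as nn

class GroupSharedEmbedding(nn.Module):
    """Toy group-shared embedding: many units index a small pool of shared group
    vectors plus a tiny per-unit scale, instead of one full vector per unit."""
    def __init__(self, num_units, num_groups, dim):
        super().__init__()
        # Shared pool: num_groups << num_units, so the table is ~num_groups * dim.
        self.group_emb = nn.Embedding(num_groups, dim)
        # Illustrative stand-in for a semantically meaningful grouping:
        # a fixed random (hash-like) assignment of units to groups.
        self.register_buffer("unit2group", torch.randint(0, num_groups, (num_units,)))
        # Small per-unit parameter keeps units within a group distinguishable.
        self.unit_scale = nn.Embedding(num_units, 1)

    def forward(self, unit_ids):
        groups = self.unit2group[unit_ids]
        return self.group_emb(groups) * self.unit_scale(unit_ids)

# Online-style updates on a stream of (unit, context) co-occurrence batches.
emb = GroupSharedEmbedding(num_units=100_000, num_groups=5_000, dim=64)
opt = torch.optim.SGD(emb.parameters(), lr=0.05)
for unit_ids, ctx_ids in [(torch.tensor([1, 7]), torch.tensor([42, 99]))]:  # stream batches
    u, c = emb(unit_ids), emb(ctx_ids)
    loss = -torch.log(torch.sigmoid((u * c).sum(-1))).mean()  # skip-gram-style objective
    opt.zero_grad(); loss.backward(); opt.step()
```

In this toy configuration the tables shrink from 100,000 x 64 = 6.4M parameters to 5,000 x 64 + 100,000 = 0.42M; the paper's reported ~80% reduction comes from its own grouping scheme, not from this sketch.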
Related papers
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing.
It is designed to accurately detect horizontal or oriented objects from any sensor modality.
This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z) - Exemplar Masking for Multimodal Incremental Learning [47.18796033633918]
Multimodal incremental learning needs to digest the information from multiple modalities while concurrently learning new knowledge.
In this paper, we propose the exemplar masking framework to efficiently replay old knowledge.
We show that our exemplar masking framework is more efficient and robust to catastrophic forgetting under the same limited memory buffer.
arXiv Detail & Related papers (2024-12-12T18:40:20Z) - Semantic-Aware Representation of Multi-Modal Data for Data Ingress: A Literature Review [1.8590097948961688]
Generative AI such as Large Language Models (LLMs) is seeing broad adoption for processing multi-modal data such as text, images, audio, and video.
Managing this data efficiently has become a significant practical challenge in industry: twice as much data is not twice as good.
This study focuses on the different semantic-aware techniques to extract embeddings from mono-modal, multi-modal, and cross-modal data.
arXiv Detail & Related papers (2024-07-17T09:49:11Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Factorized Contrastive Learning: Going Beyond Multi-view Redundancy [116.25342513407173]
This paper proposes FactorCL, a new multimodal representation learning method to go beyond multi-view redundancy.
On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-06-08T15:17:04Z) - Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
arXiv Detail & Related papers (2022-12-29T20:39:36Z) - Generalized Product-of-Experts for Learning Multimodal Representations
in Noisy Environments [18.14974353615421]
We propose a novel method for multimodal representation learning in a noisy environment via the generalized product of experts technique.
In the proposed method, we train a separate network for each modality to assess the credibility of information coming from that modality (a minimal sketch of this idea appears after the list below).
We attain state-of-the-art performance on two challenging benchmarks: multimodal 3D hand-pose estimation and multimodal surgical video segmentation.
arXiv Detail & Related papers (2022-11-07T14:27:38Z) - High-Modality Multimodal Transformer: Quantifying Modality & Interaction
Heterogeneity for High-Modality Representation Learning [112.51498431119616]
This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities.
A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
arXiv Detail & Related papers (2022-03-02T18:56:20Z) - Unsupervised Multimodal Language Representations using Convolutional
Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
It is also shown that our method is extremely lightweight and generalizes easily to other tasks and unseen data with only a small performance drop and almost the same number of parameters.
arXiv Detail & Related papers (2021-10-06T18:28:07Z) - Multimodal Clustering Networks for Self-supervised Learning from
Unlabeled Videos [69.61522804742427]
This paper proposes a self-supervised training framework that learns a common multimodal embedding space.
We extend the concept of instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities.
The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains.
arXiv Detail & Related papers (2021-04-26T15:55:01Z)