StreaMulT: Streaming Multimodal Transformer for Heterogeneous and
Arbitrary Long Sequential Data
- URL: http://arxiv.org/abs/2110.08021v2
- Date: Wed, 21 Feb 2024 21:48:55 GMT
- Title: StreaMulT: Streaming Multimodal Transformer for Heterogeneous and
Arbitrary Long Sequential Data
- Authors: Victor Pellegrain (1 and 2), Myriam Tami (2), Michel Batteux (1),
Céline Hudelot (2) ((1) Institut de Recherche Technologique SystemX, (2)
Université Paris-Saclay, CentraleSupélec, MICS)
- Abstract summary: StreaMulT is a Streaming Multimodal Transformer relying on cross-modal attention and on a memory bank to process arbitrarily long input sequences at training time and run in a streaming way at inference.
StreaMulT improves the state-of-the-art metrics on the CMU-MOSEI dataset for the Multimodal Sentiment Analysis task, while handling much longer inputs than other multimodal models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing complexity of Industry 4.0 systems brings new challenges
regarding predictive maintenance tasks such as fault detection and diagnosis. A
corresponding and realistic setting includes multi-source data streams from
different modalities, such as sensor measurement time series, machine images,
textual maintenance reports, etc. These heterogeneous multimodal streams also
differ in their acquisition frequency, may embed temporally unaligned
information and can be arbitrarily long, depending on the considered system and
task. Whereas multimodal fusion has been largely studied in a static setting,
to the best of our knowledge, no previous work considers arbitrarily long
multimodal streams together with related tasks such as prediction across time.
Thus, in this paper, we first formalize this new paradigm of heterogeneous
multimodal learning in a streaming setting. To
tackle this challenge, we propose StreaMulT, a Streaming Multimodal Transformer
relying on cross-modal attention and on a memory bank to process arbitrarily
long input sequences at training time and run in a streaming way at inference.
StreaMulT improves the state-of-the-art metrics on the CMU-MOSEI dataset for
the Multimodal Sentiment Analysis task, while handling much longer inputs than
other multimodal models. The experiments ultimately highlight the importance of
the textual embedding layer, calling recent improvements on Multimodal
Sentiment Analysis benchmarks into question.
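As a reading of the abstract, the sketch below combines the two mechanisms it names: cross-modal attention, where queries come from one modality and keys/values from another, and a fixed-size memory bank carried across stream chunks so context stays bounded however long the input grows. All module names, sizes, and the memory-update rule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalMemoryBlock(nn.Module):
    """Queries from one modality attend over another modality plus a
    fixed-size memory bank, so per-chunk context stays bounded."""

    def __init__(self, dim: int, heads: int = 4, mem_slots: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.register_buffer("memory", torch.zeros(1, mem_slots, dim))

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (B, T_tgt, dim) query modality; source: (B, T_src, dim).
        mem = self.memory.expand(target.size(0), -1, -1)
        context = torch.cat([mem, source], dim=1)  # keys/values: memory + source
        out, _ = self.attn(query=target, key=context, value=context)
        out = self.norm(target + out)
        # FIFO-style memory update, detached so gradients stay within the
        # current chunk (an illustrative rule; the paper's may differ).
        new_mem = torch.cat([mem, out], dim=1)[:, -mem.size(1):]
        self.memory = new_mem.mean(dim=0, keepdim=True).detach()
        return out
```

Streaming inference then amounts to calling the block chunk by chunk; the memory bank is what carries information across chunk boundaries.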
Related papers
- See it, Think it, Sorted: Large Multimodal Models are Few-shot Time Series Anomaly Analyzers [23.701716999879636]
Time series anomaly detection (TSAD) is becoming increasingly vital due to the rapid growth of time series data.
We introduce a pioneering framework called the Time Series Anomaly Multimodal Analyzer (TAMA) to enhance both the detection and interpretation of anomalies.
arXiv Detail & Related papers (2024-11-04T10:28:41Z) - DRFormer: Multi-Scale Transformer Utilizing Diverse Receptive Fields for Long Time-Series Forecasting [3.420673126033772]
We propose a dynamic tokenizer with a dynamic sparse learning algorithm to capture diverse receptive fields and sparse patterns of time series data.
Our proposed model, named DRFormer, is evaluated on various real-world datasets, and experimental results demonstrate its superiority compared to existing methods.
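A minimal sketch of the multi-scale idea behind a dynamic tokenizer: patching the same series at several lengths gives attention tokens with diverse receptive fields. The patch sizes and projection layers are assumptions for illustration, not DRFormer's actual design.

```python
import torch
import torch.nn as nn

class MultiScaleTokenizer(nn.Module):
    def __init__(self, dim: int, patch_sizes=(4, 8, 16)):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(p, dim) for p in patch_sizes)
        self.patch_sizes = patch_sizes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L) univariate series; L assumed divisible by each patch size.
        tokens = []
        for p, proj in zip(self.patch_sizes, self.projs):
            patches = x.unfold(dimension=1, size=p, step=p)  # (B, L//p, p)
            tokens.append(proj(patches))                     # (B, L//p, dim)
        return torch.cat(tokens, dim=1)  # one token sequence mixing all scales
```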
arXiv Detail & Related papers (2024-08-05T07:26:47Z) - Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome the challenge that some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
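A hedged sketch of alternating unimodal optimization with a shared head: each step updates a single modality's encoder together with the common head, so no modality can dominate a joint gradient. The loop below is illustrative; the published MLA procedure may differ in its schedule and optimizer.

```python
import itertools
import torch
import torch.nn as nn

def alternating_unimodal_train(encoders: dict, head: nn.Module,
                               loaders: dict, loss_fn,
                               steps: int = 300, lr: float = 1e-3):
    # One optimizer per modality, covering that encoder plus the shared head.
    opts = {m: torch.optim.SGD(list(enc.parameters()) + list(head.parameters()), lr=lr)
            for m, enc in encoders.items()}
    iters = {m: itertools.cycle(dl) for m, dl in loaders.items()}
    for _, modality in zip(range(steps), itertools.cycle(encoders)):
        x, y = next(iters[modality])
        loss = loss_fn(head(encoders[modality](x)), y)  # shared head, one modality
        opts[modality].zero_grad()
        loss.backward()
        opts[modality].step()
```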
arXiv Detail & Related papers (2023-11-17T18:57:40Z) - MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks [31.59812777504438]
We present MultiModN, a network that fuses latent representations in a sequence of any number, combination, or type of modality.
We show that MultiModN's sequential MM fusion does not compromise performance compared with a baseline of parallel fusion.
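The sequential-fusion idea can be sketched as a running state updated by one module per available modality, so any number or combination of modalities, including missing ones, can be fused in order. The GRU-cell update is an assumption standing in for MultiModN's actual modules.

```python
import torch
import torch.nn as nn

class SequentialFusion(nn.Module):
    def __init__(self, dim: int, modalities: list):
        super().__init__()
        # One update module per modality; applied only when that input exists.
        self.updates = nn.ModuleDict({m: nn.GRUCell(dim, dim) for m in modalities})
        self.state0 = nn.Parameter(torch.zeros(dim))

    def forward(self, inputs: dict) -> torch.Tensor:
        # inputs: {modality: (B, dim)}; absent modalities are simply skipped.
        batch = next(iter(inputs.values())).size(0)
        state = self.state0.expand(batch, -1)
        for m, x in inputs.items():
            state = self.updates[m](x, state)
        return state  # fused representation, fed to one or more task heads
```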
arXiv Detail & Related papers (2023-09-25T13:16:57Z) - FormerTime: Hierarchical Multi-Scale Representations for Multivariate
Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
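One way to read merits (1)-(3) together is a stack of stages that alternate attention with convolutional downsampling: each stage sees a coarser scale and a shorter sequence, which eases self-attention's quadratic cost. The sketch below is illustrative, not FormerTime's architecture.

```python
import torch
import torch.nn as nn

class HierarchicalStage(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.down = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.encoder(x)               # (B, T, dim): attention at this scale
        x = self.down(x.transpose(1, 2))  # convolution halves the length
        return x.transpose(1, 2)          # (B, T//2, dim) for the next stage

# Stacking stages gives coarser, cheaper attention at each level, e.g.:
# model = nn.Sequential(HierarchicalStage(64), HierarchicalStage(64))
```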
arXiv Detail & Related papers (2023-02-20T07:46:14Z) - Ti-MAE: Self-Supervised Masked Time Series Autoencoders [16.98069693152999]
We propose a novel framework named Ti-MAE, in which the input time series are assumed to follow an integrated distribution.
Ti-MAE randomly masks out embedded time series data and learns an autoencoder to reconstruct them at the point-level.
Experiments on several public real-world datasets demonstrate that our framework of masked autoencoding could learn strong representations directly from the raw data.
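A minimal masked-autoencoding step in the spirit of this summary: randomly mask embedded time steps and reconstruct them at point level, with the loss taken only on masked positions. The mask ratio and networks are assumptions, not Ti-MAE's exact setup.

```python
import torch
import torch.nn as nn

def masked_recon_loss(series: torch.Tensor, embed: nn.Module,
                      decoder: nn.Module, mask_ratio: float = 0.75) -> torch.Tensor:
    # series: (B, L, 1). Zero out a random subset of embedded time steps.
    B, L, _ = series.shape
    tokens = embed(series)                               # (B, L, dim)
    mask = torch.rand(B, L, device=series.device) < mask_ratio
    tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = decoder(tokens)                              # (B, L, 1) point-level output
    return (recon - series)[mask].pow(2).mean()          # loss on masked points only
```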
arXiv Detail & Related papers (2023-01-21T03:20:23Z) - Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment
Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z) - Multi-scale Attention Flow for Probabilistic Time Series Forecasting [68.20798558048678]
We propose a novel non-autoregressive deep learning model, called Multi-scale Attention Normalizing Flow (MANF).
Our model avoids the influence of cumulative error and does not increase the time complexity.
Our model achieves state-of-the-art performance on many popular multivariate datasets.
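The non-autoregressive claim can be illustrated with a single conditional affine coupling: the whole forecast horizon is produced in one parallel transform of latent noise, so no step's error feeds into the next. This stands in loosely for MANF; the real model's flow and attention design differ.

```python
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    def __init__(self, horizon: int, ctx_dim: int):
        super().__init__()
        self.half = horizon // 2
        self.net = nn.Sequential(nn.Linear(self.half + ctx_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (horizon - self.half)))

    def forward(self, z: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # z: (B, horizon) latent noise; ctx: (B, ctx_dim) encoder summary.
        z1, z2 = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(torch.cat([z1, ctx], dim=-1)).chunk(2, dim=-1)
        y2 = z2 * torch.exp(scale) + shift  # all future steps produced in parallel
        return torch.cat([z1, y2], dim=-1)
```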
arXiv Detail & Related papers (2022-05-16T07:53:42Z) - Channel Exchanging Networks for Multimodal and Multitask Dense Image
Prediction [125.18248926508045]
We propose the Channel-Exchanging-Network (CEN), which is self-adaptive, parameter-free, and, more importantly, applicable to both multimodal fusion and multitask learning.
CEN dynamically exchanges channels between subnetworks of different modalities.
For the application of dense image prediction, the validity of CEN is tested by four different scenarios.
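One common reading of parameter-free channel exchanging is to treat a small BatchNorm scaling factor as a sign that a channel is uninformative and to substitute the other modality's channel at the same position. The threshold and tensor shapes below are assumptions for illustration; the published method's details may differ.

```python
import torch

def exchange_channels(feat_a: torch.Tensor, feat_b: torch.Tensor,
                      gamma_a: torch.Tensor, gamma_b: torch.Tensor,
                      thresh: float = 1e-2):
    # feat_*: (B, C, H, W) features; gamma_*: (C,) BatchNorm scale factors.
    swap_a = gamma_a.abs() < thresh  # channels of A judged uninformative
    swap_b = gamma_b.abs() < thresh
    out_a = torch.where(swap_a[None, :, None, None], feat_b, feat_a)
    out_b = torch.where(swap_b[None, :, None, None], feat_a, feat_b)
    return out_a, out_b
```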
arXiv Detail & Related papers (2021-12-04T05:47:54Z) - Multimodal Categorization of Crisis Events in Social Media [81.07061295887172]
We present a new multimodal fusion method that leverages both images and texts as input.
In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities.
We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
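A sketch of cross-attention used as a filter: one modality's tokens query the other's, and a learned gate downweights attended components judged uninformative or misleading. The gating choice and shapes are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class FilteringCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text: (B, T, dim) queries; image: (B, R, dim) region features.
        attended, _ = self.attn(query=text, key=image, value=image)
        # Gate in [0, 1] suppresses attended content judged unreliable.
        g = self.gate(torch.cat([text, attended], dim=-1))
        return text + g * attended
```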
arXiv Detail & Related papers (2020-04-10T06:31:30Z)