Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks
- URL: http://arxiv.org/abs/2511.19856v1
- Date: Tue, 25 Nov 2025 02:35:48 GMT
- Title: Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks
- Authors: Xiangkai Ma, Han Zhang, Wenzhong Li, Sanglu Lu
- Abstract summary: TimeArtist is a temporal-visual conversion framework that pioneers semantic-level alignment between time series fluctuations and visual concepts. Our work establishes a new paradigm for cross-modal generation, bridging the gap between temporal dynamics and visual semantics.
- Score: 19.299293037292113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in aligning and generating content across text and image modalities. However, the potential of using non-visual, continuous sequential data as a conditioning signal for high-fidelity image generation remains largely unexplored. Furthermore, existing methods that convert time series into "pseudo-images" for temporal forecasting fail to establish semantic-level alignment. In this paper, we propose TimeArtist, a temporal-visual conversion framework that pioneers semantic-level alignment between time series fluctuations and visual concepts. It follows a "warmup-align" paradigm: first, a dual-autoencoder and a shared quantizer are trained self-supervised on large-scale datasets to learn modality-shared representations; then, the encoders and quantizer are frozen, and a projection is introduced to align temporal and visual samples at the representation level. TimeArtist establishes a versatile cross-modal framework, enabling high-quality, diverse image generation directly from time series while capturing temporal fluctuation patterns to render images via style transfer. Extensive experiments show that TimeArtist achieves satisfactory performance on image generation metrics while attaining superior results on zero-shot temporal tasks. Our work establishes a new paradigm for cross-modal generation, bridging the gap between temporal dynamics and visual semantics.
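The two-stage "warmup-align" paradigm from the abstract can be pictured with a toy sketch. Everything below (the random encoder matrices, the codebook size, the identity projection) is an illustrative stand-in rather than the paper's implementation; the point is only that both modalities quantize into one shared codebook, and that the later alignment stage trains a projection while the encoders and quantizer stay frozen.

```python
import numpy as np

# Hypothetical sketch of the "warmup-align" idea: two encoders map their
# modality into a shared latent space, and one shared codebook quantizes
# both, so temporal and visual samples can meet at the same discrete codes.
# All shapes, weights, and names here are illustrative assumptions.

rng = np.random.default_rng(0)

CODEBOOK_SIZE, LATENT_DIM = 16, 8
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))  # shared quantizer

def quantize(z):
    """Nearest-neighbor lookup into the shared codebook."""
    dists = np.linalg.norm(codebook - z, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# Stage 1 (warmup): each modality gets its own encoder into the shared space.
W_time = rng.normal(size=(32, LATENT_DIM))   # stand-in time-series encoder
W_image = rng.normal(size=(64, LATENT_DIM))  # stand-in image encoder

z_time = rng.normal(size=32) @ W_time        # encode one time-series sample
z_image = rng.normal(size=64) @ W_image      # encode one image sample

# Stage 2 (align): encoders and codebook are frozen; only a projection is
# learned so temporal latents land on the codes their visual pairs use.
P = np.eye(LATENT_DIM)                       # projection, identity pre-training
idx_t, _ = quantize(z_time @ P)
idx_i, _ = quantize(z_image)                 # both index the same codebook
```

In this sketch, alignment training would update only `P` (e.g. by pulling `z_time @ P` toward the code of the paired image), which mirrors the "frozen encoders and quantizer" constraint the abstract describes.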
Related papers
- TimeOmni-VL: Unified Models for Time Series Understanding and Generation [66.55423802406078]
TimeOmni-VL is a vision-centric framework that unifies time series understanding and generation. TimeOmni-VL is the first to leverage time series understanding as an explicit control signal for high-fidelity generation. Experiments confirm that this unified approach significantly improves both semantic understanding and numerical precision.
arXiv Detail & Related papers (2026-02-19T07:50:11Z)
- iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation [60.66986667921744]
iMontage is a unified framework designed to repurpose a powerful video model into an all-in-one image generator. We propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors.
arXiv Detail & Related papers (2025-11-25T18:54:16Z)
- Growing Visual Generative Capacity for Pre-Trained MLLMs [60.826355079902505]
Bridge is a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability. We propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens.
arXiv Detail & Related papers (2025-10-02T00:40:02Z)
- Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers [49.07665715422702]
We propose Time Vision Transformer (TiViT), a framework that converts time series into images. We show that TiViT achieves state-of-the-art performance on standard time series classification benchmarks. Our findings reveal a new direction for reusing vision representations in a non-visual domain.
arXiv Detail & Related papers (2025-06-10T09:54:51Z)
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens [52.21981295470491]
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images. We find that, although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other.
arXiv Detail & Related papers (2025-03-17T17:58:30Z)
- Grid: Omni Visual Generation [34.57101244093434]
Current approaches either build specialized video models from scratch with enormous computational costs or add separate motion modules to image generators. We observe that modern image generation models possess underutilized potential in handling structured layouts with implicit temporal understanding. We introduce GRID, which reformulates temporal sequences as grid layouts, enabling holistic processing of visual sequences.
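GRID's core move, tiling a temporal sequence into one grid image so an image model can process the whole sequence at once, can be sketched in a few lines. The frame and grid sizes below are arbitrary assumptions, not the paper's settings:

```python
import numpy as np

# Minimal sketch of the grid-layout idea: a sequence of T frames is tiled
# into a single (rows*H, cols*W) mosaic. Shapes here are illustrative.

def to_grid(frames, rows, cols):
    """Tile (T, H, W) frames into one (rows*H, cols*W) image, row-major."""
    T, H, W = frames.shape
    assert T == rows * cols, "sequence length must fill the grid exactly"
    # (rows, cols, H, W) -> (rows, H, cols, W) -> flatten to one mosaic
    return (frames.reshape(rows, cols, H, W)
                  .transpose(0, 2, 1, 3)
                  .reshape(rows * H, cols * W))

frames = np.arange(8 * 4 * 4).reshape(8, 4, 4)  # 8 toy 4x4 frames
grid = to_grid(frames, rows=2, cols=4)          # one 8x16 mosaic
```

The inverse reshape recovers the original sequence, which is what lets a single image model read or write a whole temporal sequence holistically.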
arXiv Detail & Related papers (2024-12-14T07:22:03Z)
- Temporal Embeddings: Scalable Self-Supervised Temporal Representation Learning from Spatiotemporal Data for Multimodal Computer Vision [1.4127889233510498]
A novel approach is proposed to stratify landscapes based on mobility activity time series.
The pixel-wise embeddings are converted to image-like channels that can be used for task-based, multimodal modeling.
arXiv Detail & Related papers (2023-10-16T02:53:29Z)
- ViTs for SITS: Vision Transformers for Satellite Image Time Series [52.012084080257544]
We introduce a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT).
TSViT splits a SITS record into non-overlapping patches in space and time which are tokenized and subsequently processed by a factorized temporo-spatial encoder.
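The non-overlapping space-time patch split described above can be sketched as follows; the tensor layout and patch sizes are illustrative assumptions, not TSViT's actual configuration:

```python
import numpy as np

# Illustrative sketch of splitting a (T, H, W, C) satellite image time
# series into non-overlapping spatio-temporal patches, each flattened
# into one token vector. Sizes below are assumptions for demonstration.

def patchify(sits, t_patch, s_patch):
    """Split (T, H, W, C) into tokens of length t_patch*s_patch*s_patch*C."""
    T, H, W, C = sits.shape
    assert T % t_patch == 0 and H % s_patch == 0 and W % s_patch == 0
    x = sits.reshape(T // t_patch, t_patch,
                     H // s_patch, s_patch,
                     W // s_patch, s_patch, C)
    # bring the three patch-grid indices to the front, then flatten each patch
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, t_patch * s_patch * s_patch * C)

sits = np.zeros((6, 8, 8, 3))              # 6 timestamps of 8x8 3-band tiles
tokens = patchify(sits, t_patch=3, s_patch=4)
```

Each resulting token covers one space-time block; a factorized encoder like the one described above would then attend over the temporal and spatial token axes separately.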
arXiv Detail & Related papers (2023-01-12T11:33:07Z)
- Leveraging Image-based Generative Adversarial Networks for Time Series Generation [4.541582055558865]
We propose a two-dimensional image representation for time series, the Extended Intertemporal Return Plot (XIRP).
Our approach captures the intertemporal time series dynamics in a scale-invariant and invertible way, reducing training time and improving sample quality.
arXiv Detail & Related papers (2021-12-15T11:55:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.