seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models
- URL: http://arxiv.org/abs/2505.03176v2
- Date: Thu, 22 May 2025 06:37:05 GMT
- Title: seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models
- Authors: Hafez Ghaemi, Eilif Muller, Shahab Bakhtiari
- Abstract summary: We propose seq-JEPA, a world modeling framework that introduces architectural biases into joint-embedding predictive architectures. seq-JEPA simultaneously learns two architecturally segregated representations: one equivariant to specified transformations and another invariant to them. It excels at tasks that inherently require aggregating a sequence of observations, such as path integration across actions and predictive learning across eye movements.
- Score: 1.474723404975345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current self-supervised algorithms commonly rely on transformations such as data augmentation and masking to learn visual representations. This is achieved by enforcing invariance or equivariance with respect to these transformations after encoding two views of an image. This dominant two-view paradigm often limits the flexibility of learned representations for downstream adaptation by creating performance trade-offs between high-level invariance-demanding tasks such as image classification and more fine-grained equivariance-related tasks. In this work, we propose \emph{seq-JEPA}, a world modeling framework that introduces architectural inductive biases into joint-embedding predictive architectures to resolve this trade-off. Without relying on dual equivariance predictors or loss terms, seq-JEPA simultaneously learns two architecturally segregated representations: one equivariant to specified transformations and another invariant to them. To do so, our model processes short sequences of different views (observations) of inputs. Each encoded view is concatenated with an embedding of the relative transformation (action) that produces the next observation in the sequence. These view-action pairs are passed through a transformer encoder that outputs an aggregate representation. A predictor head then conditions this aggregate representation on the upcoming action to predict the representation of the next observation. Empirically, seq-JEPA demonstrates strong performance on both equivariant and invariant benchmarks without sacrificing one for the other. Furthermore, it excels at tasks that inherently require aggregating a sequence of observations, such as path integration across actions and predictive learning across eye movements.
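Read literally, the abstract implies a training step like the one sketched below. This is a minimal PyTorch reconstruction from the abstract alone: the toy MLP encoder, all dimensions, the use of the last transformer token as the aggregate, and the stop-gradient MSE target are assumptions, not the authors' exact design.
```python
# Minimal sketch of one seq-JEPA step, reconstructed from the abstract alone.
# The MLP encoder, all sizes, the last-token aggregate, the stop-gradient
# target, and the MSE loss are assumptions, not the authors' exact choices.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, K, D, A = 8, 4, 256, 16           # batch, observed views, embed dim, action dim

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, D))  # stand-in view encoder
act_embed = nn.Linear(2, A)          # embeds a relative transformation, e.g. a 2-D glimpse shift
to_token = nn.Linear(D + A, D)       # fuses each view with the action producing the next view
aggregator = nn.TransformerEncoder(  # outputs the aggregate representation
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2)
predictor = nn.Sequential(nn.Linear(D + A, D), nn.GELU(), nn.Linear(D, D))

views = torch.randn(B, K + 1, 3, 32, 32)  # K observed views plus the next one to predict
actions = torch.randn(B, K, 2)            # actions[:, t] produces view t + 1 from view t

z = encoder(views[:, :K].reshape(B * K, 3, 32, 32)).reshape(B, K, D)   # encode each view
tokens = to_token(torch.cat([z, act_embed(actions)], dim=-1))          # view-action pairs
agg = aggregator(tokens)[:, -1]                                        # aggregate representation
pred = predictor(torch.cat([agg, act_embed(actions[:, -1])], dim=-1))  # condition on upcoming action

with torch.no_grad():                     # stop-gradient target; the paper may use an EMA encoder
    target = encoder(views[:, K])         # representation of the next observation
loss = F.mse_loss(pred, target)
loss.backward()
```
In this reading, the per-view encoder output (z) would play the equivariant role and the post-transformer aggregate (agg) the invariant one, giving the two architecturally segregated representations the abstract describes.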
Related papers
- Self-Supervised Learning based on Transformed Image Reconstruction for Equivariance-Coherent Feature Representation [3.7622885602373626]
We propose a self-supervised learning approach for learning computer vision features. The system learns transformations independently by reconstructing images that have undergone previously unseen transformations. Our approach performs strongly on a rich set of realistic computer vision downstream tasks, almost always improving over all baselines.
arXiv Detail & Related papers (2025-03-24T15:01:50Z) - Self-supervised Transformation Learning for Equivariant Representations [26.207358743969277]
Unsupervised representation learning has significantly advanced various machine learning tasks. We propose Self-supervised Transformation Learning (STL), replacing transformation labels with transformation representations derived from image pairs. We demonstrate the approach's effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks.
arXiv Detail & Related papers (2025-01-15T10:54:21Z) - Unsupervised Representation Learning from Sparse Transformation Analysis [79.94858534887801]
We propose to learn representations from sequence data by factorizing the transformations of the latent variables into sparse components.
Input data are first encoded as distributions of latent activations and subsequently transformed using a probability flow model.
arXiv Detail & Related papers (2024-10-07T23:53:25Z) - PseudoNeg-MAE: Self-Supervised Point Cloud Learning using Conditional Pseudo-Negative Embeddings [55.55445978692678]
PseudoNeg-MAE enhances global feature representation of point cloud masked autoencoders by making them both discriminative and sensitive to transformations. We propose a novel loss that explicitly penalizes invariant collapse, enabling the network to capture richer transformation cues while preserving discriminative representations.
arXiv Detail & Related papers (2024-09-24T07:57:21Z) - In-Context Symmetries: Self-Supervised Learning through Contextual World Models [41.61360016455319]
We propose to learn a general representation that can adapt to be invariant or equivariant to different transformations by paying attention to context.
Our proposed algorithm, Contextual Self-Supervised Learning (ContextSSL), learns equivariance to all transformations.
arXiv Detail & Related papers (2024-05-28T14:03:52Z) - EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention [88.45459681677369]
We propose a novel transformer variant with complex vector attention, named EulerFormer.
It provides a unified theoretical framework to formulate both semantic difference and positional difference.
It is more robust to semantic variations and possesses superior theoretical properties in principle.
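The complex-attention idea can be illustrated with a generic rotary formulation, sketched below; this is not EulerFormer's exact construction, and the channel pairing and frequency schedule are assumptions.
```python
# Generic complex-rotation attention sketch (RoPE-like), illustrating how a
# positional difference enters attention as a phase difference via Euler's
# formula; not EulerFormer's exact formulation.
import torch

def rotate(x, pos, base=10000.0):
    """Rotate adjacent feature pairs of x by position-dependent angles."""
    d = x.shape[-1] // 2
    freqs = base ** (-torch.arange(d) / d)              # per-pair angular frequency
    ang = pos[:, None] * freqs[None, :]                 # (T, d) rotation angles
    xc = torch.view_as_complex(x.reshape(*x.shape[:-1], d, 2).contiguous())
    return torch.view_as_real(xc * torch.polar(torch.ones_like(ang), ang)).flatten(-2)

T, D = 16, 64
q, k = torch.randn(T, D), torch.randn(T, D)
pos = torch.arange(T, dtype=torch.float32)
scores = rotate(q, pos) @ rotate(k, pos).T / D ** 0.5   # phases encode relative position
```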
arXiv Detail & Related papers (2024-03-26T14:18:43Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Self-supervised learning of Split Invariant Equivariant representations [0.0]
We introduce 3DIEBench, consisting of renderings from 3D models over 55 classes and more than 2.5 million images, where we have full control over the transformations applied to the objects.
We introduce a predictor architecture based on hypernetworks to learn equivariant representations with no possible collapse to invariance.
We introduce SIE (Split Invariant-Equivariant) which combines the hypernetwork-based predictor with representations split in two parts, one invariant, the other equivariant, to learn richer representations.
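A hypernetwork-based predictor of the kind this summary describes could look like the sketch below; the linear predictor form and all dimensions are assumptions, not SIE's exact design.
```python
# Sketch of a hypernetwork-based equivariant predictor in the spirit of SIE:
# the predictor's weights are generated from the transformation g, so the map
# cannot collapse to a single g-independent function. Sizes are assumptions.
import torch
import torch.nn as nn

D_inv, D_eq, G = 128, 128, 6   # invariant dim, equivariant dim, transformation params

hyper = nn.Sequential(         # maps g to the weights and bias of a linear predictor
    nn.Linear(G, 256), nn.ReLU(), nn.Linear(256, D_eq * D_eq + D_eq))

def predict_equivariant(y_eq, g):
    params = hyper(g)
    W = params[:, :D_eq * D_eq].reshape(-1, D_eq, D_eq)   # per-sample weight matrix
    b = params[:, D_eq * D_eq:]                           # per-sample bias
    return torch.bmm(W, y_eq.unsqueeze(-1)).squeeze(-1) + b

y = torch.randn(8, D_inv + D_eq)           # split representation: [invariant | equivariant]
y_inv, y_eq = y[:, :D_inv], y[:, D_inv:]   # the invariant half bypasses the predictor
pred_eq = predict_equivariant(y_eq, torch.randn(8, G))
```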
arXiv Detail & Related papers (2023-02-14T07:53:18Z) - CIPER: Combining Invariant and Equivariant Representations Using Contrastive and Predictive Learning [6.117084972237769]
We introduce Contrastive Invariant and Predictive Equivariant Representation learning (CIPER)
CIPER comprises both invariant and equivariant learning objectives using one shared encoder and two different output heads on top of the encoder.
We evaluate our method on static image tasks and time-augmented image datasets.
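The one-encoder/two-heads layout the summary describes might be sketched as below; treating the equivariant objective as regressing transformation parameters from a view pair, and the InfoNCE form of the contrastive loss, are assumptions.
```python
# Sketch of a shared encoder with an invariant (contrastive) head and an
# equivariant (predictive) head, as the CIPER summary describes; the InfoNCE
# loss and transformation-regression target are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, D))  # one shared encoder
inv_head = nn.Linear(D, 128)   # invariant head, trained contrastively across views
eq_head = nn.Linear(2 * D, 6)  # equivariant head, predicts transformation parameters

x1, x2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)  # two views of each image
h1, h2 = encoder(x1), encoder(x2)
z1, z2 = F.normalize(inv_head(h1)), F.normalize(inv_head(h2))
inv_loss = F.cross_entropy(z1 @ z2.T / 0.1, torch.arange(8))   # pull matching views together
t_hat = eq_head(torch.cat([h1, h2], dim=-1))                   # regress the applied transformation
```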
arXiv Detail & Related papers (2023-02-05T07:50:46Z) - Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z) - Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective [72.55093886515824]
We introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables.
We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph.
Experiment results on synthetic and real datasets show that our three proposed components significantly improve the robustness and reusability of the learned motion representations.
arXiv Detail & Related papers (2021-11-29T18:59:09Z) - Frame Averaging for Invariant and Equivariant Network Design [50.87023773850824]
We introduce Frame Averaging (FA), a framework for adapting known (backbone) architectures to become invariant or equivariant to new symmetry types.
We show that FA-based models have maximal expressive power in a broad setting.
We propose a new class of universal Graph Neural Networks (GNNs), universal Euclidean motion invariant point cloud networks, and Euclidean motion invariant Message Passing (MP) GNNs.
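For reference, the frame-averaging operator the summary refers to takes the following form (stated here in common notation that may differ from the paper's, with group representations rho_1, rho_2 and a frame F(x) that is a subset of the group G):
```latex
% Equivariant frame averaging of a backbone \phi; dropping \rho_2(g) yields
% the invariant version.
\langle \phi \rangle_{\mathcal{F}}(x)
  = \frac{1}{|\mathcal{F}(x)|} \sum_{g \in \mathcal{F}(x)} \rho_2(g)\, \phi\!\left(\rho_1(g)^{-1} x\right)
```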
arXiv Detail & Related papers (2021-10-07T11:05:23Z) - Parameter Decoupling Strategy for Semi-supervised 3D Left Atrium Segmentation [0.0]
We present a novel semi-supervised segmentation model based on parameter decoupling strategy to encourage consistent predictions from diverse views.
Our method achieves competitive results against state-of-the-art semi-supervised methods on the Atrial Challenge dataset.
arXiv Detail & Related papers (2021-09-20T14:51:42Z) - Topographic VAEs learn Equivariant Capsules [84.33745072274942]
We introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables.
We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
We demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
arXiv Detail & Related papers (2021-09-03T09:25:57Z) - Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model [58.17021225930069]
We explain the rationale of the Vision Transformer by analogy with the proven and practical Evolutionary Algorithm (EA).
We propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly.
Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works.
arXiv Detail & Related papers (2021-05-31T16:20:03Z)