DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture
- URL: http://arxiv.org/abs/2511.17354v1
- Date: Fri, 21 Nov 2025 16:18:50 GMT
- Title: DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture
- Authors: Xiangteng He, Shunsuke Sakai, Kun Yuan, Nicolas Padoy, Tatsuhito Hasegawa, Leonid Sigal,
- Abstract summary: DSeq-JEPA bridges predictive and autoregressive self-supervised learning, integrating JEPA-style latent prediction with GPT-style sequential reasoning.<n>DSeq-JEPA consistently focuses on more discriminative and generalizable representations than I-JEPA variants.
- Score: 34.31498256147088
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns visual representations by predicting latent embeddings of masked regions from visible context. However, it treats all regions uniformly and independently, lacking an explicit notion of where or in what order predictions should be made. Inspired by human visual perception, which deploys attention selectively and sequentially from the most informative to secondary regions, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges predictive and autoregressive self-supervised learning, integrating JEPA-style latent prediction with GPT-style sequential reasoning. Specifically, DSeq-JEPA (i) first identifies primary discriminative regions based on a transformer-derived saliency map, emphasizing the distribution of visual importance, and then (ii) predicts subsequent regions in this discriminative order, progressively forming a curriculum-like semantic progression from primary to secondary cues -- a form of GPT-style pre-training. Extensive experiments across diverse tasks, including image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB-200-2011, Stanford-Cars), detection and segmentation (MS-COCO, ADE20K), and low-level reasoning tasks (Clevr/Count, Clevr/Dist), demonstrate that DSeq-JEPA consistently focuses on more discriminative and generalizable representations than I-JEPA variants. Project page: https://github.com/SkyShunsuke/DSeq-JEPA.
Related papers
- US-JEPA: A Joint Embedding Predictive Architecture for Medical Ultrasound [5.216814358105614]
Ultrasound (US) imaging poses unique challenges for representation learning due to its inherently noisy acquisition process.<n>Joint-Embedding Predictive Architectures (JEPAs) predict masked latent representations rather than raw pixels.<n>We propose US-JEPA, a self-supervised framework that adopts the Static-teacher Asymmetric Latent Training (SALT) objective.
arXiv Detail & Related papers (2026-02-22T19:56:56Z) - ACT-JEPA: Novel Joint-Embedding Predictive Architecture for Efficient Policy Representation Learning [90.41852663775086]
ACT-JEPA is a novel architecture that integrates imitation learning and self-supervised learning.<n>We train a policy to predict action sequences and abstract observation sequences.<n>Our experiments show that ACT-JEPA improves the quality of representations by learning temporal environment dynamics.
arXiv Detail & Related papers (2025-01-24T16:41:41Z) - Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning [14.869908713261227]
Contrastive-JEPA integrates the Image-based Joint-Embedding Predictive Architecture with the Variance-Invariance-Covariance Regularization (VICReg) strategy.
C-JEPA significantly enhances the stability and quality of visual representation learning.
When pre-trained on the ImageNet-1K dataset, C-JEPA exhibits rapid and improved convergence in both linear probing and fine-tuning performance metrics.
arXiv Detail & Related papers (2024-10-25T13:48:12Z) - Towards Distribution-Agnostic Generalized Category Discovery [51.52673017664908]
Data imbalance and open-ended distribution are intrinsic characteristics of the real visual world.
We propose a Self-Balanced Co-Advice contrastive framework (BaCon)
BaCon consists of a contrastive-learning branch and a pseudo-labeling branch, working collaboratively to provide interactive supervision to resolve the DA-GCD task.
arXiv Detail & Related papers (2023-10-02T17:39:58Z) - Graph-level Representation Learning with Joint-Embedding Predictive Architectures [43.89120279424267]
Joint-Embedding Predictive Architectures (JEPAs) have emerged as a novel and powerful technique for self-supervised representation learning.<n>We show that graph-level representations can be effectively modeled using this paradigm by proposing a Graph Joint-Embedding Predictive Architecture (Graph-JEPA)<n>In particular, we employ masked modeling and focus on predicting the latent representations of masked subgraphs starting from the latent representation of a context subgraph.
arXiv Detail & Related papers (2023-09-27T20:42:02Z) - 1st Place Solution for PSG competition with ECCV'22 SenseHuman Workshop [1.5362025549031049]
Panoptic Scene Graph (PSG) generation aims to generate scene graph representations based on panoptic segmentation instead of rigid bounding boxes.
We propose GRNet, a Global Relation Network in two-stage paradigm, where the pre-extracted local object features and their corresponding masks are fed into a transformer with class embeddings.
We conduct comprehensive experiments on OpenPSG dataset and achieve the state-of-art performance on the leadboard.
arXiv Detail & Related papers (2023-02-06T09:47:46Z) - Learning What Not to Segment: A New Perspective on Few-Shot Segmentation [63.910211095033596]
Recently few-shot segmentation (FSS) has been extensively developed.
This paper proposes a fresh and straightforward insight to alleviate the problem.
In light of the unique nature of the proposed approach, we also extend it to a more realistic but challenging setting.
arXiv Detail & Related papers (2022-03-15T03:08:27Z) - Consistency Regularization for Deep Face Anti-Spoofing [69.70647782777051]
Face anti-spoofing (FAS) plays a crucial role in securing face recognition systems.
Motivated by this exciting observation, we conjecture that encouraging feature consistency of different views may be a promising way to boost FAS models.
We enhance both Embedding-level and Prediction-level Consistency Regularization (EPCR) in FAS.
arXiv Detail & Related papers (2021-11-24T08:03:48Z) - HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning [74.76431541169342]
Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones.
We propose a novel hierarchical semantic-visual adaptation (HSVA) framework to align semantic and visual domains.
Experiments on four benchmark datasets demonstrate HSVA achieves superior performance on both conventional and generalized ZSL.
arXiv Detail & Related papers (2021-09-30T14:27:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.