Spatial-Temporal Pre-Training for Embryo Viability Prediction Using Time-Lapse Videos
- URL: http://arxiv.org/abs/2506.17403v1
- Date: Fri, 20 Jun 2025 18:08:41 GMT
- Title: Spatial-Temporal Pre-Training for Embryo Viability Prediction Using Time-Lapse Videos
- Authors: Zhiyi Shi, Junsik Kim, Helen Y. Yang, Yonghyun Song, Hyun-Jic Oh, Dalit Ben-Yosef, Daniel Needleman, Hanspeter Pfister
- Abstract summary: Self-supervised learning (SSL) can leverage both labeled and unlabeled data to improve prediction. However, existing SSL methods for videos are not directly applicable to embryo development videos due to two challenges. We propose Spatial-Temporal Pre-Training (STPT) to address these challenges. STPT avoids frame-by-frame alignment across videos: the spatial stage learns from alignments within each video and its temporally consistent augmentations, and the temporal stage then models relationships between video embeddings. Our method efficiently handles long videos and temporal variability.
- Score: 20.617772805817154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automating embryo viability prediction for in vitro fertilization (IVF) is important but challenging due to the limited availability of labeled pregnancy outcome data, as only a small fraction of embryos are labeled after transfer. Self-supervised learning (SSL) can leverage both labeled and unlabeled data to improve prediction. However, existing SSL methods for videos are not directly applicable to embryo development videos due to two challenges: (1) embryo time-lapse videos contain hundreds of frames, requiring significant GPU memory for conventional SSL; (2) the dataset contains videos with varying lengths and many outlier frames, causing traditional video alignment methods to struggle with semantic misalignment. We propose Spatial-Temporal Pre-Training (STPT) to address these challenges. STPT includes two stages: spatial and temporal. In each stage, only one encoder is trained while the other is frozen, reducing memory demands. To handle temporal misalignment, STPT avoids frame-by-frame alignment across videos. The spatial stage learns from alignments within each video and its temporally consistent augmentations. The temporal stage then models relationships between video embeddings. Our method efficiently handles long videos and temporal variability. On 23,027 time-lapse videos (3,286 labeled), STPT achieves the highest AUC of 0.635 (95% CI: 0.632-0.638) compared to baselines, with limited computational resources.
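The abstract's two-stage recipe (train one encoder while the other stays frozen, contrast temporally consistent augmentations of the same video in the spatial stage, then model relationships between whole-video embeddings in the temporal stage) can be sketched roughly as below. This is a minimal illustration under assumptions of my own: the tiny CNN backbone, the GRU aggregator, the InfoNCE-style losses, and all shapes are placeholders, not the authors' implementation.

```python
# Minimal sketch of a two-stage spatial-then-temporal pre-training loop.
# Assumed, illustrative components: small CNN frame encoder, GRU video encoder,
# InfoNCE-style contrastive losses. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Per-frame (spatial) encoder; a tiny CNN stands in for the real backbone."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, frames):  # frames: (B*T, 3, H, W)
        return self.net(frames)

class TemporalEncoder(nn.Module):
    """Aggregates a sequence of frame embeddings into one video embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, frame_emb):  # frame_emb: (B, T, dim)
        _, h = self.rnn(frame_emb)
        return h[-1]  # (B, dim)

def info_nce(z1, z2, tau=0.1):
    """Contrastive loss between two views of the same batch of videos."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau
    target = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, target)

def spatial_stage_step(frame_enc, optimizer, view1, view2):
    """Stage 1: train only the frame encoder on two temporally consistent
    augmentations of each video (same augmentation applied to every frame)."""
    B, T = view1.shape[:2]  # views: (B, T, 3, H, W)
    z1 = frame_enc(view1.flatten(0, 1)).view(B, T, -1).mean(dim=1)
    z2 = frame_enc(view2.flatten(0, 1)).view(B, T, -1).mean(dim=1)
    loss = info_nce(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def temporal_stage_step(frame_enc, temp_enc, optimizer, view1, view2):
    """Stage 2: freeze the frame encoder; train only the temporal encoder on
    relationships between whole-video embeddings (here, a contrastive proxy).
    `optimizer` should hold only temp_enc's parameters."""
    frame_enc.eval()
    B, T = view1.shape[:2]
    with torch.no_grad():  # frozen spatial encoder keeps memory low
        f1 = frame_enc(view1.flatten(0, 1)).view(B, T, -1)
        f2 = frame_enc(view2.flatten(0, 1)).view(B, T, -1)
    loss = info_nce(temp_enc(f1), temp_enc(f2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the frame encoder is frozen in the second stage, its features can be computed under torch.no_grad(), so activation memory scales with the small temporal encoder rather than with the full frame backbone over hundreds of frames.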
Related papers
- TiS-TSL: Image-Label Supervised Surgical Video Stereo Matching via Time-Switchable Teacher-Student Learning [43.682537904776616]
TiS-TSL is a time-switchable teacher-student learning framework for video stereo matching under minimal supervision. A unified model operates in three distinct modes: Image-Prediction (IP), Forward Video-Prediction (FVP), and Backward Video-Prediction (BVP).
arXiv Detail & Related papers (2025-11-10T08:01:26Z)
- Exploiting Temporal State Space Sharing for Video Semantic Segmentation [53.8810901249897]
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. We introduce a Temporal Video State Space Sharing architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool.
arXiv Detail & Related papers (2025-03-26T01:47:42Z)
- Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild [30.99346916414286]
We propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. Direct pretraining on imperfect pseudo annotations presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. We propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction.
arXiv Detail & Related papers (2024-12-01T13:49:21Z)
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [62.36887303063542]
This work addresses the challenge of streamed video depth estimation. We argue that sharing contextual information between frames or clips is pivotal in fostering temporal consistency. We propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion. We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
arXiv Detail & Related papers (2021-03-31T17:55:52Z)
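For the last entry above ("Learning by Aligning Videos in Time"), a generic flavor of alignment-as-pretext training can be written as a soft frame-to-frame matching term plus a simple temporal regularizer. The concrete terms below are assumptions for illustration, not that paper's actual loss formulation.

```python
# Illustrative (assumed) temporal-alignment pretext loss: softly match each frame of
# video A to frames of video B and regularize the matches to progress forward in time.
import torch
import torch.nn.functional as F

def soft_alignment_loss(emb_a, emb_b, tau=0.1, reg_weight=0.1):
    """emb_a: (Ta, D) and emb_b: (Tb, D) frame embeddings of two videos of the same process."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    sim = emb_a @ emb_b.t() / tau              # (Ta, Tb) cosine similarities
    attn = sim.softmax(dim=-1)                 # soft match of each A-frame over B-frames
    idx_b = torch.arange(emb_b.size(0), dtype=emb_a.dtype, device=emb_a.device)
    matched = attn @ idx_b                     # expected matched B-index per A-frame
    # Alignment term: as a simple proxy, expect matches to progress roughly linearly in time.
    target = torch.linspace(0, emb_b.size(0) - 1, emb_a.size(0), device=emb_a.device)
    align = F.mse_loss(matched, target)
    # Temporal regularization: penalize matches that jump backwards in time.
    jumps = matched[1:] - matched[:-1]
    reg = F.relu(-jumps).mean()
    return align + reg_weight * reg
```

In practice the two videos would be encoded with a shared network and the loss averaged over many video pairs.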