Cell Behavior Video Classification Challenge, a benchmark for computer vision methods in time-lapse microscopy
- URL: http://arxiv.org/abs/2601.10250v1
- Date: Thu, 15 Jan 2026 10:14:16 GMT
- Title: Cell Behavior Video Classification Challenge, a benchmark for computer vision methods in time-lapse microscopy
- Authors: Raffaella Fiamma Cabini, Deborah Barkauskas, Guangyu Chen, Zhi-Qi Cheng, David E Cicchetti, Judith Drazba, Rodrigo Fernandez-Gonzalez, Raymond Hawkins, Yujia Hu, Jyoti Kini, Charles LeWarne, Xufeng Lin, Sai Preethi Nakkina, John W Peterson, Koert Schreurs, Ayushi Singh, Kumaran Bala Kandan Viswanathan, Inge MN Wortel, Sanjian Zhang, Rolf Krause, Santiago Fernandez Gonzalez, Diego Ulisse Pizzagalli
- Abstract summary: We present the Cell Behavior Video Classification Challenge (CBVCC), benchmarking 35 methods based on three approaches. We compare the potential and limitations of each approach, serving as a basis for the development of computer vision methods for studying cellular dynamics.
- Score: 11.497260442673989
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The classification of microscopy videos capturing complex cellular behaviors is crucial for understanding and quantifying the dynamics of biological processes over time. However, it remains a frontier in computer vision, requiring approaches that effectively model the shape and motion of objects without rigid boundaries, extract hierarchical spatiotemporal features from entire image sequences rather than static frames, and account for multiple objects within the field of view. To this end, we organized the Cell Behavior Video Classification Challenge (CBVCC), benchmarking 35 methods based on three approaches: classification of tracking-derived features, end-to-end deep learning architectures that learn spatiotemporal features directly from the entire video sequence without explicit cell tracking, and ensembles of tracking-derived and image-derived features. We discuss the results achieved by the participants and compare the potential and limitations of each approach, serving as a basis to foster the development of computer vision methods for studying cellular dynamics.
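As a rough illustration of the three approach families named in the abstract, the sketch below contrasts a tracking-feature pipeline, a tiny end-to-end 3D CNN, and a late-fusion ensemble. All names, shapes, and feature choices are illustrative assumptions, not the participants' code.

```python
# Illustrative sketch of the three CBVCC approach families; all shapes,
# names, and feature choices here are assumptions, not participants' code.
import numpy as np
import torch
import torch.nn as nn

# -- Approach 1: classify features derived from cell tracking -------------
def track_features(track: np.ndarray) -> np.ndarray:
    """Summarize one track of xy positions (T, 2) into motion descriptors."""
    steps = np.diff(track, axis=0)                 # per-frame displacements
    speed = np.linalg.norm(steps, axis=1)
    net = np.linalg.norm(track[-1] - track[0])     # net displacement
    path = speed.sum() + 1e-8                      # total path length
    return np.array([speed.mean(), speed.std(), net / path])  # last: confinement ratio

# -- Approach 2: end-to-end spatiotemporal network (no explicit tracking) --
class Tiny3DCNN(nn.Module):
    """Minimal 3D CNN that learns spatiotemporal features from raw video."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),               # global spatiotemporal pooling
        )
        self.head = nn.Linear(8, n_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels=1, frames, height, width)
        return self.head(self.features(video).flatten(1))

# -- Approach 3: ensemble tracking-derived and image-derived predictions --
def late_fusion(p_track: np.ndarray, p_video: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Weighted average of class-probability vectors from the two branches."""
    return w * p_track + (1.0 - w) * p_video

# Hypothetical usage on random data:
feats = track_features(np.cumsum(np.random.randn(50, 2), axis=0))
logits = Tiny3DCNN()(torch.randn(1, 1, 16, 64, 64))
```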
Related papers
- FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail. We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract…
arXiv Detail & Related papers (2025-09-28T17:59:43Z) - Segment Anything for Cell Tracking [2.0382881548515575]
We propose a zero-shot cell tracking framework for time-lapse microscopy images. As a fully unsupervised approach, our method does not depend on or inherit biases from any specific training dataset. Our approach achieves competitive accuracy in both 2D and large-scale 3D time-lapse microscopy videos.
arXiv Detail & Related papers (2025-09-12T03:19:35Z) - Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [89.77871049500546]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z) - Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z) - Deep Temporal Sequence Classification and Mathematical Modeling for Cell Tracking in Dense 3D Microscopy Videos of Bacterial Biofilms [18.563062576080704]
We introduce a novel cell tracking algorithm named DenseTrack.
DenseTrack integrates deep learning with mathematical model-based strategies to establish correspondences between consecutive frames.
We present an eigendecomposition-based cell division detection strategy.
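The summary above mentions establishing correspondences between consecutive frames; the following is a generic linking baseline via optimal assignment on centroid distance, shown only to make that idea concrete. It is not DenseTrack's actual algorithm.

```python
# Generic frame-to-frame linking by optimal assignment on centroid distance.
# A textbook baseline to illustrate "correspondences between consecutive
# frames"; it is not DenseTrack's actual algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_frames(prev_xy: np.ndarray, next_xy: np.ndarray) -> list:
    """Match cell centroids (N, 2) in frame t to centroids (M, 2) in frame t+1."""
    cost = np.linalg.norm(prev_xy[:, None, :] - next_xy[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)       # minimizes total distance
    return list(zip(rows.tolist(), cols.tolist()))
```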
arXiv Detail & Related papers (2024-06-27T23:26:57Z) - Grad-CAMO: Learning Interpretable Single-Cell Morphological Profiles from 3D Cell Painting Images [0.0]
We introduce Grad-CAMO, a novel single-cell interpretability score for supervised feature extractors.
Grad-CAMO measures the proportion of a model's attention that is concentrated on the cell of interest versus the background.
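A minimal sketch of the idea stated above: the fraction of a saliency map's total mass that falls on the cell of interest. The paper's exact formulation may differ; function and argument names here are illustrative.

```python
# Sketch of the stated idea: the fraction of a model's saliency mass that
# falls on the cell of interest. The paper's exact formulation may differ.
import numpy as np

def attention_on_cell(heatmap: np.ndarray, cell_mask: np.ndarray) -> float:
    """heatmap: non-negative saliency map (H, W); cell_mask: boolean (H, W)."""
    heatmap = np.clip(heatmap, 0.0, None)          # Grad-CAM maps are ReLU-ed
    total = heatmap.sum()
    if total == 0.0:
        return 0.0                                 # no attention anywhere
    return float(heatmap[cell_mask].sum() / total) # 1.0 = all attention on the cell
```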
arXiv Detail & Related papers (2024-03-26T11:48:37Z) - Enhancing Cell Tracking with a Time-Symmetric Deep Learning Approach [0.34089646689382486]
We develop a new deep-learning based tracking method that relies solely on the assumption that cells can be tracked based on their spatio-temporal neighborhood. The proposed method has the additional benefit that the motion patterns of the cells can be learned completely by the predictor without any prior assumptions.
arXiv Detail & Related papers (2023-08-04T15:57:28Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-size spatio-temporal kernels in 3D convolutional neural networks (3D CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle variations between classes of actions, by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
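To make the motion-cue premise concrete, here is a naive baseline that thresholds optical-flow magnitude. The paper itself uses a Transformer variant, so this is only an assumption-laden illustration of the general idea.

```python
# Naive motion-cue baseline: threshold optical-flow magnitude to split a
# moving foreground from the background. The paper uses a Transformer
# variant instead; this only illustrates the motion-cue premise.
import numpy as np

def motion_mask(flow: np.ndarray, thresh: float = 1.0) -> np.ndarray:
    """flow: dense optical flow (H, W, 2); returns a boolean foreground mask."""
    magnitude = np.linalg.norm(flow, axis=-1)      # per-pixel motion strength
    return magnitude > thresh
```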
arXiv Detail & Related papers (2021-04-15T17:59:32Z) - Towards Annotation-free Instance Segmentation and Tracking with Adversarial Simulations [5.434831972326107]
In computer vision, producing annotated training data with consistent segmentation and tracking is resource-intensive.
Adversarial simulations have provided successful solutions in computer vision for training real-world self-driving systems.
This paper proposes an annotation-free synthetic instance segmentation and tracking (ASIST) method with adversarial simulation and single-stage pixel-embedding based learning.
arXiv Detail & Related papers (2021-01-03T07:04:13Z) - Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online multi-modal graph network approach (MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z) - Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA, and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)