Motion2Vec: Semi-Supervised Representation Learning from Surgical Videos
- URL: http://arxiv.org/abs/2006.00545v1
- Date: Sun, 31 May 2020 15:46:01 GMT
- Title: Motion2Vec: Semi-Supervised Representation Learning from Surgical Videos
- Authors: Ajay Kumar Tanwani, Pierre Sermanet, Andy Yan, Raghav Anand, Mariano
Phielipp, Ken Goldberg
- Abstract summary: We learn a motion-centric representation of surgical video demonstrations by grouping them into action segments/sub-goals/options.
We present Motion2Vec, an algorithm that learns a deep embedding feature space from video observations.
We demonstrate the use of this representation to imitate surgical suturing motions from publicly available videos of the JIGSAWS dataset.
- Score: 23.153335327822685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning meaningful visual representations in an embedding space can
facilitate generalization in downstream tasks such as action segmentation and
imitation. In this paper, we learn a motion-centric representation of surgical
video demonstrations by grouping them into action segments/sub-goals/options in
a semi-supervised manner. We present Motion2Vec, an algorithm that learns a
deep embedding feature space from video observations by minimizing a metric
learning loss in a Siamese network: images from the same action segment are
pulled together and pushed away from randomly sampled images of other
segments, while respecting the temporal ordering of the images. The embeddings
are iteratively segmented with a recurrent neural network for a given
parametrization of the embedding space after pre-training the Siamese network.
We only use a small set of labeled video segments to semantically align the
embedding space and assign pseudo-labels to the remaining unlabeled data by
inference on the learned model parameters. We demonstrate the use of this
representation to imitate surgical suturing motions from publicly available
videos of the JIGSAWS dataset. Results give 85.5% segmentation accuracy on
average, suggesting a performance improvement over several state-of-the-art
baselines, while kinematic pose imitation gives a 0.94 centimeter position
error per observation on the test set. Videos, code and data are available
at https://sites.google.com/view/motion2vec
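The core training signal described above is a metric learning (triplet-style) loss on a Siamese network. The following is a minimal, hypothetical PyTorch sketch of such an objective; `EmbeddingNet`, `segment_triplet_loss`, the 64x64 input size, and the margin value are illustrative assumptions rather than the authors' released implementation, and the paper's actual sampling additionally respects temporal ordering within a video.

```python
# Minimal sketch (assumption: PyTorch; not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Toy stand-in for one Siamese branch mapping an image to an embedding."""
    def __init__(self, dim=32):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.fc = nn.Linear(8 * 32 * 32, dim)  # assumes 64x64 RGB inputs

    def forward(self, x):
        h = F.relu(self.conv(x))
        return F.normalize(self.fc(h.flatten(1)), dim=1)  # unit-norm embedding

def segment_triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull frames from the same action segment together and push away
    randomly sampled frames from other segments (margin is a guess)."""
    d_pos = F.pairwise_distance(anchor, positive)  # same-segment distance
    d_neg = F.pairwise_distance(anchor, negative)  # other-segment distance
    return F.relu(d_pos - d_neg + margin).mean()

# Usage with random tensors standing in for video frames.
net = EmbeddingNet()
anchor_imgs = torch.randn(4, 3, 64, 64)
same_segment = torch.randn(4, 3, 64, 64)
other_segment = torch.randn(4, 3, 64, 64)
loss = segment_triplet_loss(net(anchor_imgs), net(same_segment), net(other_segment))
```

In the paper, the learned embeddings are then segmented with a recurrent network and pseudo-labels are propagated to the unlabeled frames; that iterative loop is omitted from this sketch.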
Related papers
- Guess What Moves: Unsupervised Video and Image Segmentation by
Anticipating Motion [92.80981308407098]
We propose an approach that combines the strengths of motion-based and appearance-based segmentation.
We propose to supervise an image segmentation network, tasking it with predicting regions that are likely to contain simple motion patterns.
In the unsupervised video segmentation mode, the network is trained on a collection of unlabelled videos, using the learning process itself as an algorithm to segment these videos.
arXiv Detail & Related papers (2022-05-16T17:55:34Z)
- Min-Max Similarity: A Contrastive Learning Based Semi-Supervised Learning Network for Surgical Tools Segmentation [0.0]
We propose a semi-supervised segmentation network based on contrastive learning.
In contrast to the previous state-of-the-art, we introduce a contrastive learning form of dual-view training.
Our proposed method outperforms state-of-the-art semi-supervised and fully supervised segmentation algorithms consistently.
arXiv Detail & Related papers (2022-03-29T01:40:26Z)
- The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos [59.12750806239545]
We show that a video has different views of the same scene related by moving components, and the right region segmentation and region flow would allow mutual view synthesis.
Our model starts with two separate pathways: an appearance pathway that outputs feature-based region segmentation for a single image, and a motion pathway that outputs motion features for a pair of images.
By training the model to minimize view synthesis errors based on segment flow, our appearance and motion pathways learn region segmentation and flow estimation automatically without building them up from low-level edges or optical flows respectively.
arXiv Detail & Related papers (2021-11-11T18:59:11Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances, learning visual representations by discriminating instances from each other (see the contrastive-loss sketch after this list).
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency [28.352140544936198]
Weakly supervised instance segmentation reduces the cost of annotations required to train models.
We show that these issues can be better addressed by training with weakly labeled videos instead of images.
We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation.
arXiv Detail & Related papers (2021-03-23T23:20:46Z)
- ASIST: Annotation-free synthetic instance segmentation and tracking for microscope video analysis [8.212196747588361]
We propose a novel annotation-free synthetic instance segmentation and tracking (ASIST) algorithm for analyzing microscope videos of sub-cellular microvilli.
From the experimental results, the proposed annotation-free method achieved superior performance compared with supervised learning.
arXiv Detail & Related papers (2020-11-02T14:39:26Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping [72.84991726271024]
We describe an unsupervised method to detect and segment portions of images of live scenes that are seen moving as a coherent whole.
Our method first partitions the motion field by minimizing the mutual information between segments.
It uses the segments to learn object models that can be used for detection in a static image.
arXiv Detail & Related papers (2020-08-16T22:05:13Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
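Several of the entries above (e.g., Min-Max Similarity and ASCNet) build on a clip-level contrastive objective in which two views or clips of the same video form a positive pair and other videos in the batch serve as negatives. Below is a minimal, hypothetical InfoNCE-style sketch of that pattern; the function name, temperature, and embedding sizes are illustrative assumptions and the code is not taken from any of the listed papers.

```python
# Minimal sketch (assumption: PyTorch; illustrative only).
import torch
import torch.nn.functional as F

def clip_infonce_loss(z1, z2, temperature=0.1):
    """Contrastive loss over clip embeddings: row i of z1 and z2 are two
    views/clips of the same video (positive pair); other rows are negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Usage with random embeddings standing in for clip-encoder outputs.
loss = clip_infonce_loss(torch.randn(8, 128), torch.randn(8, 128))
```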