Motion2Vec: Semi-Supervised Representation Learning from Surgical Videos
- URL: http://arxiv.org/abs/2006.00545v1
- Date: Sun, 31 May 2020 15:46:01 GMT
- Title: Motion2Vec: Semi-Supervised Representation Learning from Surgical Videos
- Authors: Ajay Kumar Tanwani, Pierre Sermanet, Andy Yan, Raghav Anand, Mariano
Phielipp, Ken Goldberg
- Abstract summary: We learn a motion-centric representation of surgical video demonstrations by grouping them into action segments/sub-goals/options.
We present Motion2Vec, an algorithm that learns a deep embedding feature space from video observations.
We demonstrate the use of this representation to imitate surgical suturing motions from publicly available videos of the JIGSAWS dataset.
- Score: 23.153335327822685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning meaningful visual representations in an embedding space can
facilitate generalization in downstream tasks such as action segmentation and
imitation. In this paper, we learn a motion-centric representation of surgical
video demonstrations by grouping them into action segments/sub-goals/options in
a semi-supervised manner. We present Motion2Vec, an algorithm that learns a
deep embedding feature space from video observations by minimizing a metric
learning loss in a Siamese network: images from the same action segment are
pulled together and pushed away from randomly sampled images of other
segments, while respecting the temporal ordering of the images. The embeddings
are iteratively segmented with a recurrent neural network for a given
parametrization of the embedding space after pre-training the Siamese network.
We only use a small set of labeled video segments to semantically align the
embedding space and assign pseudo-labels to the remaining unlabeled data by
inference on the learned model parameters. We demonstrate the use of this
representation to imitate surgical suturing motions from publicly available
videos of the JIGSAWS dataset. Results give 85.5% segmentation accuracy on
average, suggesting a performance improvement over several state-of-the-art
baselines, while kinematic pose imitation gives a 0.94 centimeter position
error per observation on the test set. Videos, code and data are available
at https://sites.google.com/view/motion2vec
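The core training signal described above is a metric learning (triplet-style) loss on a Siamese network. The following is a minimal, hypothetical PyTorch sketch of such an objective; `EmbeddingNet`, `segment_triplet_loss`, the 64x64 input size, and the margin value are illustrative assumptions rather than the authors' released implementation, and the paper's actual sampling additionally respects temporal ordering within a video.

```python
# Minimal sketch (assumption: PyTorch; not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Toy stand-in for one Siamese branch mapping an image to an embedding."""
    def __init__(self, dim=32):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.fc = nn.Linear(8 * 32 * 32, dim)  # assumes 64x64 RGB inputs

    def forward(self, x):
        h = F.relu(self.conv(x))
        return F.normalize(self.fc(h.flatten(1)), dim=1)  # unit-norm embedding

def segment_triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull frames from the same action segment together and push away
    randomly sampled frames from other segments (margin is a guess)."""
    d_pos = F.pairwise_distance(anchor, positive)  # same-segment distance
    d_neg = F.pairwise_distance(anchor, negative)  # other-segment distance
    return F.relu(d_pos - d_neg + margin).mean()

# Usage with random tensors standing in for video frames.
net = EmbeddingNet()
anchor_imgs = torch.randn(4, 3, 64, 64)
same_segment = torch.randn(4, 3, 64, 64)
other_segment = torch.randn(4, 3, 64, 64)
loss = segment_triplet_loss(net(anchor_imgs), net(same_segment), net(other_segment))
```

In the paper, the learned embeddings are then segmented with a recurrent network and pseudo-labels are propagated to the unlabeled frames; that iterative loop is omitted from this sketch.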
Related papers
- Guess What Moves: Unsupervised Video and Image Segmentation by
Anticipating Motion [92.80981308407098]
We propose an approach that combines the strengths of motion-based and appearance-based segmentation.
We propose to supervise an image segmentation network, tasking it with predicting regions that are likely to contain simple motion patterns.
In the unsupervised video segmentation mode, the network is trained on a collection of unlabelled videos, using the learning process itself as an algorithm to segment these videos.
arXiv Detail & Related papers (2022-05-16T17:55:34Z)
- Min-Max Similarity: A Contrastive Learning Based Semi-Supervised Learning Network for Surgical Tools Segmentation [0.0]
We propose a semi-supervised segmentation network based on contrastive learning.
In contrast to the previous state-of-the-art, we introduce a contrastive learning form of dual-view training.
Our proposed method outperforms state-of-the-art semi-supervised and fully supervised segmentation algorithms consistently.
arXiv Detail & Related papers (2022-03-29T01:40:26Z)
- The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos [59.12750806239545]
We show that a video has different views of the same scene related by moving components, and the right region segmentation and region flow would allow mutual view synthesis.
Our model starts with two separate pathways: an appearance pathway that outputs feature-based region segmentation for a single image, and a motion pathway that outputs motion features for a pair of images.
By training the model to minimize view synthesis errors based on segment flow, our appearance and motion pathways learn region segmentation and flow estimation automatically without building them up from low-level edges or optical flows respectively.
arXiv Detail & Related papers (2021-11-11T18:59:11Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances, learning visual representations by discriminating instances from each other (see the contrastive-loss sketch after this list).
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency [28.352140544936198]
Weakly supervised instance segmentation reduces the cost of annotations required to train models.
We show that these issues can be better addressed by training with weakly labeled videos instead of images.
We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation.
arXiv Detail & Related papers (2021-03-23T23:20:46Z)
- ASIST: Annotation-free synthetic instance segmentation and tracking for microscope video analysis [8.212196747588361]
We propose a novel annotation-free synthetic instance segmentation and tracking (ASIST) algorithm for analyzing microscope videos of sub-cellular microvilli.
From the experimental results, the proposed annotation-free method achieved superior performance compared with supervised learning.
arXiv Detail & Related papers (2020-11-02T14:39:26Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping [72.84991726271024]
We describe an unsupervised method to detect and segment portions of images of live scenes that are seen moving as a coherent whole.
Our method first partitions the motion field by minimizing the mutual information between segments.
It uses the segments to learn object models that can be used for detection in a static image.
arXiv Detail & Related papers (2020-08-16T22:05:13Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
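Several of the entries above (e.g., Min-Max Similarity and ASCNet) build on a clip-level contrastive objective in which two views or clips of the same video form a positive pair and other videos in the batch serve as negatives. Below is a minimal, hypothetical InfoNCE-style sketch of that pattern; the function name, temperature, and embedding sizes are illustrative assumptions and the code is not taken from any of the listed papers.

```python
# Minimal sketch (assumption: PyTorch; illustrative only).
import torch
import torch.nn.functional as F

def clip_infonce_loss(z1, z2, temperature=0.1):
    """Contrastive loss over clip embeddings: row i of z1 and z2 are two
    views/clips of the same video (positive pair); other rows are negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Usage with random embeddings standing in for clip-encoder outputs.
loss = clip_infonce_loss(torch.randn(8, 128), torch.randn(8, 128))
```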