VideoSum: A Python Library for Surgical Video Summarization
- URL: http://arxiv.org/abs/2303.10173v2
- Date: Fri, 14 Jul 2023 16:49:42 GMT
- Title: VideoSum: A Python Library for Surgical Video Summarization
- Authors: Luis C. Garcia-Peraza-Herrera, Sebastien Ourselin and Tom Vercauteren
- Abstract summary: We propose to summarize surgical videos into storyboards or collages of representative frames to ease visualization, annotation, and processing.
We present videosum, an easy-to-use and open-source Python library to generate storyboards from surgical videos.
- Score: 3.928145224623878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of deep learning (DL) algorithms is heavily influenced by the quantity and the quality of the annotated data. However, in Surgical Data Science (SDS), access to annotated data is limited. It is thus unsurprising that substantial research efforts are made to develop methods aiming at mitigating the scarcity of annotated SDS data. In parallel, an increasing number of Computer Assisted Interventions (CAI) datasets are being released, although the scale of these remains limited. On these premises, data curation is becoming a key element of many SDS research endeavors. Surgical video datasets are demanding to curate and would benefit from dedicated support tools. In this work, we propose to summarize surgical videos into storyboards or collages of representative frames to ease visualization, annotation, and processing. Video summarization is well-established for natural images; however, state-of-the-art methods typically rely on models trained on human-made annotations, few methods have been evaluated on surgical videos, and the availability of software packages for the task is limited. We present videosum, an easy-to-use and open-source Python library that provides a variety of unsupervised methods to generate storyboards from surgical videos.
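To make the storyboard idea above concrete, below is a minimal sketch of unsupervised key-frame selection and collage assembly. It is not the videosum API: the `storyboard` helper, the colour-histogram feature, the k-means clustering, and all parameter values are illustrative assumptions, used only to show how representative frames can be picked without human annotations and tiled into a single image.

```python
# Minimal sketch (not the videosum API): subsample frames, describe each with a
# colour histogram, cluster the descriptors with k-means, keep the frame closest
# to each centroid, and tile the selected frames into a collage grid.
import cv2
import numpy as np
from sklearn.cluster import KMeans


def storyboard(video_path: str, n_frames: int = 16, stride: int = 30) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    frames, feats = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:  # subsample to keep memory bounded
            frames.append(frame)
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                                [0, 256] * 3).flatten()
            feats.append(hist / (hist.sum() + 1e-8))
        idx += 1
    cap.release()
    if not frames:
        raise ValueError(f"Could not read any frames from {video_path}")

    # Cluster frame descriptors and keep the frame nearest to each centroid.
    feats = np.array(feats)
    km = KMeans(n_clusters=min(n_frames, len(frames)), n_init=10).fit(feats)
    keyframes = [frames[int(np.argmin(np.linalg.norm(feats - c, axis=1)))]
                 for c in km.cluster_centers_]

    # Tile the key frames into a roughly square collage (the storyboard).
    cols = int(np.ceil(np.sqrt(len(keyframes))))
    rows = int(np.ceil(len(keyframes) / cols))
    h, w = keyframes[0].shape[:2]
    canvas = np.zeros((rows * h, cols * w, 3), dtype=np.uint8)
    for i, f in enumerate(keyframes):
        r, c = divmod(i, cols)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = f
    return canvas
```

A call such as `cv2.imwrite("storyboard.png", storyboard("procedure.mp4"))` would then write the collage to disk; the actual library exposes its own interface and, as the abstract notes, several alternative unsupervised frame-selection methods.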
Related papers
- LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning [15.646322352232819]
We create a new dataset, Surg-QA, consisting of 102,000 surgical video-instruction pairs.
We propose a novel two-stage question-answer generation pipeline with an LLM to learn surgical knowledge.
We train LLaVA-Surg, a novel vision-language conversational assistant capable of answering open-ended questions about surgical videos.
arXiv Detail & Related papers (2024-08-15T07:00:20Z) - Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos, each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z) - Correlation-aware active learning for surgery video segmentation [13.327429312047396]
This work proposes a novel AL strategy for surgery video segmentation, COWAL, COrrelation-aWare Active Learning.
Our approach involves projecting images into a latent space that has been fine-tuned using contrastive learning and then selecting a fixed number of representative images from local clusters of video frames.
We demonstrate the effectiveness of this approach on two video datasets of surgical instruments and three real-world video datasets.
arXiv Detail & Related papers (2023-11-15T09:30:52Z) - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z) - Just a Glimpse: Rethinking Temporal Information for Video Continual Learning [58.7097258722291]
We propose a novel replay mechanism for effective video continual learning based on individual/single frames.
Under extreme memory constraints, video diversity plays a more significant role than temporal information.
Our method achieves state-of-the-art performance, outperforming the previous state-of-the-art by up to 21.49%.
arXiv Detail & Related papers (2023-05-28T19:14:25Z) - SurgMAE: Masked Autoencoders for Long Surgical Video Analysis [4.866110274299399]
Masked autoencoders (MAE) have attracted attention in the self-supervised learning paradigm for Vision Transformers (ViTs).
In this paper, we first investigate whether MAE can learn transferable representations in the surgical video domain.
We propose SurgMAE, a novel architecture with a masking strategy for MAE based on sampling tokens carrying high temporal information.
arXiv Detail & Related papers (2023-05-19T06:12:50Z) - AutoLaparo: A New Dataset of Integrated Multi-tasks for Image-guided Surgical Automation in Laparoscopic Hysterectomy [42.20922574566824]
We present and release the first integrated dataset with multiple image-based perception tasks to facilitate learning-based automation in hysterectomy surgery.
Our AutoLaparo dataset is developed based on full-length videos of entire hysterectomy procedures.
Specifically, three different yet highly correlated tasks are formulated in the dataset, including surgical workflow recognition, laparoscope motion prediction, and instrument and key anatomy segmentation.
arXiv Detail & Related papers (2022-08-03T13:17:23Z) - Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
A canonical approach to video-and-language learning dictates that a neural model learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z) - Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online multi-modal graph network approach (MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z) - LRTD: Long-Range Temporal Dependency based Active Learning for Surgical Workflow Recognition [67.86810761677403]
We propose a novel active learning method for cost-effective surgical video analysis.
Specifically, we propose a non-local recurrent convolutional network (NL-RCNet), which introduces a non-local block to capture long-range temporal dependency.
We validate our approach on a large surgical video dataset (Cholec80) by performing the surgical workflow recognition task.
arXiv Detail & Related papers (2020-04-21T09:21:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.