Activity Detection in Long Surgical Videos using Spatio-Temporal Models
- URL: http://arxiv.org/abs/2205.02805v1
- Date: Thu, 5 May 2022 17:34:33 GMT
- Title: Activity Detection in Long Surgical Videos using Spatio-Temporal Models
- Authors: Aidean Sharghi, Zooey He, Omid Mohareri
- Abstract summary: In this paper, we investigate both state-of-the-art activity recognition models and temporal models.
We benchmark these models on a large-scale activity recognition dataset in the operating room with over 800 full-length surgical videos.
We show that even in the case of limited labeled data, we can outperform the existing work by benefiting from models pre-trained on other tasks.
- Score: 1.2400116527089995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic activity detection is an important component for developing
technologies that enable next generation surgical devices and workflow
monitoring systems. In many applications, the videos of interest are long and
include several activities; hence, the deep models designed for such purposes
consist of a backbone and a temporal sequence modeling architecture. In this
paper, we investigate both state-of-the-art activity recognition models and
temporal models to find the architectures that yield the highest performance.
We first benchmark these models on a large-scale activity recognition dataset
in the operating room with over 800 full-length surgical videos. However, since
most other medical applications lack such a large dataset, we further evaluate
our models on the Cholec80 surgical phase segmentation dataset, consisting of
only 40 training videos. For backbone architectures, we investigate both 3D
ConvNets and the most recent transformer-based models; for temporal modeling, we
include temporal ConvNets, RNNs, and transformer models for a comprehensive and
thorough study. We show that even in the case of limited labeled data, we can
outperform the existing work by benefiting from models pre-trained on other
tasks.
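To make the backbone-plus-temporal-model design concrete, here is a minimal sketch in PyTorch. It assumes a 3D ConvNet backbone (torchvision's r3d_18 stands in for the backbones studied) and a transformer encoder as the temporal model; the class count and dimensions are illustrative, not the paper's configuration.

```python
# Sketch: backbone + temporal sequence model for long-video activity detection.
# Assumptions (not from the paper): r3d_18 backbone, transformer temporal model.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class ClipBackbone(nn.Module):
    """3D ConvNet that maps a short clip to a single feature vector."""
    def __init__(self):
        super().__init__()
        net = r3d_18(weights=None)  # load pretrained weights in practice
        self.features = nn.Sequential(*list(net.children())[:-1])  # drop classifier
        self.out_dim = net.fc.in_features  # 512 for r3d_18

    def forward(self, clips):  # clips: (B, T, 3, F, H, W), T clips of F frames
        b, t = clips.shape[:2]
        x = self.features(clips.flatten(0, 1))  # (B*T, 512, 1, 1, 1)
        return x.flatten(1).view(b, t, -1)      # (B, T, 512)

class ActivityDetector(nn.Module):
    """Per-clip features from the backbone, a transformer over time, per-clip logits."""
    def __init__(self, num_classes=10, heads=8, layers=2):
        super().__init__()
        self.backbone = ClipBackbone()
        layer = nn.TransformerEncoderLayer(
            d_model=self.backbone.out_dim, nhead=heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(self.backbone.out_dim, num_classes)

    def forward(self, clips):
        feats = self.backbone(clips)             # (B, T, 512)
        return self.head(self.temporal(feats))   # (B, T, num_classes)

logits = ActivityDetector()(torch.randn(1, 4, 3, 8, 112, 112))
print(logits.shape)  # torch.Size([1, 4, 10]), one prediction per clip
```

Swapping the `temporal` module for an RNN or a temporal ConvNet changes only one component, which is exactly the kind of comparison the paper carries out.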
Related papers
- Scaling Wearable Foundation Models [54.93979158708164]
We investigate the scaling properties of sensor foundation models across compute, data, and model size.
Using a dataset of up to 40 million hours of in-situ heart rate, heart rate variability, electrodermal activity, accelerometer, skin temperature, and altimeter per-minute data from over 165,000 people, we create LSM.
Our results establish the scaling laws of LSM for tasks such as imputation and extrapolation, both across time and across sensor modalities.
arXiv Detail & Related papers (2024-10-17T15:08:21Z)
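A quick illustration of the kind of scaling-law fit the summary above refers to: a power law fitted in log-log space. The numbers below are made up for the example, not results from the paper.

```python
# Sketch: fit loss ≈ a * N^(-b) to hypothetical (model size, loss) pairs.
import numpy as np

n = np.array([1e6, 3e6, 1e7, 3e7, 1e8])        # model sizes (illustrative)
loss = np.array([0.92, 0.81, 0.70, 0.62, 0.55])  # validation losses (illustrative)

# log(loss) = log(a) - b * log(N) is linear in log-log space.
slope, log_a = np.polyfit(np.log(n), np.log(loss), 1)
a, b = np.exp(log_a), -slope
print(f"loss ≈ {a:.2f} * N^(-{b:.3f})")
print(a * 1e9 ** (-b))  # extrapolate the curve to a larger model
```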
- Low-resource finetuning of foundation models beats state-of-the-art in histopathology [3.4577420145036375]
We benchmark the most popular vision foundation models as feature extractors for histopathology data.
By finetuning a foundation model on a single GPU for only two hours or three days depending on the dataset, we can match or outperform state-of-the-art feature extractors.
This is a considerable shift from the current state, where only a few institutions with large amounts of resources and data are able to train a feature extractor.
arXiv Detail & Related papers (2024-01-09T18:46:59Z)
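A common low-resource recipe along the lines of the summary above is a linear probe: freeze a pretrained foundation model and train only a small head. The sketch below uses torchvision's vit_b_16 as a stand-in; the authors' exact finetuning setup may differ.

```python
# Sketch: linear probe on a frozen vision backbone (illustrative, not the paper's recipe).
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

model = vit_b_16(weights=None)        # load pretrained weights in practice
for p in model.parameters():          # freeze the backbone
    p.requires_grad = False
# Replace the classifier head; 5 classes is an illustrative tissue-type count.
model.heads.head = nn.Linear(model.heads.head.in_features, 5)

trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-3)   # only the new head is updated
print(sum(p.numel() for p in trainable))      # a few thousand params vs ~86M frozen
```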
- On the Relevance of Temporal Features for Medical Ultrasound Video Recognition [0.0]
We propose a novel multi-head attention architecture to achieve better sample efficiency on common ultrasound tasks.
We compare the performance of our architecture to an efficient 3D CNN video recognition model in two settings.
These results suggest that expressive time-independent models may be more effective than state-of-the-art video recognition models for some common ultrasound tasks in the low-data regime.
arXiv Detail & Related papers (2023-10-16T14:35:29Z)
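A minimal example of the time-independent idea from the summary above: score each frame independently with a 2D CNN and average, with no temporal model at all. This is an illustration of the baseline concept, not the paper's multi-head attention architecture.

```python
# Sketch: time-independent video classifier (per-frame 2D CNN, mean over frames).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FramewiseClassifier(nn.Module):
    def __init__(self, num_classes=4):  # illustrative class count
        super().__init__()
        net = resnet18(weights=None)    # use pretrained weights in practice
        net.fc = nn.Linear(net.fc.in_features, num_classes)
        self.net = net

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        logits = self.net(frames.flatten(0, 1))   # score frames independently
        return logits.view(b, t, -1).mean(dim=1)  # (B, num_classes)

print(FramewiseClassifier()(torch.randn(2, 6, 3, 112, 112)).shape)
```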
- Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook [95.32949323258251]
Temporal data, notably time series and spatio-temporal data, are prevalent in real-world applications.
Recent advances in large language and other foundational models have spurred increased use in time series and spatio-temporal data mining.
arXiv Detail & Related papers (2023-10-16T09:06:00Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model can achieve comparable performance while using far fewer trainable parameters, with high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
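The spatio-temporal tokens in the ViViT summary above can be sketched as a tubelet embedding: a strided 3D convolution that turns a video into a token sequence for a transformer. Dimensions here are illustrative, not ViViT's.

```python
# Sketch: ViViT-style tubelet tokenization via a strided 3D convolution.
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    def __init__(self, dim=192, tubelet=(2, 16, 16)):
        super().__init__()
        # Non-overlapping 3D patches: kernel size == stride == tubelet size.
        self.proj = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                # video: (B, 3, T, H, W)
        x = self.proj(video)                 # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', dim) tokens

tokens = TubeletEmbed()(torch.randn(1, 3, 8, 64, 64))
print(tokens.shape)  # torch.Size([1, 64, 192]), i.e. 4*4*4 tokens
```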
- Automatic Operating Room Surgical Activity Recognition for Robot-Assisted Surgery [1.1033115844630357]
We investigate automatic surgical activity recognition in robot-assisted operations.
We collect the first large-scale dataset including 400 full-length multi-perspective videos.
We densely annotate the videos with the 10 most recognized and clinically relevant classes of activities.
arXiv Detail & Related papers (2020-06-29T16:30:31Z)
- A Neuromorphic Proto-Object Based Dynamic Visual Saliency Model with an FPGA Implementation [1.2387676601792899]
We present a neuromorphic, bottom-up, dynamic visual saliency model based on the notion of proto-objects.
This model outperforms state-of-the-art dynamic visual saliency models in predicting human eye fixations on a commonly used video dataset.
We introduce a Field-Programmable Gate Array implementation of the model on an Opal Kelly 7350 Kintex-7 board.
arXiv Detail & Related papers (2020-02-27T03:31:56Z)
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
- A Comprehensive Study on Temporal Modeling for Online Action Detection [50.558313106389335]
Online action detection (OAD) is a practical yet challenging task, which has attracted increasing attention in recent years.
This paper aims to provide a comprehensive study on temporal modeling for OAD including four meta types of temporal modeling methods.
We present several hybrid temporal modeling methods, which outperform the recent state-of-the-art methods with sizable margins on THUMOS-14 and TVSeries.
arXiv Detail & Related papers (2020-01-21T13:12:58Z)
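As an example of one temporal-modeling family such studies compare, here is a causal temporal ConvNet: dilated 1D convolutions with left-only padding, so each time step sees only the past, as online detection requires. Sizes are illustrative, not any paper's configuration.

```python
# Sketch: causal temporal ConvNet over precomputed per-frame features.
import torch
import torch.nn as nn

class CausalTCN(nn.Module):
    def __init__(self, dim=64, num_classes=10, layers=4):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i in range(layers):
            d = 2 ** i                              # dilation doubles per layer
            self.blocks.append(nn.Sequential(
                nn.ConstantPad1d((2 * d, 0), 0.0),  # pad on the left only
                nn.Conv1d(dim, dim, kernel_size=3, dilation=d),
                nn.ReLU(),
            ))
        self.head = nn.Conv1d(dim, num_classes, kernel_size=1)

    def forward(self, feats):             # feats: (B, dim, T) per-frame features
        for block in self.blocks:
            feats = feats + block(feats)  # residual connection
        return self.head(feats)           # (B, num_classes, T) online scores

print(CausalTCN()(torch.randn(1, 64, 100)).shape)  # torch.Size([1, 10, 100])
```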
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.