Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer
Using Patches
- URL: http://arxiv.org/abs/2207.00113v1
- Date: Thu, 30 Jun 2022 21:57:33 GMT
- Title: Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer
Using Patches
- Authors: Mengya Xu and Mobarakol Islam and Hongliang Ren
- Abstract summary: Surgical captioning plays an important role in surgical instruction prediction and report generation.
Most captioning models still rely on computationally heavy object detectors or feature extractors to extract regional features.
We design an end-to-end, detector- and feature-extractor-free captioning model using the patch-based shifted-window technique.
- Score: 20.020356453279685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Surgical captioning plays an important role in surgical instruction
prediction and report generation. However, the majority of captioning models
still rely on computationally heavy object detectors or feature extractors to
extract regional features. In addition, detection models require bounding-box
annotations, which are costly to obtain and need skilled annotators. These
factors add inference delay and limit deployment of captioning models in
real-time robotic surgery. To address this, we design an end-to-end, detector-
and feature-extractor-free captioning model built on the patch-based
shifted-window technique. We propose the Shifted Window-Based Multi-Layer
Perceptrons Transformer Captioning model (SwinMLP-TranCAP), which offers
faster inference and less computation. SwinMLP-TranCAP replaces the multi-head
attention module with a window-based multi-head MLP. Such window-based designs
have so far focused mainly on image understanding tasks, and very few works
investigate the caption generation task. SwinMLP-TranCAP is also extended into
a video version for video captioning tasks using 3D patches and windows.
Compared with previous detector-based or feature-extractor-based models, our
models greatly simplify the architecture design while maintaining performance
on two surgical datasets.
The code is publicly available at
https://github.com/XuMengyaAmy/SwinMLP_TranCAP.
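To make the core mechanism concrete, below is a minimal PyTorch sketch of a window-based MLP block of the kind the abstract describes: patch embeddings are grouped into local windows, and a token-mixing MLP inside each window stands in for multi-head attention. The names used here (WindowMLPBlock, window_size, mlp_ratio) are illustrative assumptions, not the released implementation; the cyclic window shift applied between alternating blocks and the 3D windows of the video variant are omitted for brevity.

```python
import torch
import torch.nn as nn


def window_partition(x, window_size):
    """Split a (B, H, W, C) grid of patch embeddings into (nW*B, ws*ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)


def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition: merge windows back into a (B, H, W, C) grid."""
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.reshape(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class WindowMLPBlock(nn.Module):
    """One block: a token-mixing MLP inside each local window (the attention
    replacement) followed by a per-patch channel MLP, both with residuals."""

    def __init__(self, dim, window_size=7, mlp_ratio=4.0):
        super().__init__()
        self.window_size = window_size
        tokens = window_size * window_size
        self.norm1 = nn.LayerNorm(dim)
        # Mixes the ws*ws patch positions inside a window along the token axis.
        self.token_mlp = nn.Sequential(
            nn.Linear(tokens, tokens), nn.GELU(), nn.Linear(tokens, tokens)
        )
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        # x: (B, H, W, C) patch embeddings, with H and W divisible by window_size.
        B, H, W, C = x.shape
        shortcut = x
        w = window_partition(self.norm1(x), self.window_size)   # (nW*B, ws*ws, C)
        w = self.token_mlp(w.transpose(1, 2)).transpose(1, 2)   # mix tokens per window
        x = shortcut + window_reverse(w, self.window_size, H, W)
        return x + self.channel_mlp(self.norm2(x))


# Toy usage: a 56x56 grid of 96-dim patch embeddings split into 7x7 windows.
feats = torch.randn(2, 56, 56, 96)
print(WindowMLPBlock(dim=96)(feats).shape)  # torch.Size([2, 56, 56, 96])
```

A shifted variant can be approximated by applying torch.roll to the patch grid before partitioning and rolling back afterwards, and a video version would use 3D (T, H, W) patches and windows in the same way.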
Related papers
- PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures [5.513631883813244]
We propose a framework that Pre-Integrates Prompt information into the visual encoding process using existing modules of MLLMs.
Our model maintains excellent generation even when half of the visual tokens are reduced.
arXiv Detail & Related papers (2024-10-30T15:05:17Z)
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on static images.
We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
- Sliding Window FastEdit: A Framework for Lesion Annotation in Whole-body PET Images [24.7560446107659]
Deep learning has revolutionized the accurate segmentation of diseases in medical imaging, but it typically depends on dense voxelwise annotations.
This requirement presents a challenge for whole-body Positron Emission Tomography (PET) imaging, where lesions are scattered throughout the body.
We introduce SW-FastEdit - an interactive segmentation framework that accelerates the labeling by utilizing only a few user clicks instead of voxelwise annotations.
Our model outperforms existing non-sliding window interactive models on the AutoPET dataset and generalizes to the previously unseen HECKTOR dataset.
arXiv Detail & Related papers (2023-11-24T13:45:58Z)
- TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a degradation-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
- Learning CLIP Guided Visual-Text Fusion Transformer for Video-based Pedestrian Attribute Recognition [23.748227536306295]
We propose to understand human attributes using video frames that can make full use of temporal information.
We formulate video-based PAR as a vision-language fusion problem and adopt the pre-trained big model CLIP to extract the feature embeddings of the given video frames.
arXiv Detail & Related papers (2023-04-20T05:18:28Z)
- STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition [50.064502884594376]
We study the problem of human action recognition using motion capture (MoCap) sequences.
We propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences.
The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.
arXiv Detail & Related papers (2023-03-31T16:19:27Z)
- Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z)
- VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
The recent transformer-based image recognition model ViT also shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)