QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View
- URL: http://arxiv.org/abs/2407.13216v1
- Date: Thu, 18 Jul 2024 06:55:26 GMT
- Title: QuIIL at T3 challenge: Towards Automation in Life-Saving Intervention Procedures from First-Person View
- Authors: Trinh T. L. Vuong, Doanh C. Bui, Jin Tae Kwak
- Abstract summary: We present our solutions for a spectrum of automation tasks in life-saving intervention procedures within the Trauma THOMPSON (T3) Challenge.
For action recognition and anticipation, we propose a pre-processing strategy that samples and stitches multiple inputs into a single image.
For training, we present an action dictionary-guided design, which consistently yields the most favorable results.
- Score: 2.3982875575861677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present our solutions for a spectrum of automation tasks in life-saving intervention procedures within the Trauma THOMPSON (T3) Challenge, encompassing action recognition, action anticipation, and Visual Question Answering (VQA). For action recognition and anticipation, we propose a pre-processing strategy that samples and stitches multiple inputs into a single image and then incorporates momentum- and attention-based knowledge distillation to improve the performance of the two tasks. For training, we present an action dictionary-guided design, which consistently yields the most favorable results across our experiments. In the realm of VQA, we leverage object-level features and deploy co-attention networks to train both object and question features. Notably, we introduce a novel frame-question cross-attention mechanism at the network's core for enhanced performance. Our solutions achieve the $2^{nd}$ rank in action recognition and anticipation tasks and $1^{st}$ rank in the VQA task.
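The sampling-and-stitching pre-processing is only described at a high level in the abstract; a minimal sketch of the idea, assuming uniform temporal sampling and a square grid layout (neither is confirmed as the authors' exact configuration), could look like this:

```python
import numpy as np

def sample_and_stitch(frames, num_samples=9, tile_size=224):
    """Uniformly sample frames from a clip and stitch them into one grid image.

    frames: list/array of HxWx3 uint8 frames.
    Assumes num_samples is a perfect square (3x3 grid here) -- an illustrative choice.
    """
    idx = np.linspace(0, len(frames) - 1, num_samples).astype(int)
    grid_dim = int(np.sqrt(num_samples))
    canvas = np.zeros((grid_dim * tile_size, grid_dim * tile_size, 3), dtype=np.uint8)
    for k, i in enumerate(idx):
        r, c = divmod(k, grid_dim)
        frame = frames[i]
        # naive nearest-neighbour resize to the tile size
        ys = np.linspace(0, frame.shape[0] - 1, tile_size).astype(int)
        xs = np.linspace(0, frame.shape[1] - 1, tile_size).astype(int)
        tile = frame[ys][:, xs]
        canvas[r * tile_size:(r + 1) * tile_size,
               c * tile_size:(c + 1) * tile_size] = tile
    return canvas

# Example: stitch a dummy 30-frame clip into a single 3x3 composite image.
clip = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(30)]
composite = sample_and_stitch(clip)
print(composite.shape)  # (672, 672, 3)
```

Stitching the sampled frames into one composite lets a standard image backbone see temporal context in a single forward pass, which is presumably the motivation behind this pre-processing.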
Related papers
- About Time: Advances, Challenges, and Outlooks of Action Understanding [57.76390141287026]
This survey comprehensively reviews advances in uni- and multi-modal action understanding across a range of tasks.
We focus on prevalent challenges, overview widely adopted datasets, and survey seminal works with an emphasis on recent advances.
arXiv Detail & Related papers (2024-11-22T18:09:27Z) - Affective Behaviour Analysis via Progressive Learning [23.455163723584427]
We present our methods and experimental results for the two competition tracks.
We train a Masked Autoencoder in a self-supervised manner to attain high-quality facial features.
We utilize curriculum learning to transition the model from recognizing single expressions to recognizing compound expressions.
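Curriculum learning of this kind is commonly realized as a sampling schedule that gradually shifts batches from single to compound expressions; the sketch below illustrates that idea with an assumed linear ramp (the schedule and thresholds are not the authors').

```python
import random

def curriculum_batch(single_samples, compound_samples, epoch, warmup_epochs=10, batch_size=32):
    """Mix single- and compound-expression samples, shifting toward compound over time.

    The mixing ratio ramps linearly from 0 to 1 over `warmup_epochs`;
    this schedule is an illustrative assumption.
    """
    compound_ratio = min(1.0, epoch / warmup_epochs)
    n_compound = int(batch_size * compound_ratio)
    batch = random.sample(compound_samples, n_compound)
    batch += random.sample(single_samples, batch_size - n_compound)
    random.shuffle(batch)
    return batch

# Example: early epochs draw mostly single expressions, later epochs mostly compound.
single = list(range(1000))          # placeholder single-expression sample ids
compound = list(range(1000, 2000))  # placeholder compound-expression sample ids
print(sum(s >= 1000 for s in curriculum_batch(single, compound, epoch=2)))   # few compound
print(sum(s >= 1000 for s in curriculum_batch(single, compound, epoch=10)))  # all compound
```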
arXiv Detail & Related papers (2024-07-24T02:24:21Z) - Affective Behavior Analysis using Task-adaptive and AU-assisted Graph Network [18.304164382834617]
We present our solution and experimental results for the Multi-Task Learning Challenge of the 7th Affective Behavior Analysis in-the-wild (ABAW7) Competition.
This challenge consists of three tasks: action unit detection, facial expression recognition, and valence-arousal estimation.
arXiv Detail & Related papers (2024-07-16T12:33:22Z) - Devil's Advocate: Anticipatory Reflection for LLM Agents [53.897557605550325]
Our approach prompts LLM agents to decompose a given task into manageable subtasks.
We implement a three-fold introspective intervention:
Anticipatory reflection on potential failures and alternative remedies before action execution.
Post-action alignment with subtask objectives, and backtracking with a remedy to ensure utmost effort in plan execution.
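A minimal sketch of such an introspective agent loop, using a hypothetical `llm` prompt-to-text callable and a hypothetical `execute` action executor (neither is from the paper), might look like this:

```python
def run_with_anticipatory_reflection(task, llm, execute, max_retries=2):
    """Sketch of an introspective agent loop: decompose, reflect before acting,
    check alignment after acting, and backtrack with a remedy on failure.

    `llm` is a hypothetical prompt -> text callable and `execute` a hypothetical
    action executor returning an observation; neither comes from the paper.
    """
    subtasks = llm(f"Decompose the task into subtasks, one per line:\n{task}").splitlines()
    for subtask in subtasks:
        action = llm(f"Propose an action for subtask: {subtask}")
        # Anticipatory reflection: imagine failures and prepare a remedy up front.
        remedy = llm(f"Before executing '{action}', list likely failure modes "
                     f"and an alternative remedy action.")
        for _ in range(max_retries + 1):
            observation = execute(action)
            # Post-action alignment: does the outcome satisfy the subtask objective?
            verdict = llm(f"Subtask: {subtask}\nObservation: {observation}\n"
                          f"Answer ALIGNED or MISALIGNED.")
            if "MISALIGNED" not in verdict and "ALIGNED" in verdict:
                break
            # Backtrack and retry with the prepared remedy.
            action = llm(f"The action failed. Revise it using this remedy:\n{remedy}")
    return "done"
```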
arXiv Detail & Related papers (2024-05-25T19:20:15Z) - Introducing "Forecast Utterance" for Conversational Data Science [2.3894779000840503]
This paper introduces a new concept called Forecast Utterance.
We then focus on the automatic and accurate interpretation of users' prediction goals from these utterances.
Specifically, we frame the task as a slot-filling problem, where each slot corresponds to a specific aspect of the goal prediction task.
We then employ two zero-shot methods for solving the slot-filling task, namely: 1) Entity Extraction (EE), and 2) Question-Answering (QA) techniques.
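As an illustration of the QA route, each slot can be filled by posing one question per slot against the utterance; the slot set, questions, and confidence threshold below are assumptions, and `qa_model` is a hypothetical extractive-QA callable.

```python
def fill_slots_with_qa(forecast_utterance, qa_model):
    """Zero-shot slot filling by asking one question per slot against the utterance.

    `qa_model(question, context)` is a hypothetical extractive-QA callable
    returning (answer_span, confidence); the slot set below is illustrative.
    """
    slot_questions = {
        "target_variable": "What quantity does the user want to predict?",
        "horizon": "How far into the future should the prediction go?",
        "filter": "Which subset of the data is the prediction about?",
    }
    slots = {}
    for slot, question in slot_questions.items():
        answer, confidence = qa_model(question, forecast_utterance)
        slots[slot] = answer if confidence > 0.3 else None  # threshold is an assumption
    return slots

# Example with a trivial stand-in QA model.
dummy_qa = lambda q, ctx: (ctx.split()[-1], 0.9)
print(fill_slots_with_qa("Predict next month's sales for store 12", dummy_qa))
```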
arXiv Detail & Related papers (2023-09-07T17:41:41Z) - A Dynamic Feature Interaction Framework for Multi-task Visual Perception [100.98434079696268]
We devise an efficient unified framework to solve multiple common perception tasks.
These tasks include instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation.
Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception.
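The abstract does not detail the architecture; the sketch below only illustrates the general shared-backbone, per-task-head pattern that parameter-efficient multi-task models follow, and is not the actual D2BNet design.

```python
import torch
import torch.nn as nn

class SharedBackboneMultiTask(nn.Module):
    """Generic multi-task setup: one shared backbone, lightweight per-task heads.

    A schematic stand-in for illustration only, not the D2BNet architecture.
    """
    def __init__(self, feat_dim=256, num_classes=20):
        super().__init__()
        self.backbone = nn.Sequential(  # placeholder feature extractor
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.heads = nn.ModuleDict({
            "semantic_seg": nn.Conv2d(feat_dim, num_classes, 1),
            "depth": nn.Conv2d(feat_dim, 1, 1),
        })

    def forward(self, x):
        feats = self.backbone(x)                  # features shared across tasks
        return {name: head(feats) for name, head in self.heads.items()}

model = SharedBackboneMultiTask()
out = model(torch.randn(1, 3, 128, 128))
print({k: tuple(v.shape) for k, v in out.items()})
```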
arXiv Detail & Related papers (2023-06-08T09:24:46Z) - Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
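The mechanism is not spelled out in this summary; as a generic illustration of top-down attention (not the paper's variational AbS formulation), one can gate ViT tokens by their similarity to a top-down prior vector:

```python
import torch

def top_down_modulate(tokens, prior):
    """Scale ViT tokens by their similarity to a top-down prior vector.

    tokens: (batch, num_tokens, dim); prior: (batch, dim).
    This gating is a generic illustration of top-down modulation, not the
    paper's variational AbS formulation.
    """
    sim = torch.einsum("bnd,bd->bn", tokens, prior) / tokens.shape[-1] ** 0.5
    gate = torch.sigmoid(sim).unsqueeze(-1)   # (batch, num_tokens, 1)
    return tokens * gate                      # task-relevant tokens kept, others damped

tokens = torch.randn(2, 197, 768)  # e.g. ViT-B/16 token count for a 224x224 input
prior = torch.randn(2, 768)        # top-down signal, e.g. derived from a task prompt
print(top_down_modulate(tokens, prior).shape)  # torch.Size([2, 197, 768])
```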
arXiv Detail & Related papers (2023-03-23T05:17:05Z) - Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains similar or even slightly better results than human beings do on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
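As a rough illustration of cross-modal "learning to attend", question tokens can attend over detected-region features; the dimensions and module below are illustrative, not AliceMind-MMU's actual components.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Question tokens attend over image-region features (generic sketch)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, question_feats, region_feats):
        # query = question tokens, key/value = detected-object/region features
        fused, weights = self.attn(question_feats, region_feats, region_feats)
        return fused, weights

q = torch.randn(1, 14, 512)  # 14 question tokens (illustrative)
r = torch.randn(1, 36, 512)  # 36 region features, e.g. from an object detector
fused, attn = CrossModalAttention()(q, r)
print(fused.shape, attn.shape)  # torch.Size([1, 14, 512]) torch.Size([1, 14, 36])
```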
arXiv Detail & Related papers (2021-11-17T04:25:11Z) - Winning the ICCV'2021 VALUE Challenge: Task-aware Ensemble and Transfer Learning with Visual Concepts [20.412239939287886]
The VALUE (Video-And-Language Understanding Evaluation) benchmark is newly introduced to evaluate and analyze multi-modal representation learning algorithms.
The main objective of the VALUE challenge is to train a task-agnostic model that is simultaneously applicable for various tasks with different characteristics.
This technical report describes our winning strategies for the VALUE challenge: 1) single model optimization, 2) transfer learning with visual concepts, and 3) task-aware ensemble.
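A task-aware ensemble can be sketched as a per-task weighted average of member predictions, with the weights tuned separately for each task; the weighting scheme below is an assumption, not the report's exact method.

```python
import numpy as np

def task_aware_ensemble(task, model_probs, task_weights):
    """Weighted average of ensemble-member predictions, with weights chosen per task.

    model_probs: list of (num_samples, num_classes) probability arrays.
    task_weights: dict mapping task name -> per-model weights (e.g. tuned on
    validation); the weights here are illustrative assumptions.
    """
    w = np.asarray(task_weights[task], dtype=float)
    w = w / w.sum()
    stacked = np.stack(model_probs)          # (num_models, num_samples, num_classes)
    return np.tensordot(w, stacked, axes=1)  # (num_samples, num_classes)

# Example: two tasks favour different members of a 3-model ensemble.
preds = [np.random.dirichlet(np.ones(5), size=4) for _ in range(3)]
weights = {"retrieval": [0.6, 0.3, 0.1], "qa": [0.2, 0.3, 0.5]}
print(task_aware_ensemble("retrieval", preds, weights).shape)  # (4, 5)
```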
arXiv Detail & Related papers (2021-10-13T03:50:07Z) - Planning to Explore via Self-Supervised World Models [120.31359262226758]
Plan2Explore is a self-supervised reinforcement learning agent.
We present a new approach to self-supervised exploration and fast adaptation to new tasks.
Without any training supervision or task-specific interaction, Plan2Explore outperforms prior self-supervised exploration methods.
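Plan2Explore's exploration signal is usually described as disagreement among an ensemble of learned one-step predictors; a minimal sketch of that intrinsic reward, with illustrative shapes and stand-in predictors, follows.

```python
import torch

def disagreement_reward(ensemble, state, action):
    """Intrinsic reward = variance of an ensemble's next-latent predictions.

    `ensemble` is a list of one-step predictive models (illustrative stand-ins);
    high disagreement marks novel, information-rich states worth exploring.
    """
    inputs = torch.cat([state, action], dim=-1)
    preds = torch.stack([m(inputs) for m in ensemble])  # (K, batch, latent_dim)
    return preds.var(dim=0).mean(dim=-1)                # (batch,)

# Example with random linear predictors as stand-ins for the learned ensemble.
latent_dim, action_dim, batch = 32, 4, 8
ensemble = [torch.nn.Linear(latent_dim + action_dim, latent_dim) for _ in range(5)]
state = torch.randn(batch, latent_dim)
action = torch.randn(batch, action_dim)
print(disagreement_reward(ensemble, state, action).shape)  # torch.Size([8])
```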
arXiv Detail & Related papers (2020-05-12T17:59:45Z) - Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video [27.391434284586985]
Rolling-Unrolling LSTM is a learning architecture to anticipate actions from egocentric videos.
The proposed approach is validated on EPIC-Kitchens, EGTEA Gaze+ and ActivityNet.
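The rolling LSTM summarizes observed frames and the unrolling LSTM, initialized from that summary, steps into the future to anticipate actions; the single-modality sketch below uses illustrative feature sizes and omits the full model's modality attention.

```python
import torch
import torch.nn as nn

class RollingUnrollingSketch(nn.Module):
    """Minimal rolling/unrolling LSTM pair for action anticipation (single modality)."""
    def __init__(self, feat_dim=1024, hidden=512, num_actions=100):
        super().__init__()
        self.rolling = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.unrolling = nn.LSTMCell(feat_dim, hidden)
        self.classifier = nn.Linear(hidden, num_actions)

    def forward(self, past_feats, anticipation_steps=4):
        # Rolling LSTM summarizes the observed past.
        _, (h, c) = self.rolling(past_feats)
        h, c = h.squeeze(0), c.squeeze(0)
        last = past_feats[:, -1]
        logits = []
        # Unrolling LSTM steps into the future from the rolled-up summary.
        for _ in range(anticipation_steps):
            h, c = self.unrolling(last, (h, c))
            logits.append(self.classifier(h))
        return torch.stack(logits, dim=1)  # (batch, steps, num_actions)

model = RollingUnrollingSketch()
print(model(torch.randn(2, 14, 1024)).shape)  # torch.Size([2, 4, 100])
```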
arXiv Detail & Related papers (2020-05-04T14:13:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.