A Solution to CVPR'2023 AQTC Challenge: Video Alignment for Multi-Step Inference
- URL: http://arxiv.org/abs/2306.14412v1
- Date: Mon, 26 Jun 2023 04:19:33 GMT
- Title: A Solution to CVPR'2023 AQTC Challenge: Video Alignment for Multi-Step Inference
- Authors: Chao Zhang, Shiwei Wu, Sirui Zhao, Tong Xu, Enhong Chen
- Abstract summary: Affordance-centric Question-driven Task Completion (AQTC) for Egocentric Assistant introduces a groundbreaking scenario.
We present a solution for enhancing video alignment to improve multi-step inference.
Our method secured the 2nd place in CVPR'2023 AQTC challenge.
- Score: 51.26551806938455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Affordance-centric Question-driven Task Completion (AQTC) for Egocentric
Assistant introduces a groundbreaking scenario. In this scenario, by learning
from instructional videos, AI assistants provide users with step-by-step
guidance on operating devices. In this paper, we present a solution for
enhancing video alignment to improve multi-step inference. Specifically, we
first utilize VideoCLIP to generate video-script alignment features.
Afterwards, we ground the question-relevant content in instructional videos.
Then, we reweight the multimodal context to emphasize prominent features.
Finally, we adopt GRU to conduct multi-step inference. Through comprehensive
experiments, we demonstrate the effectiveness and superiority of our method,
which secured the 2nd place in CVPR'2023 AQTC challenge. Our code is available
at https://github.com/zcfinal/LOVEU-CVPR23-AQTC.
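To make the pipeline above concrete, the following is a minimal sketch of the four steps (VideoCLIP-style alignment features as input, question grounding, context reweighting, and GRU-based multi-step inference). The module choices, feature dimensions, and the answer-scoring head are illustrative assumptions rather than the released implementation; see the repository linked above for the actual code.

```python
# Hypothetical sketch of the four-step pipeline described in the abstract.
# Dimensions, module choices, and the scoring head are assumptions, not the
# authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiStepAQTC(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.ground = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.reweight = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.scorer = nn.Bilinear(dim, dim, 1)

    def forward(self, video_feats, question_feat, answer_feats):
        """
        video_feats:   (B, T, D)    VideoCLIP video-script alignment features (precomputed)
        question_feat: (B, D)       encoded question
        answer_feats:  (B, S, A, D) candidate answers for each of S steps
        returns:       (B, S, A)    per-step answer scores
        """
        # Steps 1-2: ground question-relevant content in the instructional video
        # via cross-attention from the question to the video tokens.
        q = question_feat.unsqueeze(1)                           # (B, 1, D)
        grounded, _ = self.ground(q, video_feats, video_feats)   # (B, 1, D)

        # Step 3: reweight the multimodal context so prominent features dominate.
        gate = self.reweight(video_feats)                        # (B, T, 1)
        context = (gate * video_feats).mean(dim=1)               # (B, D)
        context = context + grounded.squeeze(1)                  # fuse grounded signal

        # Step 4: GRU-based multi-step inference; the hidden state carries
        # information from previous steps to the current one.
        B, S, A, D = answer_feats.shape
        h = context.unsqueeze(0)                                 # (1, B, D)
        step_in = q                                              # start from the question
        scores = []
        for s in range(S):
            out, h = self.gru(step_in, h)                        # (B, 1, D)
            step_scores = self.scorer(
                out.expand(-1, A, -1).reshape(B * A, D),
                answer_feats[:, s].reshape(B * A, D),
            ).view(B, A)                                         # (B, A)
            scores.append(step_scores)
            # feed the (soft) selected answer back as the next-step input
            weights = F.softmax(step_scores, dim=-1).unsqueeze(-1)       # (B, A, 1)
            step_in = (weights * answer_feats[:, s]).sum(1, keepdim=True)
        return torch.stack(scores, dim=1)                        # (B, S, A)
```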
Related papers
- Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment [53.12952107996463]
This work proposes a novel training framework for learning to localize temporal boundaries of procedure steps in training videos.
Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps from narrations.
To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy.
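As a rough illustration of the pseudo-matching idea, the sketch below fuses two hypothetical alignment pathways (steps-to-video and steps-to-narration-to-video) into binary pseudo-labels; the pathways, fusion weights, and threshold are assumptions, not the MPTVA implementation.

```python
# Hypothetical sketch of multi-pathway pseudo-matching between LLM-summarized
# steps and video segments. The pathways, fusion rule, and threshold are
# illustrative assumptions only.
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a (N, D) and b (M, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def pseudo_match(step_emb, video_emb, narration_emb, narr_to_video, thresh=0.5):
    """
    step_emb:      (S, D) embeddings of LLM-summarized procedure steps
    video_emb:     (T, D) embeddings of video segments
    narration_emb: (N, D) embeddings of ASR narrations
    narr_to_video: (N, T) narration-to-segment alignment (e.g. from timestamps)
    returns:       (S, T) binary pseudo-matching matrix
    """
    # Pathway 1: match steps to video segments directly.
    direct = cosine_sim(step_emb, video_emb)                              # (S, T)
    # Pathway 2: match steps to narrations, then follow narration timestamps.
    via_narration = cosine_sim(step_emb, narration_emb) @ narr_to_video   # (S, T)
    # Fuse the pathways; keep pseudo-labels only where the evidence is strong.
    fused = 0.5 * direct + 0.5 * via_narration
    return (fused > thresh).astype(np.float32)
```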
arXiv Detail & Related papers (2024-09-22T18:40:55Z)
- 2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation [12.274092278786966]
Video Panoptic Segmentation (VPS) aims to simultaneously classify, track, and segment all objects in a video.
We propose a robust integrated video panoptic segmentation solution.
Our method achieves state-of-the-art performance with VPQ scores of 56.36 and 57.12 in the development and test phases, respectively.
arXiv Detail & Related papers (2024-06-01T17:03:16Z)
- Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis [5.4598424549754965]
This paper introduces our solution for Track 2 in AI City Challenge 2024.
The task aims to solve traffic safety description and analysis with the dataset of Woven Traffic Safety.
Our solution achieved 6th place in the competition on the test set.
arXiv Detail & Related papers (2024-04-12T04:08:21Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
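A minimal sketch of the data-level idea, CLIP-score-guided frame selection in place of uniform sampling, is shown below; the checkpoint name and top-k policy are assumptions rather than the paper's exact setup.

```python
# Minimal sketch of CLIP-score-guided frame selection (instead of uniform
# sampling), in the spirit of VaQuitA's data-level alignment. The checkpoint
# and top-k policy are assumptions, not the paper's configuration.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_frames(frames, query, k=8):
    """Return the k frames (PIL images) most relevant to the text query."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # (num_frames,)
    topk = scores.topk(min(k, len(frames))).indices
    return [frames[i] for i in sorted(topk.tolist())]          # keep temporal order
```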
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
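The sketch below illustrates how the two named pretraining objectives, masked video modeling and video-language contrastive learning, could be combined into a single loss; the encoders, mask handling, and loss weight are assumptions, not InternVideo's implementation.

```python
# Illustrative combination of a masked-video-modeling reconstruction loss with
# a symmetric video-text contrastive (InfoNCE) loss. Mask ratio, temperature,
# and loss weight are assumptions for illustration.
import torch
import torch.nn.functional as F

def masked_video_loss(decoder_pred, target_patches, mask):
    """MSE reconstruction loss computed only on masked patches (mask: 1 = masked)."""
    loss = (decoder_pred - target_patches) ** 2          # (B, P, D_patch)
    return (loss.mean(-1) * mask).sum() / mask.sum().clamp(min=1)

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE between paired video and text embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    labels = torch.arange(len(v), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

def pretraining_loss(decoder_pred, target_patches, mask, video_emb, text_emb, w=1.0):
    """Joint objective: masked video modeling + weighted video-text contrastive."""
    return masked_video_loss(decoder_pred, target_patches, mask) + \
           w * contrastive_loss(video_emb, text_emb)
```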
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Winning the CVPR'2022 AQTC Challenge: A Two-stage Function-centric Approach [51.424201533529114]
Affordance-centric Question-driven Task Completion for Egocentric Assistant (AQTC) is a novel task which helps an AI assistant learn from instructional videos and scripts and guide the user step-by-step.
We address AQTC via a two-stage Function-centric approach, consisting of a Question2Function Module that grounds the question to the related function and a Function2Answer Module that predicts the action based on the historical steps.
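A hypothetical sketch of this two-stage interface is given below; the class names, feature shapes, and scoring heads are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of the two-stage interface: Question2Function grounds the
# question to a candidate function, and Function2Answer scores the answers at
# the current step given the history. Names and heads are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Question2Function(nn.Module):
    """Score each candidate function (script paragraph + video segment) for a question."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, question_feat, function_feats):
        # question_feat: (B, D); function_feats: (B, F, D) -> (B, F) scores
        q = F.normalize(self.proj(question_feat), dim=-1).unsqueeze(1)
        f = F.normalize(function_feats, dim=-1)
        return (q * f).sum(-1)

class Function2Answer(nn.Module):
    """Predict the answer at the current step from the grounded function and history."""
    def __init__(self, dim=512):
        super().__init__()
        self.history = nn.GRU(dim, dim, batch_first=True)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, function_feat, history_feats, answer_feats):
        # function_feat: (B, D); history_feats: (B, H, D); answer_feats: (B, A, D)
        _, h = self.history(history_feats)                 # (1, B, D)
        state = function_feat + h.squeeze(0)               # (B, D)
        A = answer_feats.size(1)
        pair = torch.cat([state.unsqueeze(1).expand(-1, A, -1), answer_feats], dim=-1)
        return self.score(pair).squeeze(-1)                # (B, A) answer scores
```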
arXiv Detail & Related papers (2022-06-20T07:02:23Z)
- AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant [6.379158555341729]
We define a new task called Affordance-centric Question-driven Task Completion.
The AI assistant should learn from instructional videos and scripts to guide the user step-by-step.
To support the task, we constructed AssistQ, a new dataset comprising 529 question-answer samples.
arXiv Detail & Related papers (2022-03-08T17:07:09Z)
- AssistSR: Affordance-centric Question-driven Video Segment Retrieval [4.047098915826058]
We present a new task called Affordance-centric Question-driven Video Segment Retrieval (AQVSR).
arXiv Detail & Related papers (2021-11-30T01:14:10Z)