YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in
Domain-Specific Videos
- URL: http://arxiv.org/abs/2004.05573v1
- Date: Sun, 12 Apr 2020 09:25:36 GMT
- Title: YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in
Domain-Specific Videos
- Authors: Shizhe Chen, Weiying Wang, Ludan Ruan, Linli Yao, Qin Jin
- Abstract summary: The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark for fine-grained action understanding in domain-specific videos.
We propose two novel question-answering tasks to evaluate models' fine-grained action understanding abilities.
- Score: 60.62475495522428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark
for fine-grained action understanding in domain-specific videos, e.g. makeup
instructional videos. We propose two novel question-answering tasks to evaluate
models' fine-grained action understanding abilities. The first task is
\textbf{Facial Image Ordering}, which aims to understand visual effects of
different actions expressed in natural language to the facial object. The
second task is \textbf{Step Ordering}, which aims to measure cross-modal
semantic alignments between untrimmed videos and multi-sentence texts. In this
paper, we present the challenge guidelines, the dataset used, and performances
of baseline models on the two proposed tasks. The baseline codes and models are
released at \url{https://github.com/AIM3-RUC/YouMakeup_Baseline}.
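Both proposed tasks are posed as ordering problems. As a rough, hypothetical illustration of how such predictions could be scored, the sketch below assumes exact-match accuracy over predicted orders; this is an assumption for illustration only, not the official challenge metric (see the challenge guidelines and the released baseline code for the actual protocol).
```python
# Minimal sketch (not the official evaluation script): scoring ordering-style
# questions by exact-match accuracy, assuming each question asks the model to
# produce a full ordering of candidate images or steps.
from typing import Sequence


def ordering_accuracy(predicted: Sequence[Sequence[int]],
                      gold: Sequence[Sequence[int]]) -> float:
    """Fraction of questions whose predicted order matches the gold order exactly."""
    assert len(predicted) == len(gold)
    correct = sum(1 for p, g in zip(predicted, gold) if list(p) == list(g))
    return correct / max(len(gold), 1)


# Hypothetical example: two Step Ordering questions over four steps each.
pred = [[2, 0, 1, 3], [0, 1, 3, 2]]
gold = [[2, 0, 1, 3], [0, 1, 2, 3]]
print(f"accuracy = {ordering_accuracy(pred, gold):.2f}")  # -> accuracy = 0.50
```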
Related papers
- VideoDistill: Language-aware Vision Distillation for Video Question Answering [24.675876324457747]
We propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.
VideoDistill generates answers only from question-related visual embeddings.
We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-01T07:44:24Z) - SOVC: Subject-Oriented Video Captioning [59.04029220586337]
We propose a new video captioning task, Subject-Oriented Video Captioning (SOVC), which aims to allow users to specify the describing target via a bounding box.
To support this task, we construct two subject-oriented video captioning datasets based on two widely used video captioning datasets.
arXiv Detail & Related papers (2023-12-20T17:44:32Z) - Edit As You Wish: Video Caption Editing with Multi-grained User Control [61.76233268900959]
We propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests.
Inspired by human writing-revision habits, we design the user command as a pivotal triplet (operation, position, attribute) to cover diverse user needs from coarse-grained to fine-grained.
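The triplet command described above can be pictured as a small structured record. The snippet below is a hypothetical rendering of that idea; the field names, value types, and example vocabulary are illustrative assumptions, not the paper's actual schema.
```python
# Hypothetical sketch of a multi-grained user command as an
# (operation, position, attribute) triplet, as summarized for the VCE task above.
from dataclasses import dataclass
from typing import Optional


@dataclass
class EditCommand:
    operation: str             # e.g. "insert", "delete", "replace" (assumed vocabulary)
    position: Optional[int]    # index in the existing caption the edit applies to
    attribute: Optional[str]   # the content or property the edit should introduce


# Example: ask the editor to replace the phrase at position 3 with a new attribute.
cmd = EditCommand(operation="replace", position=3, attribute="red lipstick")
print(cmd)
```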
arXiv Detail & Related papers (2023-05-15T07:12:19Z) - Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - SceneGATE: Scene-Graph based co-Attention networks for TExt visual
question answering [2.8974040580489198]
The paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA.
It reveals the semantic relations among the objects, Optical Character Recognition (OCR) tokens and the question words.
This is achieved via a TextVQA-based scene graph that discovers the underlying semantics of an image.
arXiv Detail & Related papers (2022-12-16T05:10:09Z) - Exploiting Feature Diversity for Make-up Temporal Video Grounding [15.358540603177547]
This report presents the 3rd winning solution for MTVG, a new task introduced in the 4th Person in Context (PIC) Challenge at ACM MM 2022.
MTVG aims at localizing the temporal boundary of the step in an untrimmed video based on a textual description.
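Temporal grounding of this kind is typically judged by the overlap between a predicted segment and the annotated one. The sketch below shows a generic temporal IoU computation, the usual building block for metrics such as R@1 with IoU >= m; it is an illustration, not necessarily the exact metric used in the PIC MTVG evaluation.
```python
# Minimal sketch: temporal IoU between a predicted (start, end) segment and a
# ground-truth segment, both given in seconds.

def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU of two (start, end) segments."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0


# Hypothetical example: predicted step boundary vs. annotated boundary.
print(round(temporal_iou((12.0, 20.0), (15.0, 25.0)), 3))  # -> 0.385
```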
arXiv Detail & Related papers (2022-08-12T09:03:25Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z) - Scene Graph Reasoning for Visual Question Answering [23.57543808056452]
We propose a novel method that approaches the task by performing context-driven, sequential reasoning based on the objects and their semantic and spatial relationships present in the scene.
A reinforcement agent then learns to autonomously navigate over the extracted scene graph to generate paths, which are then the basis for deriving answers.
arXiv Detail & Related papers (2020-07-02T13:02:54Z) - Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene
Text [93.08109196909763]
We propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN).
It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively.
It then introduces three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities.
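The aggregation step described above, where nodes of one sub-graph gather context from another sub-graph before updating their features, can be sketched roughly as follows. This is an illustrative approximation in plain NumPy with made-up dimensions and a generic attention form, not the authors' MM-GNN implementation.
```python
# Rough sketch of cross-graph aggregation: nodes of a target sub-graph (e.g.
# semantic/OCR nodes) attend over nodes of a source sub-graph (e.g. visual
# nodes) and fold the attended context into their own features.
import numpy as np


def cross_graph_aggregate(target: np.ndarray, source: np.ndarray) -> np.ndarray:
    """target: (n_t, d) and source: (n_s, d) node features; returns updated (n_t, d)."""
    scores = target @ source.T / np.sqrt(target.shape[1])    # (n_t, n_s) affinities
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                   # softmax over source nodes
    context = attn @ source                                    # (n_t, d) aggregated messages
    return target + context                                    # residual update


rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 16))     # hypothetical visual sub-graph nodes
semantic = rng.normal(size=(3, 16))   # hypothetical semantic sub-graph nodes
print(cross_graph_aggregate(semantic, visual).shape)  # (3, 16)
```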
arXiv Detail & Related papers (2020-03-31T05:56:59Z)