QuAVF: Quality-aware Audio-Visual Fusion for Ego4D Talking to Me Challenge
- URL: http://arxiv.org/abs/2306.17404v1
- Date: Fri, 30 Jun 2023 05:14:45 GMT
- Title: QuAVF: Quality-aware Audio-Visual Fusion for Ego4D Talking to Me Challenge
- Authors: Hsi-Che Lin, Chien-Yi Wang, Min-Hung Chen, Szu-Wei Fu, Yu-Chiang Frank Wang
- Abstract summary: This report describes our submission to the Ego4D Talking to Me (TTM) Challenge 2023.
We propose to use two separate models to process the input videos and audio.
With the simple architecture design, our model achieves 67.4% mean average precision (mAP) on the test set.
- Score: 35.08570071278399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This technical report describes our QuAVF@NTU-NVIDIA submission to the Ego4D
Talking to Me (TTM) Challenge 2023. Based on the observation from the TTM task
and the provided dataset, we propose to use two separate models to process the
input videos and audio. By doing so, we can utilize all the labeled training
data, including those without bounding box labels. Furthermore, we leverage the
face quality score from a facial landmark prediction model for filtering noisy
face input data. The face quality score is also employed in our proposed
quality-aware fusion for integrating the results from two branches. With the
simple architecture design, our model achieves 67.4% mean average precision
(mAP) on the test set, which ranks first on the leaderboard and outperforms the
baseline method by a large margin. Code is available at:
https://github.com/hsi-che-lin/Ego4D-QuAVF-TTM-CVPR23
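The quality-aware fusion described in the abstract can be pictured with a short sketch. The following Python snippet is a minimal, hypothetical rendering of the idea: per-frame predictions from the visual and audio branches are combined as a convex combination weighted by the face quality score, with low-quality faces falling back to the audio branch. The function name, the convex-combination form, and the q_threshold parameter are illustrative assumptions, not the authors' exact formulation (see the linked repository for that).

```python
import numpy as np

def quality_aware_fusion(p_visual, p_audio, face_quality, q_threshold=0.5):
    """Hypothetical sketch of quality-aware audio-visual fusion.

    p_visual, p_audio: per-frame talking-to-me probabilities from the
        visual and audio branches, respectively.
    face_quality: per-frame face quality score in [0, 1], e.g. derived
        from a facial landmark prediction model as in the paper.
    q_threshold: assumed cutoff below which the face crop is treated as
        too noisy and the prediction falls back to the audio branch.
    """
    p_visual = np.asarray(p_visual, dtype=float)
    p_audio = np.asarray(p_audio, dtype=float)
    q = np.clip(np.asarray(face_quality, dtype=float), 0.0, 1.0)

    # Filter noisy faces: below the threshold, ignore the visual branch.
    w = np.where(q >= q_threshold, q, 0.0)

    # Quality-weighted convex combination of the two branch predictions.
    return w * p_visual + (1.0 - w) * p_audio
```

Under this sketch, a frame with face quality 0.9 is scored mainly by the visual branch, while a heavily occluded face with quality 0.2 is scored by the audio branch alone.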
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)
- Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022 [22.299810960572348]
We propose a video-language pretraining solution (EgoVLP) for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge.
Our best single model obtains strong performance on the challenge test set with 47.39% mAP and 61.44% nDCG.
arXiv Detail & Related papers (2022-07-04T11:32:48Z)
- Blind Face Restoration: Benchmark Datasets and a Baseline Model [63.053331687284064]
Blind Face Restoration (BFR) aims to construct a high-quality (HQ) face image from its corresponding low-quality (LQ) input.
We first synthesize two blind face restoration benchmark datasets called EDFace-Celeb-1M (BFR128) and EDFace-Celeb-150K (BFR512).
State-of-the-art methods are benchmarked on them under five settings: blur, noise, low resolution, JPEG compression artifacts, and the combination of all of them (full degradation).
arXiv Detail & Related papers (2022-06-08T06:34:24Z)
- Ask2Mask: Guided Data Selection for Masked Speech Modeling [25.716834361963468]
Masked speech modeling (MSM) methods learn representations over speech frames which are randomly masked within an utterance.
They treat all unsupervised speech samples with equal weight, which hinders learning, since not every sample carries information relevant for learning meaningful representations.
We propose ask2mask (ATM), a novel approach to focus on specific samples during MSM pre-training.
arXiv Detail & Related papers (2022-02-24T17:34:54Z)
- An Empirical Study of Vehicle Re-Identification on the AI City Challenge [19.13038665501964]
Track 2 is a vehicle re-identification (ReID) task with both real-world and synthetic data.
In this challenge, we mainly focus on four points: training data, unsupervised domain-adaptive (UDA) training, post-processing, and model ensembling.
With the aforementioned techniques, our method achieves a 0.7445 mAP score, yielding first place in the competition.
arXiv Detail & Related papers (2021-05-20T12:20:52Z)
- AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and Baseline Methods [6.320141734801679]
We present a new large-scale multi-modal Turkish Sign Language dataset (AUTSL) with a benchmark.
Our dataset consists of 226 signs performed by 43 different signers and 38,336 isolated sign video samples.
We trained several deep-learning-based models and provide empirical evaluations on the benchmark.
arXiv Detail & Related papers (2020-08-03T15:12:05Z)