Frame Aggregation and Multi-Modal Fusion Framework for Video-Based
Person Recognition
- URL: http://arxiv.org/abs/2010.09290v2
- Date: Wed, 30 Dec 2020 09:01:06 GMT
- Title: Frame Aggregation and Multi-Modal Fusion Framework for Video-Based
Person Recognition
- Authors: Fangtao Li, Wenzhe Wang, Zihe Liu, Haoran Wang, Chenghao Yan, Bin Wu
- Abstract summary: We propose a Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition.
FAMF aggregates face features and incorporates them with multi-modal information to identify persons in videos.
We show that introducing an attention mechanism to NetVLAD can effectively decrease the impact of low-quality frames.
- Score: 13.875674649636874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based person recognition is challenging due to persons being blocked
and blurred, and the variation of shooting angle. Previous research always
focused on person recognition on still images, ignoring similarity and
continuity between video frames. To tackle the challenges above, we propose a
novel Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based
person recognition, which aggregates face features and incorporates them with
multi-modal information to identify persons in videos. For frame aggregation,
we propose a novel trainable layer based on NetVLAD (named AttentionVLAD),
which takes arbitrary number of features as input and computes a fixed-length
aggregation feature based on feature quality. We show that introducing an
attention mechanism to NetVLAD can effectively decrease the impact of
low-quality frames. For the multi-model information of videos, we propose a
Multi-Layer Multi-Modal Attention (MLMA) module to learn the correlation of
multi-modality by adaptively updating Gram matrix. Experimental results on
iQIYI-VID-2019 dataset show that our framework outperforms other
state-of-the-art methods.
Related papers
- A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding [76.44979557843367]
We propose a novel multi-view stereo (MVS) framework that gets rid of the depth range prior.
We introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information.
We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image.
arXiv Detail & Related papers (2024-11-04T08:50:16Z) - Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation [11.611045114232187]
Recent methods only conduct view synthesis between existing camera views, leading to insufficient guidance.
We try to synthesize more virtual camera views by flow-based video frame making (VFI)
For multi-frame inference, to sidestep the problem of dynamic objects encountered by explicit geometry-based methods like ManyDepth, we return to the feature fusion paradigm.
We construct a unified self-supervised learning framework, named Mono-ViFI, to bilaterally connect single- and multi-frame depth.
arXiv Detail & Related papers (2024-07-19T08:51:51Z) - Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tune data.
We present a novel alignment strategy that employs multimodal AI system to oversee itself called Reinforcement Learning from AI Feedback (RLAIF)
In specific, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for
Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form
Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
Multimodal Frame-Scoring Transformer (MFST) framework exploiting visual, text and audio features and scoring a video with respect to frames.
MFST framework first extracts each modality features (visual-text-audio) using pretrained encoders.
MFST trains the multimodal frame-scoring transformer that uses video-text-audio representations as inputs and predicts frame-level scores.
arXiv Detail & Related papers (2022-07-05T05:14:15Z) - Multilevel Hierarchical Network with Multiscale Sampling for Video
Question Answering [16.449212284367366]
We propose a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA.
MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR)
With a multiscale sampling, RMI iterates the interaction of appearance-motion information at each scale and the question embeddings to build the multilevel question-guided visual representations.
PVR infers the visual cues at each level in parallel to fit with answering different question types that may rely on the visual information at relevant levels.
arXiv Detail & Related papers (2022-05-09T06:28:56Z) - DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video
Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence.
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z) - Temporal Context Aggregation for Video Retrieval with Contrastive
Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model is also comprised of dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and which pass more relevant information to gates.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.