Mask to reconstruct: Cooperative Semantics Completion for Video-text
Retrieval
- URL: http://arxiv.org/abs/2305.07910v1
- Date: Sat, 13 May 2023 12:31:37 GMT
- Title: Mask to reconstruct: Cooperative Semantics Completion for Video-text
Retrieval
- Authors: Han Fang and Zhifei Yang and Xianghao Zang and Chao Ban and Hao Sun
- Abstract summary: We present Mask for Semantics Completion (MASCOT), based on semantic-based masked modeling.
MASCOT achieves state-of-the-art performance on four major text-video retrieval benchmarks.
- Score: 19.61947785487129
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, masked video modeling has been widely explored and significantly
improved the model's understanding ability of visual regions at a local level.
However, existing methods usually adopt random masking and follow the same
reconstruction paradigm to complete the masked regions, which do not leverage
the correlations between cross-modal content. In this paper, we present Mask
for Semantics Completion (MASCOT) based on semantic-based masked modeling.
Specifically, after applying attention-based video masking to generate
high-informed and low-informed masks, we propose Informed Semantics Completion
to recover masked semantics information. The recovery mechanism is achieved by
aligning the masked content with the unmasked visual regions and corresponding
textual context, which makes the model capture more text-related details at a
patch level. Additionally, we shift the emphasis of reconstruction from
irrelevant backgrounds to discriminative parts, ignoring regions covered by
low-informed masks. Furthermore, we design dual-mask co-learning to incorporate
video cues under different masks and learn more aligned video representation.
Our MASCOT achieves state-of-the-art performance on four major text-video
retrieval benchmarks, including MSR-VTT, LSMDC, ActivityNet, and DiDeMo.
Extensive ablation studies demonstrate the effectiveness of the proposed
schemes.
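The attention-based masking step described in the abstract can be illustrated with a minimal sketch: rank patches by an attention score (for example, [CLS]-to-patch attention averaged over heads) and split them into high-informed and low-informed masks. The function below is a hypothetical illustration of this ranking idea under those assumptions, not the authors' implementation.

```python
import numpy as np

def attention_based_masks(attn_scores, mask_ratio=0.5):
    """Split patches into high-informed and low-informed masks.

    attn_scores: (num_patches,) attention weights per patch.
    The most-attended patches form the high-informed mask; the
    least-attended patches form the low-informed mask.
    """
    num_patches = attn_scores.shape[0]
    num_masked = int(num_patches * mask_ratio)
    order = np.argsort(attn_scores)            # ascending by score
    high_informed = np.zeros(num_patches, dtype=bool)
    low_informed = np.zeros(num_patches, dtype=bool)
    high_informed[order[-num_masked:]] = True  # top-scoring patches
    low_informed[order[:num_masked]] = True    # bottom-scoring patches
    return high_informed, low_informed

# Toy example: 6 patches, mask half of them on each side
scores = np.array([0.05, 0.30, 0.10, 0.40, 0.15, 0.00])
hi, lo = attention_based_masks(scores, mask_ratio=0.5)
```

Reconstruction can then be weighted toward the high-informed set while the low-informed (background-heavy) set is down-weighted, matching the emphasis shift the abstract describes.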
Related papers
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- AnatoMask: Enhancing Medical Image Segmentation with Reconstruction-guided Self-masking [5.844539603252746]
Masked image modeling (MIM) has shown effectiveness by reconstructing randomly masked images to learn detailed representations.
We propose AnatoMask, a novel MIM method that leverages reconstruction loss to dynamically identify and mask out anatomically significant regions.
arXiv Detail & Related papers (2024-07-09T00:15:52Z)
- Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition [56.968108142307976]
We propose a novel approach called Class-Aware Mask-guided feature refinement (CAM).
Our approach introduces canonical class-aware glyph masks to suppress background and text style noise.
By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion.
arXiv Detail & Related papers (2024-02-21T09:22:45Z)
- Automatic Generation of Semantic Parts for Face Image Synthesis [7.728916126705043]
We describe a network architecture to address the problem of automatically manipulating or generating the shape of object classes in semantic segmentation masks.
Our proposed model allows embedding the mask class-wise into a latent space where each class embedding can be independently edited.
We report quantitative and qualitative results on the Celeb-MaskHQ dataset, which show our model can both faithfully reconstruct and modify a segmentation mask at the class level.
arXiv Detail & Related papers (2023-07-11T15:01:42Z)
- Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
arXiv Detail & Related papers (2023-05-23T17:59:46Z)
- MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency [120.9499803967496]
We propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve the representative structured points.
Our method can concentrate on modeling regional geometry and enjoy less ambiguity for masked reconstruction.
By combining informative-preserved reconstruction on masked areas and consistency self-distillation from unmasked areas, a unified framework called MM-3DScene is yielded.
arXiv Detail & Related papers (2022-12-20T01:53:40Z)
- Masked Contrastive Pre-Training for Efficient Video-Text Retrieval [37.05164804180039]
We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC).
Our MAC aims to reduce video representation's spatial and temporal redundancy in the VidLP model.
Coupling these designs enables efficient end-to-end pre-training: reduce FLOPs (60% off), accelerate pre-training (by 3x), and improve performance.
arXiv Detail & Related papers (2022-12-02T05:44:23Z)
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
- Contrastive Context-Aware Learning for 3D High-Fidelity Mask Face Presentation Attack Detection [103.7264459186552]
Face presentation attack detection (PAD) is essential to secure face recognition systems.
Most existing 3D mask PAD benchmarks suffer from several drawbacks.
We introduce a large-scale High-Fidelity Mask dataset to bridge the gap to real-world applications.
arXiv Detail & Related papers (2021-04-13T12:48:38Z)
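Several entries above (e.g., ColorMAE) generate binary masks by filtering random noise rather than sampling patches independently. As a hedged sketch of that general idea, the snippet below box-blurs uniform noise and thresholds it at a percentile, yielding spatially clustered masked regions instead of salt-and-pepper masking; the function name and parameters are illustrative assumptions, not any paper's actual API.

```python
import numpy as np

def filtered_noise_mask(h, w, kernel=3, mask_ratio=0.75, seed=None):
    """Binary mask from low-pass-filtered random noise.

    A box blur over i.i.d. uniform noise produces spatially
    correlated values; thresholding at the mask_ratio quantile
    marks roughly that fraction of positions as masked (True).
    """
    rng = np.random.default_rng(seed)
    noise = rng.random((h, w))
    pad = kernel // 2
    padded = np.pad(noise, pad, mode="edge")
    smooth = np.zeros_like(noise)
    for dy in range(kernel):          # separable-free naive box blur
        for dx in range(kernel):
            smooth += padded[dy:dy + h, dx:dx + w]
    smooth /= kernel * kernel
    thresh = np.quantile(smooth, mask_ratio)
    return smooth < thresh            # True = masked position

# Example: mask ~75% of a 14x14 patch grid with clustered regions
mask = filtered_noise_mask(14, 14, kernel=3, mask_ratio=0.75, seed=0)
```

Larger kernels produce larger contiguous masked blobs, which is the key difference from per-patch random masking.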
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.